I have successfully compiled the extraction-framework and run the download
for the English Wikipedia.

However, when I run the extraction, I get the following error:
################################################################
....
Caused by: java.io.IOException: failed to list files in [E:\project\gsoc2014\wikipedia\commonswiki]
        at org.dbpedia.extraction.util.RichFile.names(RichFile.scala:44)
        at org.dbpedia.extraction.util.RichFile.names(RichFile.scala:39)
        at org.dbpedia.extraction.util.Finder.dates(Finder.scala:52)
        at org.dbpedia.extraction.dump.extract.ConfigLoader.latestDate(ConfigLoader.scala:196)
....
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:46 h
[INFO] Finished at: 2014-03-15T06:31:41-05:00
[INFO] Final Memory: 10M/231M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.6:run (default-cli) on project dump: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: -10000 (Exit value: -10000) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
###########################################################

After the failure, there is only one output file under the dataset folder:
enwiki-20140304-template-redirects.obj


In addition, I used the following config parameters for the extraction:
base-dir=E:/project/gsoc2014/wikipedia
source=pages-articles.xml.bz2
languages=en

extractors.en=.MappingExtractor,.DisambiguationExtractor,.HomepageExtractor,.ImageExtractor,\
.PersondataExtractor,.PndExtractor,.TopicalConceptsExtractor,.FlickrWrapprLinkExtractor
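
For reference, here is a small sanity check I put together while looking at the IOException above. It is only my own sketch: judging from the stack trace (Finder.dates / ConfigLoader.latestDate) I am assuming the framework expects per-wiki date directories such as <base-dir>/enwiki/20140304/, and my guess is that the commonswiki directory is needed because of the ImageExtractor. The script just lists what my base-dir actually contains before a run:

import java.io.File

object CheckLayout {
  def main(args: Array[String]): Unit = {
    val baseDir = new File("E:/project/gsoc2014/wikipedia")
    // Wikis I think the configured extractors need; commonswiki is my assumption.
    for (wiki <- Seq("enwiki", "commonswiki")) {
      val dir = new File(baseDir, wiki)
      // listFiles returns null when the directory does not exist, hence the Option wrapper
      val dates = Option(dir.listFiles())
        .map(_.filter(_.isDirectory).map(_.getName).sorted.toSeq)
        .getOrElse(Seq.empty)
      if (dates.isEmpty)
        println(s"$wiki: missing or empty - this seems to be what triggers 'failed to list files'")
      else
        println(s"$wiki: date directories found: ${dates.mkString(", ")}")
    }
  }
}

If commonswiki really has to be present, that would explain the exception, since my download step only covered the English Wikipedia.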

Here are my questions:
1. Do different languages have different extractors?
2. Is the default source parameter "pages-articles.xml.bz2"? When I omit
this line, I get an exception saying that pages-articles.xml was not found.
3. How many hours does it take to run the extraction for English only, and
how long for all languages?
4. How much disk space do I need to store all the data?
5. How can I debug an extractor? Testing on the whole Wikipedia dump while
debugging is impractical because it is far too slow (see the sketch after
this list for the kind of sampling I have in mind).
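
To make question 5 more concrete, this is the kind of helper I was thinking of writing myself: it copies only the first N pages of the full dump into a small sample dump that an extractor can be tested against. It is just a sketch; it assumes Apache Commons Compress is on the classpath, and the file paths are guesses based on my own setup:

import java.io.{FileInputStream, FileOutputStream, PrintWriter}
import org.apache.commons.compress.compressors.bzip2.{BZip2CompressorInputStream, BZip2CompressorOutputStream}
import scala.io.Source

object SampleDump {
  def main(args: Array[String]): Unit = {
    val maxPages = 1000  // keep only the first 1000 <page> elements
    val in = new BZip2CompressorInputStream(
      new FileInputStream("E:/project/gsoc2014/wikipedia/enwiki/20140304/enwiki-20140304-pages-articles.xml.bz2"))
    val out = new PrintWriter(new BZip2CompressorOutputStream(
      new FileOutputStream("enwiki-20140304-pages-articles-sample.xml.bz2")))

    val lines = Source.fromInputStream(in, "UTF-8").getLines()
    var pages = 0
    // copy lines until the requested number of pages has been written
    while (lines.hasNext && pages < maxPages) {
      val line = lines.next()
      out.println(line)
      if (line.trim == "</page>") pages += 1
    }
    out.println("</mediawiki>")  // close the root element so the sample stays well-formed XML
    out.close()
    in.close()
  }
}

The idea would be to point base-dir at a separate folder containing only this sample and run the extraction against that while debugging. If there is an official way to do this, I would prefer it.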


-- 
Wencan Luo
CS Department- Univ. of Pittsburgh
210 S. Bouquet Street
6501 Sennott Square
Pittsburgh, PA 15260
E-mail: wencanluo...@gmail.com or wen...@cs.pitt.edu