Hi Tommy,

I knew about MAVEN_OPTS. I tried that, but as I mentioned, the flags are not being passed on to the child process being spawned. It turns out the heap size is hardcoded in the pom.xml of the dump directory.
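For anyone else hitting this: the culprit is the launcher configuration of the maven-scala-plugin in dump/pom.xml. It looks roughly like the snippet below (a sketch from memory, so the exact launcher id and heap value may differ); raising the <jvmArg> there is what should take effect, because MAVEN_OPTS is not forwarded to the JVM that the plugin forks:

    <plugin>
      <groupId>org.scala-tools</groupId>
      <artifactId>maven-scala-plugin</artifactId>
      <configuration>
        <launchers>
          <launcher>
            <id>Extraction</id>
            <mainClass>org.dbpedia.extraction.dump.Extract</mainClass>
            <jvmArgs>
              <!-- heap size is pinned here; raise it (e.g. -Xmx8g), since
                   MAVEN_OPTS/JAVA_OPTS never reach this forked process -->
              <jvmArg>-Xmx1024m</jvmArg>
            </jvmArgs>
          </launcher>
        </launchers>
      </configuration>
    </plugin>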
I too was thinking of using partial Wikipedia files as input. The problem is that the input and output mechanism is sort of hardcoded: it expects a single file per language, e.g. input/en/20111107/enwiki-20111107-pages-articles.xml. Now I have two options. If I don't want to make any changes to the code, I could run the framework multiple times, once for each partial file, but then the outputs would end up in a different folder for each file. Or I could make the framework pick up all the files in a folder and also collate the outputs in a single place, but that would entail changes to the code (see the sketch below). Is there a simple way in the DBpedia Extraction Framework itself to pick up multiple files from one directory and collate the results? I can't seem to find one. As per my understanding, I would need to change the ConfigLoader class. Have you tried either of these?
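If it does come to code changes, the shape of what I have in mind is roughly the following. This is only a sketch: MultiFileInput and dumpFiles are names I made up, and the commented-out calls stand in for whatever the framework actually uses to build a source and run a job.

    import java.io.File

    // Sketch: instead of resolving exactly one <lang>wiki-<date>-pages-articles.xml,
    // collect every split dump file in the date directory and feed them all to a
    // single extraction job, so the results are collated in one output location.
    object MultiFileInput {

      // List all split dump files, e.g. enwiki-<date>-pages-articles1.xml, ...2.xml
      def dumpFiles(dateDir: File): Seq[File] =
        Option(dateDir.listFiles()).getOrElse(Array.empty[File])
          .filter(f => f.isFile && f.getName.contains("pages-articles"))
          .sortBy(_.getName)
          .toSeq

      def main(args: Array[String]): Unit = {
        val dateDir = new File("input/en/20111107")
        for (dump <- dumpFiles(dateDir)) {
          println("would extract: " + dump.getName)
          // val source = XMLSource.fromFile(dump) // assumed framework call
          // job.run(source, sharedDestination)    // one shared output destination
        }
      }
    }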
Thanks and Regards,
Amit


On 12/1/11 11:22 PM, "Tommy Chheng" <tommy.chh...@gmail.com> wrote:

> When using mvn scala:run, use MAVEN_OPTS=-Xmx rather than JAVA_OPTS
>
> The dump also comes in 27 files rather than one big one. You can use
> these alternatively.
>
> --
> @tommychheng
> qwiki.com
>
>
> On Wed, Nov 30, 2011 at 11:01 PM, Amit Kumar <amitk...@yahoo-inc.com> wrote:
>>
>> Hi Pablo,
>> Thanks for your valuable input. I got the MediaWiki thing working and am
>> able to run the abstract extractor as well.
>>
>> The extraction framework works well for a small sample dataset, e.g.
>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
>> which has around 6300 entries. But when I try to run the framework on the
>> full Wikipedia data (en, around 33GB uncompressed), I get Java heap space
>> errors.
>>
>> --------------------------------------
>> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>>     at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>>     at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>>     at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:43)
>>     at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:48)
>>     at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:48)
>>     at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:34)
>>     at scala.collection.Iterator$class.foreach(Iterator.scala:652)
>>     at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:333)
>>     at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
>>     at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
>>     at org.dbpedia.extraction.dump.Extract$ExtractionThread.run(Extract.scala:34)
>> Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
>>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>     at scala.collection.immutable.List.foreach(List.scala:45)
>>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>     at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>>     at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>>     at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:35)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>     at scala.collection.immutable.List.foreach(List.scala:45)
>>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>>     at scala.Option.foreach(Option.scala:198)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>     at scala.collection.immutable.List.foreach(List.scala:45)
>>     at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>>     at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>     at scala.collection.immutable.List.foreach(List.scala:45)
>>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
>>
>> There are also several instances of GC overhead limit errors:
>> Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>     at scala.collection.immutable.List.foreach(List.scala:45)
>>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>     at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>>     at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>>     at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>     at scala.collection.immutable.List.foreach(List.scala:45)
>>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>     at scala.collection.immutable.List.foreach(List.scala:45)
>>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>>     at scala.Option.foreach(Option.scala:198)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>     at scala.collection.immutable.List.foreach(List.scala:45)
>>     at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>>     at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>>
>> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
>> SEVERE: Error reading pages. Shutting down...
>> java.lang.OutOfMemoryError: Java heap space
>>     at java.util.Arrays.copyOfRange(Arrays.java:3209)
>>     at java.lang.String.<init>(String.java:215)
>>     at java.lang.StringBuffer.toString(StringBuffer.java:585)
>>     at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getElementText(XMLStreamReaderImpl.java:859)
>>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:241)
>>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:203)
>>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:159)
>>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:107)
>>     at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:87)
>>     at org.dbpedia.extraction.sources.XMLSource$XMLFileSource.foreach(XMLSource.scala:40)
>>     at org.dbpedia.extraction.dump.ExtractionJob.run(ExtractionJob.scala:54)
>> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
>>
>> I'm trying to run this on both a 32-bit and a 64-bit machine (dev boxes), but
>> to no avail. I'm guessing the default JVM settings are too low for the DEF.
>> It would be great if someone could tell me the minimum memory requirement for
>> the extraction framework. I tried passing JVM options such as -Xmx to the
>> 'mvn scala:run' command, but it seems that the mvn command spawns another
>> process and fails to pass the flags on to the new one. If someone has been
>> able to run the framework, could you please share the details?
>>
>> Also, we are looking into running the framework on Hadoop. Has anyone tried
>> that yet? If so, could you share your experience, and say whether it is
>> really possible to run this on Hadoop without many changes and hacks?
>>
>> Thanks
>> Amit
>>
>>
>> On 11/23/11 2:42 PM, "Pablo Mendes" <pablomen...@gmail.com> wrote:
>>
>> Hi Amit,
>> Thanks for your interest in DBpedia. Most of my effort has gone into DBpedia
>> Spotlight, but I can try to help with the DBpedia Extraction Framework as
>> well. Maybe the core developers can chip in if I misrepresent something.
>>
>> 1) [more docs]
>>
>> I am not aware of any.
>>
>> 2) [typo in config]
>>
>> Seems ok.
>>
>> 3) ... Am I right? Does the framework work on any particular dump of
>> Wikipedia? Also what goes in the commons branch?
>>
>> Yes. As far as I can tell, you're right. But there is no particular dump;
>> you just need to follow the convention for the directory structure. The
>> commons directory has a similar structure, see:
>>
>> wikipediaDump/commons/20110729/commonswiki-20110729-pages-articles.xml
>>
>> I think this file is only used by the image extractor and maybe a couple of
>> others. Maybe it should only be mandatory if the corresponding extractors
>> are included in the config. But it's likely nobody got around to
>> implementing that check yet.
>>
>> 4) It seems the AbstractExtractor requires an instance of MediaWiki running
>> to parse MediaWiki syntax. ... Can someone shed some more light on this?
>> What customization is required? Where can I get one?
>>
>> The abstract extractor is used to render inline templates, as many articles
>> start with automatically generated content from templates.
>> See:
>> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/945c24bdc54c/abstractExtraction
>>
>>
>> Also another question: Is there a reason for the delay between subsequent
>> DBpedia releases? I was wondering, if the code is already there, why does it
>> take six months between DBpedia releases? Is there a manual editorial step
>> involved, or is it due to development of/changes to the framework code, which
>> are collated in every release?
>>
>> One reason might be that a lot of the value in DBpedia comes from the
>> manually generated "homogenization" in mappings.dbpedia.org. That, plus
>> getting a stable version of the framework tested and run, would probably
>> explain the choice of periodicity.
>>
>> Best,
>> Pablo
>>
>> On Tue, Nov 22, 2011 at 12:03 PM, Amit Kumar <amitk...@yahoo-inc.com> wrote:
>>
>> Hey everyone,
>> I'm trying to set up the DBpedia extraction framework, as I'm interested in
>> getting structured data from already downloaded Wikipedia dumps. As per my
>> understanding, I need to work in the 'dump' directory of the codebase. I have
>> tried to reverse engineer it (given that Scala is new to me), but I need some
>> help.
>>
>> First of all, is there more detailed documentation somewhere about setting up
>> and running the pipeline? The one available on dbpedia.org seems
>> insufficient.
>> I understand that I need to create a config.properties file first, where I
>> need to set up the input/output locations, the list of extractors and the
>> languages. I tried working with the config.properties.default given in the
>> code. There seems to be a typo in the extractor list:
>> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor' gives
>> a 'class not found' error. I changed it to
>> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractor'. Is that OK?
>> I can't find documentation on how to set up the input directory. Can someone
>> tell me the details? From what I gather, the input directory should contain a
>> 'commons' directory plus a directory for each language set in
>> config.properties. All these directories must have a subdirectory whose name
>> is in YYYYMMDD format. Within that you save the XML files, such as
>> enwiki-20111111-pages-articles.xml. Am I right? Does the framework work on
>> any particular dump of Wikipedia? Also, what goes in the commons branch?
>> I ran the framework by copying a sample dump,
>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2,
>> into both the en and commons branches, unzipping it and renaming it as
>> required. For now I'm working with the en language only. It works with the
>> default 19 extractors, but starts failing if I include the AbstractExtractor.
>> It seems the AbstractExtractor requires an instance of MediaWiki running to
>> parse MediaWiki syntax. From the file itself: "DBpedia-customized MediaWiki
>> instance is required." Can someone shed some more light on this? What
>> customization is required? Where can I get one?
>>
>> Sorry if the questions are too basic and already answered somewhere; I have
>> looked but couldn't find anything myself.
>> Also, another question: Is there a reason for the delay between subsequent
>> DBpedia releases? I was wondering, if the code is already there, why does it
>> take six months between DBpedia releases? Is there a manual editorial step
>> involved, or is it due to development of/changes to the framework code, which
>> are collated in every release?
>>
>> Thanks and regards,
>>
>> Amit
>> Tech Lead
>> Cloud and Platform Group
>> Yahoo!

> --
> @tommychheng
> http://tommy.chheng.com