Hi all, a bug in TableMapping caused these memory problems:
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/rev/77ba9bd157f8

For now I just commented them out; I'll try to actually fix them later.

Regards,
Christopher

On Fri, Dec 30, 2011 at 17:20, Max Jakob <max.ja...@gmail.com> wrote:
> Hi Amit,
>
> sorry for weighing in late on the subject, but I have a suspicion.
>
> The DBpedia parser that parses the MediaWiki markup is not great at
> parsing tables. The reason is that the focus previously was on
> extracting information from infoboxes.
>
> The MappingExtractor class is the only one that attempts to use table
> mappings. If there are mis-parsed table structures, this could lead to
> infinite loops.
>
> All three pages that you mentioned contain tables. There might be
> syntactic constructions that the parser can't cope with at the
> moment.
>
> If you are able to track down the bug, it would be tremendously
> helpful if you could fix it.
>
> Best regards,
> Max
>
> On Wed, Dec 14, 2011 at 08:40, Amit Kumar <amitk...@yahoo-inc.com> wrote:
>> Hi Pablo,
>> I have narrowed down the memory issue that I had been facing. After going
>> unsuccessfully through the whole enwiki dump, I ran the DEF on a smaller
>> dump. I picked
>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles27.xml-p029625017p033928886.bz2
>> After multiple runs and experiments, we found that there are three pages
>> in the dump where the DEF sort of gets stuck and the heap overshoots any
>> limit you give. The three pages are
>>
>> http://en.wikipedia.org/wiki/Mikoyan-Gurevich_MiG-21_variants
>> http://en.wikipedia.org/wiki/List_of_fastest_production_motorcycles
>> http://en.wikipedia.org/wiki/Chevrolet_small-block_engine_table
>>
>> If you skip these three pages (by skipping them in
>> dump/.../ExtractionJob.scala) the framework runs successfully. On further
>> research I found that it is only the MappingExtractor which is causing
>> the problem.
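The page-skipping workaround just described could be sketched roughly as follows. The object and method names here are my own invention, not the framework's API; the real change would be a filter of this shape inside dump/.../ExtractionJob.scala.

```scala
// Hedged sketch of the workaround from this thread: skip the three
// pages on which MappingExtractor runs away, until TableMapping is fixed.
object PageSkipList {
  // Titles of the three problematic pages mentioned above.
  private val skipTitles = Set(
    "Mikoyan-Gurevich MiG-21 variants",
    "List of fastest production motorcycles",
    "Chevrolet small-block engine table"
  )

  /** Returns true if the page should be passed on to the extractors. */
  def shouldExtract(title: String): Boolean = !skipTitles.contains(title)
}
```

Any page whose title is on the list would simply never reach the extractors, which matches the observation that the rest of the 1.5 million pages go through fine.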
>> Once you remove that from the config.properties file, everything works.
>>
>> So, from what we know, among the approximately 1.5 million pages in the
>> smaller dump, the MappingExtractor fails on these three pages, taking the
>> whole JVM with it. I'm attaching three XML files (1 wiki page each). Of
>> these, the DEF will only run on India.xml; for the other two it keeps
>> failing unless you remove the MappingExtractor. There is something about
>> these three pages that is not normal (there would be more in the complete
>> Wikipedia dump). From the source file it looks like the MappingExtractor
>> works on extracting data from infoboxes, and interestingly none of the
>> three pages has an infobox in it. Could this be the reason?
>>
>> Can someone please look into this? I'm wondering how you were able to
>> generate the 3.7 DBpedia dump. Did you skip the MappingExtractor? Or did
>> the problems in the pages get introduced after the 3.7 run? If that is
>> the case, we would need to fix this, as it would definitely fail during
>> the next release.
>>
>> Thanks and Regards
>> Amit
>>
>> On 12/1/11 4:47 PM, "Amit X Kumar" <amitk...@yahoo-inc.com> wrote:
>>
>> Hi Pablo,
>> I figured this out just after sending my email. I'm experimenting with
>> some values right now. I'll let you know if I get it to work. In the
>> meanwhile, if someone already has working values, it would be a big help.
>>
>> Plus, do you know anyone running the DEF on Hadoop?
>>
>> Thanks
>> Amit
>>
>> On 12/1/11 4:39 PM, "Pablo Mendes" <pablomen...@gmail.com> wrote:
>>
>> Hi Amit,
>>
>>> "I tried giving JVM options such as -Xmx to the 'mvn scala:run' command,
>>> but it seems that the mvn command spawns another process and fails to
>>> pass on the flags to the new one. If someone has been able to run the
>>> framework, could you please share the details."
>> The easiest way to get it working is probably to change the value in
>> dump/pom.xml here:
>>
>> <launcher>
>>   <id>Extract</id>
>>   <mainClass>org.dbpedia.extraction.dump.Extract</mainClass>
>>   <jvmArgs>
>>     <jvmArg>-Xmx1024m</jvmArg>
>>   </jvmArgs>
>> </launcher>
>>
>> Cheers,
>> Pablo
>>
>> On Thu, Dec 1, 2011 at 8:01 AM, Amit Kumar <amitk...@yahoo-inc.com> wrote:
>>
>> Hi Pablo,
>> Thanks for your valuable input. I got the MediaWiki thing working and am
>> able to run the abstract extractor as well.
>>
>> The extraction framework works well for a small sample dataset, e.g.
>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
>> which has around 6300 entries. But when I try to run the framework on the
>> full Wikipedia data (en, around 33 GB uncompressed) I get Java heap space
>> errors.
>>
>> --------------------------------------
>> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>>   at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>>   at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>>   at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:43)
>>   at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:48)
>>   at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:48)
>>   at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:34)
>>   at scala.collection.Iterator$class.foreach(Iterator.scala:652)
>>   at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:333)
>>   at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
>>   at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
>>   at org.dbpedia.extraction.dump.Extract$ExtractionThread.run(Extract.scala:34)
>> Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
>>   at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>>   at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>>   at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>   at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>   at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>>   at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>>   at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>>   at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:35)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>>   at scala.Option.foreach(Option.scala:198)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>>   at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>>   at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>   at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
>>
>> There are also several instances of GC overhead limit errors:
>> Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>   at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>>   at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>>   at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>   at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>   at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>>   at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>>   at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>>   at scala.Option.foreach(Option.scala:198)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>>   at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>>   at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>   at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
>> SEVERE: Error reading pages. Shutting down...
>> java.lang.OutOfMemoryError: Java heap space
>>   at java.util.Arrays.copyOfRange(Arrays.java:3209)
>>   at java.lang.String.<init>(String.java:215)
>>   at java.lang.StringBuffer.toString(StringBuffer.java:585)
>>   at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getElementText(XMLStreamReaderImpl.java:859)
>>   at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:241)
>>   at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:203)
>>   at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:159)
>>   at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:107)
>>   at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:87)
>>   at org.dbpedia.extraction.sources.XMLSource$XMLFileSource.foreach(XMLSource.scala:40)
>>   at org.dbpedia.extraction.dump.ExtractionJob.run(ExtractionJob.scala:54)
>> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
>>
>> I'm trying to run this on both 32-bit and 64-bit machines (dev boxes),
>> but to no avail. I'm guessing the default JVM configurations are too low
>> for the DEF. It would be great if someone could tell me the minimum
>> memory requirement for the extraction framework. I tried giving JVM
>> options such as -Xmx to the 'mvn scala:run' command, but it seems that
>> the mvn command spawns another process and fails to pass on the flags to
>> the new one. If someone has been able to run the framework, could you
>> please share the details.
>>
>> Also, we are looking into running the framework over Hadoop. Has anyone
>> tried that yet? If yes, could you share your experience, and whether it
>> is really possible to run this on Hadoop without many changes and hacks.
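Amit's suspicion that mvn spawns a second JVM without the flags can be checked directly by logging the heap ceiling from inside the extraction process. A standalone sketch (the object name is my own; the one-liner could equally be dropped into Extract's main method):

```scala
// Prints the maximum heap the current JVM will use, so you can see
// whether an -Xmx setting (e.g. the <jvmArg> in dump/pom.xml) actually
// reached the forked extraction process.
object HeapCheck {
  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / (1024 * 1024)

  def main(args: Array[String]): Unit =
    println(s"Max heap available to this JVM: ${maxHeapMb} MB")
}
```

If the printed value matches the JVM default rather than the -Xmx you passed on the command line, the flag was consumed by the Maven JVM and never propagated, which is consistent with the behaviour described above.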
>> Thanks
>> Amit
>>
>> On 11/23/11 2:42 PM, "Pablo Mendes" <pablomen...@gmail.com> wrote:
>>
>> Hi Amit,
>> Thanks for your interest in DBpedia. Most of my effort has gone into
>> DBpedia Spotlight, but I can try to help with the DBpedia Extraction
>> Framework as well. Maybe the core developers can chip in if I
>> misrepresent something.
>>
>> 1) [more docs]
>>
>> I am unaware of any.
>>
>> 2) [typo in config]
>>
>> Seems ok.
>>
>> 3) ... Am I right? Does the framework work on any particular dump of
>> Wikipedia? Also, what goes in the commons branch?
>>
>> Yes, as far as I can tell, you're right. But there is no particular
>> dump; you just need to follow the convention for the directory
>> structure. The commons directory has a similar structure, see:
>>
>> wikipediaDump/commons/20110729/commonswiki-20110729-pages-articles.xml
>>
>> I think this file is only used by the image extractor and maybe a
>> couple of others. Maybe it should only be mandatory if the
>> corresponding extractors are included in the config, but it's likely
>> nobody got around to implementing that catch yet.
>>
>> 4) It seems the AbstractExtractor requires an instance of MediaWiki
>> running to parse MediaWiki syntax. ... Can someone shed some more
>> light on this? What customization is required? Where can I get one?
>>
>> The abstract extractor is used to render inline templates, as many
>> articles start with automatically generated content from templates. See:
>>
>> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/945c24bdc54c/abstractExtraction
>>
>> Also another question: Is there a reason for the delay between
>> subsequent DBpedia releases? I was wondering, if the code is already
>> there, why does it take 6 months between DBpedia releases?
Is there a manual editorial step involved, or is it due to
>> development/changes in the framework code which are collated in every
>> release?
>>
>> One reason might be that a lot of the value in DBpedia comes from
>> manually generated "homogenization" in mappings.dbpedia.org. That,
>> plus getting a stable version of the framework tested and run, would
>> probably explain the choice of periodicity.
>>
>> Best,
>> Pablo
>>
>> On Tue, Nov 22, 2011 at 12:03 PM, Amit Kumar <amitk...@yahoo-inc.com> wrote:
>>
>> Hey everyone,
>> I'm trying to set up the DBpedia extraction framework, as I'm
>> interested in getting structured data from already downloaded
>> Wikipedia dumps. As per my understanding, I need to work in the
>> 'dump' directory of the codebase. I have tried to reverse engineer it
>> (given Scala is new to me), but I need some help.
>>
>> First of all, is there more detailed documentation somewhere about
>> setting up and running the pipeline? The one available on dbpedia.org
>> seems insufficient.
>> I understand that I need to create a config.properties file first,
>> where I need to set up input/output locations, the list of extractors
>> and the languages. I tried working with the config.properties.default
>> given in the code. There seems to be a typo in the extractor list:
>> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor'
>> gives a 'class not found' error. I changed it to
>> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractor'. Is that ok?
>> I can't find documentation on how to set up the input directory. Can
>> someone tell me the details? From what I gather, the input directory
>> should contain a 'commons' directory plus a directory for each
>> language set in config.properties.
All these directories must have a subdirectory whose name is in
>> YYYYMMDD format. Within that you save the XML files, such as
>> enwiki-20111111-pages-articles.xml. Am I right? Does the framework
>> work on any particular dump of Wikipedia? Also, what goes in the
>> commons branch?
>> I ran the framework by copying a sample dump
>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
>> into both the en and commons branches, unzipping it and renaming it
>> as required. For now I'm working with the en language only. It works
>> with the default 19 extractors but starts failing if I include the
>> AbstractExtractor. It seems the AbstractExtractor requires an
>> instance of MediaWiki running to parse MediaWiki syntax. From the
>> file itself: "DBpedia-customized MediaWiki instance is required."
>> Can someone shed some more light on this? What customization is
>> required? Where can I get one?
>>
>> Sorry if the questions are too basic and already answered somewhere;
>> I have looked but couldn't find anything myself.
>> Also another question: Is there a reason for the delay between
>> subsequent DBpedia releases? I was wondering, if the code is already
>> there, why does it take 6 months between DBpedia releases? Is there
>> a manual editorial step involved, or is it due to development/changes
>> in the framework code which are collated in every release?
>>
>> Thanks and regards,
>>
>> Amit
>> Tech Lead
>> Cloud and Platform Group
>> Yahoo!
------------------------------------------------------------------------------
_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion