Hi Pablo,
Thanks for your valuable input. I got the MediaWiki setup working and am able
to run the abstract extractor as well.
The extraction framework works well for a small sample dataset, e.g.
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
which has around 6300 entries. But when I try to run the framework on the full
Wikipedia data (en, around 33 GB uncompressed), I get Java heap space errors.
--------------------------------------
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
at java.lang.StringBuilder.<init>(StringBuilder.java:80)
at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:43)
at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:48)
at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:48)
at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:34)
at scala.collection.Iterator$class.foreach(Iterator.scala:652)
at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:333)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
at org.dbpedia.extraction.dump.Extract$ExtractionThread.run(Extract.scala:34)
Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:35)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
at scala.Option.foreach(Option.scala:198)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
There are several instances of GC overhead limit errors as well:
Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
at scala.Option.foreach(Option.scala:198)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
SEVERE: Error reading pages. Shutting down...
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:215)
at java.lang.StringBuffer.toString(StringBuffer.java:585)
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getElementText(XMLStreamReaderImpl.java:859)
at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:241)
at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:203)
at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:159)
at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:107)
at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:87)
at org.dbpedia.extraction.sources.XMLSource$XMLFileSource.foreach(XMLSource.scala:40)
at org.dbpedia.extraction.dump.ExtractionJob.run(ExtractionJob.scala:54)
I’ve tried running this on both a 32-bit and a 64-bit machine (dev boxes), but
to no avail. I’m guessing the default JVM configuration is too low for the DEF
(DBpedia Extraction Framework). It would be great if someone could tell me the
minimum memory requirement for the extraction framework. I tried passing JVM
options such as -Xmx to the ‘mvn scala:run’ command, but it seems that mvn
spawns another process and fails to pass the flags on to it. If someone has
been able to run the framework on the full dump, could you please share the
details?
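From reading the maven-scala-plugin docs, my guess is that the forked JVM only
picks up flags set in the plugin configuration, not flags given to mvn itself.
Something along these lines in the dump module’s pom.xml might work (untested
on my side; the launcher id and heap size are placeholders, only the main class
is taken from the stack traces above):

<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <configuration>
    <launchers>
      <launcher>
        <id>extraction</id>
        <mainClass>org.dbpedia.extraction.dump.Extract</mainClass>
        <jvmArgs>
          <!-- heap size is a guess; the full en dump will likely need several GB -->
          <jvmArg>-Xmx6g</jvmArg>
        </jvmArgs>
      </launcher>
    </launchers>
  </configuration>
</plugin>

If I read the docs right, that launcher would then be invoked with
‘mvn scala:run -Dlauncher=extraction’. Can anyone confirm?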
Also, we are looking into running the framework on Hadoop. Has anyone tried
that yet? If yes, could you share your experience, and whether it is really
possible to run this on Hadoop without many changes and hacks?
Thanks
Amit
On 11/23/11 2:42 PM, "Pablo Mendes" <pablomen...@gmail.com> wrote:
Hi Amit,
Thanks for your interest in DBpedia. Most of my effort has gone into DBpedia
Spotlight, but I can try to help with the DBpedia Extraction Framework as well.
Maybe the core developers can chip in if I misrepresent anything.
1) [more docs]
I am unaware.
2) [typo in config]
Seems ok.
3) ... Am I right? Does the framework work on any particular dump of
Wikipedia? Also, what goes in the commons branch?
Yes, as far as I can tell, you're right. But there is no particular dump; you
just need to follow the convention for the directory structure. The commons
directory has a similar structure, see:
wikipediaDump/commons/20110729/commonswiki-20110729-pages-articles.xml
I think this file is only used by the image extractor and maybe a couple of
others. It should probably only be mandatory if the corresponding extractors
are included in the config, but it's likely nobody has gotten around to
implementing that check yet.
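So, for a single language plus commons, the input directory should look
roughly like this (the dates are just examples; use whatever YYYYMMDD dates
your dumps carry):

wikipediaDump/
    commons/
        20110729/
            commonswiki-20110729-pages-articles.xml
    en/
        20111111/
            enwiki-20111111-pages-articles.xml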
4) It seems the AbstractExtractor requires a running MediaWiki instance to
parse MediaWiki syntax. ... Can someone shed some more light on this? What
customization is required? Where can I get one?
The abstract extractor is used to render inline templates, as many articles
start with automatically generated content from templates. See:
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/945c24bdc54c/abstractExtraction
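In essence, the extractor sends each article's wiki source to that MediaWiki
instance and gets the rendered text back. Here is a rough sketch of that kind
of call in Scala (the endpoint path and parameters are assumptions for
illustration; the real logic lives in AbstractExtractor):

import java.net.{URL, URLEncoder}
import scala.io.Source

object RenderSketch {
  def main(args: Array[String]): Unit = {
    // Wiki markup we want the local MediaWiki to render.
    val wikiText = "'''Example''' is a [[term]]."
    // action=parse asks the MediaWiki API to render the given text.
    val query = "action=parse&format=xml&text=" + URLEncoder.encode(wikiText, "UTF-8")
    // Hypothetical local endpoint; point this at your customized instance.
    val url = new URL("http://localhost/mediawiki/api.php?" + query)
    // Read the response; the rendered HTML sits inside the API's XML envelope.
    val response = Source.fromInputStream(url.openStream(), "UTF-8").mkString
    println(response)
  }
}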
Also, another question: is there a reason for the delay between subsequent
DBpedia releases? I was wondering, if the code is already there, why does it
take six months between DBpedia releases? Is there a manual editorial step
involved, or is it due to development/changes in the framework code which are
collated into every release?
One reason might be that a lot of the value in DBpedia comes from the manually
created "homogenization" in the mappings at http://mappings.dbpedia.org. That,
plus the effort of getting a stable version of the framework tested and run,
probably explains the choice of periodicity.
Best,
Pablo
On Tue, Nov 22, 2011 at 12:03 PM, Amit Kumar <amitk...@yahoo-inc.com> wrote:
Hey everyone,
I’m trying to set up the DBpedia extraction framework, as I’m interested in
getting structured data from already downloaded Wikipedia dumps. As per my
understanding, I need to work in the ‘dump’ directory of the codebase. I have
tried to reverse engineer it (Scala is new to me), but I need some help.
1. First of all, is there more detailed documentation somewhere about setting
up and running the pipeline? The documentation available on dbpedia.org seems
insufficient.
2. I understand that I need to create a config.properties file first, where I
set up the input/output locations, the list of extractors, and the languages.
I tried working with the config.properties.default given in the code. There
seems to be a typo in the extractor list:
‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor’ gives a
‘class not found’ error. I changed it to
‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractor’. Is that OK?
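For reference, the trimmed-down config I’m testing with looks roughly like
this (the key names and list format are my reading of config.properties.default,
so please correct me if I’m wrong; the paths are placeholders):

dumpDir=/data/wikipediaDump
outputDir=/data/output
languages=en
extractors=org.dbpedia.extraction.mappings.LabelExtractor,org.dbpedia.extraction.mappings.InterLanguageLinksExtractor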
3. I can’t find documentation on how to set up the input directory. Can
someone give me the details? From what I gather, the input directory should
contain a ‘commons’ directory plus a directory for each language set in
config.properties. All these directories must have a subdirectory whose name
is in YYYYMMDD format, and within that you place the XML files such as
enwiki-20111111-pages-articles.xml. Am I right? Does the framework work on any
particular dump of Wikipedia? Also, what goes in the commons branch?
4. I ran the framework by copying a sample dump,
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
into both the en and commons branches, unzipping it and renaming it as per the
requirements. For now I’m working with the en language only. It works with the
default 19 extractors, but starts failing if I include the AbstractExtractor.
It seems the AbstractExtractor requires a running MediaWiki instance to parse
MediaWiki syntax. From the file itself: “DBpedia-customized MediaWiki instance
is required.” Can someone shed some more light on this? What customization is
required? Where can I get one?
Sorry if the questions are too basic and already answered somewhere. I have
tried looking but couldn’t find anything myself.
Also, another question: is there a reason for the delay between subsequent
DBpedia releases? I was wondering, if the code is already there, why does it
take six months between DBpedia releases? Is there a manual editorial step
involved, or is it due to development/changes in the framework code which are
collated into every release?
Thanks and regards,
Amit
Tech Lead
Cloud and Platform Group
Yahoo!