Hi Pablo,
Thanks for your valuable input. I got the MediaWiki setup working and am able
to run the abstract extractor as well.
The extraction framework works well for a small sample dataset, e.g.
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
which has around 6300 entries. But when I try to run the framework on the full
Wikipedia data (en, around 33 GB uncompressed), I get Java heap space errors.
--------------------------------------
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
at java.lang.StringBuilder.<init>(StringBuilder.java:80)
at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:43)
at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:48)
at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:48)
at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:34)
at scala.collection.Iterator$class.foreach(Iterator.scala:652)
at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:333)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
at org.dbpedia.extraction.dump.Extract$ExtractionThread.run(Extract.scala:34)
Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:35)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
at scala.Option.foreach(Option.scala:198)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
There are several instances of GC overhead limit errors as well:
Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
at scala.Option.foreach(Option.scala:198)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:45)
at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
SEVERE: Error reading pages. Shutting down...
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:215)
at java.lang.StringBuffer.toString(StringBuffer.java:585)
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getElementText(XMLStreamReaderImpl.java:859)
at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:241)
at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:203)
at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:159)
at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:107)
at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:87)
at org.dbpedia.extraction.sources.XMLSource$XMLFileSource.foreach(XMLSource.scala:40)
at org.dbpedia.extraction.dump.ExtractionJob.run(ExtractionJob.scala:54)
I’ve tried running this on both a 32-bit and a 64-bit machine (dev boxes), but
to no avail. I’m guessing the default JVM configuration is too low for the DEF
(DBpedia Extraction Framework). It would be great if someone could tell me the
minimum memory requirement for the extraction framework. I tried passing JVM
options such as -Xmx to the ‘mvn scala:run’ command, but it seems that mvn
spawns another process and fails to pass the flags on to it. If someone has
been able to run the framework on the full dump, could you please share the
details?
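From reading the maven-scala-plugin docs, my guess is that the forked JVM only
picks up flags set in the plugin configuration, not flags given to mvn itself.
Something along these lines in the dump module’s pom.xml might work (untested
on my side; the launcher id and heap size are placeholders, only the main class
is taken from the stack traces above):

<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <configuration>
    <launchers>
      <launcher>
        <id>extraction</id>
        <mainClass>org.dbpedia.extraction.dump.Extract</mainClass>
        <jvmArgs>
          <!-- heap size is a guess; the full en dump will likely need several GB -->
          <jvmArg>-Xmx6g</jvmArg>
        </jvmArgs>
      </launcher>
    </launchers>
  </configuration>
</plugin>

If I read the docs right, that launcher would then be invoked with
‘mvn scala:run -Dlauncher=extraction’. Can anyone confirm?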
Also, we are looking into running the framework on Hadoop. Has anyone tried
that yet? If yes, could you share your experience, and whether it is really
possible to run this on Hadoop without many changes and hacks?
Thanks
Amit
On 11/23/11 2:42 PM, "Pablo Mendes" <pablomen...@gmail.com> wrote:
Hi Amit,
Thanks for your interest in DBpedia. Most of my effort has gone into DBpedia
Spotlight, but I can try to help with the DBpedia Extraction Framework as well.
Maybe the core developers can chip in if I misrepresent anything.
1) [more docs]
I am unaware.
2) [typo in config]
Seems ok.
3) ... Am I right? Does the framework work on any particular dump of
Wikipedia? Also, what goes in the commons branch?
Yes, as far as I can tell, you're right. But there is no particular dump; you
just need to follow the convention for the directory structure. The commons
directory has a similar structure, see:
wikipediaDump/commons/20110729/commonswiki-20110729-pages-articles.xml
I think this file is only used by the image extractor and maybe a couple of
others. It should probably only be mandatory if the corresponding extractors
are included in the config, but it's likely nobody has gotten around to
implementing that check yet.
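So, for a single language plus commons, the input directory should look
roughly like this (the dates are just examples; use whatever YYYYMMDD dates
your dumps carry):

wikipediaDump/
    commons/
        20110729/
            commonswiki-20110729-pages-articles.xml
    en/
        20111111/
            enwiki-20111111-pages-articles.xml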
4) It seems the AbstractExtractor requires a running MediaWiki instance to
parse MediaWiki syntax. ... Can someone shed some more light on this? What
customization is required? Where can I get one?
The abstract extractor is used to render inline templates, as many articles
start with automatically generated content from templates. See:
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/945c24bdc54c/abstractExtraction
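In essence, the extractor sends each article's wiki source to that MediaWiki
instance and gets the rendered text back. Here is a rough sketch of that kind
of call in Scala (the endpoint path and parameters are assumptions for
illustration; the real logic lives in AbstractExtractor):

import java.net.{URL, URLEncoder}
import scala.io.Source

object RenderSketch {
  def main(args: Array[String]): Unit = {
    // Wiki markup we want the local MediaWiki to render.
    val wikiText = "'''Example''' is a [[term]]."
    // action=parse asks the MediaWiki API to render the given text.
    val query = "action=parse&format=xml&text=" + URLEncoder.encode(wikiText, "UTF-8")
    // Hypothetical local endpoint; point this at your customized instance.
    val url = new URL("http://localhost/mediawiki/api.php?" + query)
    // Read the response; the rendered HTML sits inside the API's XML envelope.
    val response = Source.fromInputStream(url.openStream(), "UTF-8").mkString
    println(response)
  }
}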
Also, another question: is there a reason for the delay between subsequent
DBpedia releases? I was wondering, if the code is already there, why does it
take six months between DBpedia releases? Is there a manual editorial step
involved, or is it due to development/changes in the framework code which are
collated into every release?
One reason might be that a lot of the value in DBpedia comes from the manually
created "homogenization" in the mappings at http://mappings.dbpedia.org. That,
plus the effort of getting a stable version of the framework tested and run,
probably explains the choice of periodicity.
Best,
Pablo
On Tue, Nov 22, 2011 at 12:03 PM, Amit Kumar <amitk...@yahoo-inc.com> wrote:
Hey everyone,
I’m trying to set up the DBpedia extraction framework, as I’m interested in
getting structured data from already downloaded Wikipedia dumps. As per my
understanding, I need to work in the ‘dump’ directory of the codebase. I have
tried to reverse engineer it (Scala is new to me), but I need some help.
1. First of all, is there more detailed documentation somewhere about setting
up and running the pipeline? The documentation available on dbpedia.org seems
insufficient.
2. I understand that I need to create a config.properties file first, where I
set up the input/output locations, the list of extractors, and the languages.
I tried working with the config.properties.default given in the code. There
seems to be a typo in the extractor list:
‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor’ gives a
‘class not found’ error. I changed it to
‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractor’. Is that OK?
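For reference, the trimmed-down config I’m testing with looks roughly like
this (the key names and list format are my reading of config.properties.default,
so please correct me if I’m wrong; the paths are placeholders):

dumpDir=/data/wikipediaDump
outputDir=/data/output
languages=en
extractors=org.dbpedia.extraction.mappings.LabelExtractor,org.dbpedia.extraction.mappings.InterLanguageLinksExtractor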
3. I can’t find documentation on how to set up the input directory. Can
someone give me the details? From what I gather, the input directory should
contain a ‘commons’ directory plus a directory for each language set in
config.properties. All these directories must have a subdirectory whose name
is in YYYYMMDD format, and within that you place the XML files such as
enwiki-20111111-pages-articles.xml. Am I right? Does the framework work on any
particular dump of Wikipedia? Also, what goes in the commons branch?
4. I ran the framework by copying a sample dump,
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
into both the en and commons branches, unzipping it and renaming it as per the
requirements. For now I’m working with the en language only. It works with the
default 19 extractors, but starts failing if I include the AbstractExtractor.
It seems the AbstractExtractor requires a running MediaWiki instance to parse
MediaWiki syntax. From the file itself: “DBpedia-customized MediaWiki instance
is required.” Can someone shed some more light on this? What customization is
required? Where can I get one?
Sorry if the questions are too basic and already answered somewhere. I have
tried looking but couldn’t find anything myself.
Also, another question: is there a reason for the delay between subsequent
DBpedia releases? I was wondering, if the code is already there, why does it
take six months between DBpedia releases? Is there a manual editorial step
involved, or is it due to development/changes in the framework code which are
collated into every release?
Thanks and regards,
Amit
Tech Lead
Cloud and Platform Group
Yahoo!