Hi Amit,

> "I tried giving jvm options such  –Xmx to the ‘mvn scala:run’ command,
but it seems that the mvn command spawn another processes and fails to pass
on the flags to the new one. If someone has been able to run the framework,
could you please share me the details."

The easiest way to get it working is probably to change the -Xmx value in
the <launcher> section of dump/pom.xml. As you noticed, 'mvn scala:run'
forks a separate JVM for the extraction, so memory flags passed to Maven
itself generally don't reach that process; the launcher's <jvmArgs> do:

    <launcher>
        <id>Extract</id>
        <mainClass>org.dbpedia.extraction.dump.Extract</mainClass>
        <jvmArgs>
            <jvmArg>-Xmx1024m</jvmArg>
        </jvmArgs>
    </launcher>
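
For the full English dump, 1024m will almost certainly not be enough. As a
rough sketch (the exact number is an assumption and depends on your machine
and on which extractors you enable), you could bump it to something like:

    <launcher>
        <id>Extract</id>
        <mainClass>org.dbpedia.extraction.dump.Extract</mainClass>
        <jvmArgs>
            <!-- illustrative value only: give the forked extraction JVM a few GB of heap -->
            <jvmArg>-Xmx4096m</jvmArg>
        </jvmArgs>
    </launcher>

Note that a 32-bit JVM typically cannot address much more than roughly 2 GB
of heap, so the full en dump is better run on your 64-bit box.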


Cheers,
Pablo

On Thu, Dec 1, 2011 at 8:01 AM, Amit Kumar <amitk...@yahoo-inc.com> wrote:

>
> Hi Pablo,
> Thanks for your valuable input. I got the MediaWiki thing working and am
> able to run the abstract extractor as well.
>
> The extraction framework works well for a small sample dataset, e.g.
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
> which has around 6300 entries. But when I try to run the framework on the full
> Wikipedia data (en, around 33 GB uncompressed) I get Java heap space errors.
>
> --------------------------------------
> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>         at
> java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>         at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>         at
> scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:43)
>         at
> scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:48)
>         at
> org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:48)
>         at
> org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:34)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:652)
>         at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:333)
>         at
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
>         at
> scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
>         at
> org.dbpedia.extraction.dump.Extract$ExtractionThread.run(Extract.scala:34)
> Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
>         at
> scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>         at
> scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>         at
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>         at
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>         at
> scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>         at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>         at
> org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:35)
>         at
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>         at
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>         at
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at
> org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>         at
>
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>         at
>
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>         at scala.Option.foreach(Option.scala:198)
>         at
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>         at
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>         at
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at
> org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>         at
> org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>         at
> org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at
> org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>         at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>         at
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
>
>
> There are also several instances of GC overhead limit errors:
> Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead
> limit exceeded
>         at
> scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>         at
> scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>         at
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>         at
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>         at
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>         at
> scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>         at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>         at
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>         at
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>         at
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at
> org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>         at
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>         at
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>         at
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at
> org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>         at
>
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>         at
>
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>         at scala.Option.foreach(Option.scala:198)
>         at
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>         at
> org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>         at
> scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>         at scala.collection.immutable.List.foreach(List.scala:45)
>         at
> org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>         at
> org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>         at
> org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at
> org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>         at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
> SEVERE: Error reading pages. Shutting down...
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOfRange(Arrays.java:3209)
>         at java.lang.String.<init>(String.java:215)
>         at java.lang.StringBuffer.toString(StringBuffer.java:585)
>         at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getElementText(XMLStreamReaderImpl.java:859)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:241)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:203)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:159)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:107)
>         at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:87)
>         at org.dbpedia.extraction.sources.XMLSource$XMLFileSource.foreach(XMLSource.scala:40)
>         at org.dbpedia.extraction.dump.ExtractionJob.run(ExtractionJob.scala:54)
> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
>
>
> I’m trying to run this on both a 32-bit and a 64-bit machine (dev boxes),
> but to no avail. I’m guessing the default JVM configuration is too low for
> the DEF.
> It would be great if someone could tell me the minimum memory requirement for
> the extraction framework. I tried giving JVM options such as -Xmx to the ‘mvn
> scala:run’ command, but it seems that the mvn command spawns another
> process and fails to pass the flags on to the new one. If someone has
> been able to run the framework, could you please share the details with me?
>
> Also, we are looking into running the framework over Hadoop. Has anyone
> tried that yet? If yes, could you share your experience, and whether it is
> really possible to run this on Hadoop without many changes and hacks?
>
> Thanks
> Amit
>
> On 11/23/11 2:42 PM, "Pablo Mendes" <pablomen...@gmail.com> wrote:
>
>
> Hi Amit,
> Thanks for your interest in DBpedia. Most of my effort has gone into
> DBpedia Spotlight, but I can try to help with the DBpedia Extraction
> Framework as well. Maybe the core developers can chip in if I misrepresent
> anything.
>
> 1) [more docs]
>
>
> I am unaware.
>
>
> 2) [typo in config]
>
>
> Seems ok.
>
>
> 3) ... Am I right? Does the framework work on any particular dump of
> Wikipedia? Also what goes in the commons branch?
>
>
> Yes. As far as I can tell, you're right. But there is no particular dump.
> You just need to follow the convention for the directory structure. The
> commons directory has a similar structure, see:
>
> wikipediaDump/commons/20110729/commonswiki-20110729-pages-articles.xml
>
> I think this file is only used by the image extractor and maybe a couple
> of others. Maybe it should only be mandatory if the corresponding
> extractors are included in the config. But it's likely nobody got around to
> implementing that catch yet.
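>
> For illustration, the expected layout would look something like this (the
> dump dates here are just examples, taken from the file names above and
> below; use whatever date your own dumps carry):
>
> wikipediaDump/
>     commons/
>         20110729/
>             commonswiki-20110729-pages-articles.xml
>     en/
>         20111111/
>             enwiki-20111111-pages-articles.xml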
>
>
> 4) It seems the AbstractExtractor requires an instance of MediaWiki
> running to parse MediaWiki syntax. ... Can someone shed some more light on
> this? What customization is required? Where can I get one?
>
>
> The abstract extractor is used to render inline templates, as many
> articles start with automatically generated content from templates. See:
>
> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/945c24bdc54c/abstractExtraction
>
>
>
> Also, another question: is there a reason for the delay between subsequent
> DBpedia releases? I was wondering, if the code is already there, why does
> it take 6 months between DBpedia releases? Is there a manual editorial
> process involved, or is it due to development/changes in the framework code
> which are collated in every release?
>
>
> One reason might be that a lot of the value in DBpedia comes from manually
> generated "homogenization" in mappings.dbpedia.org
> (http://mappings.dbpedia.org). That, plus getting a stable version of the
> framework tested and run, would probably explain the choice of periodicity.
>
>
> Best,
> Pablo
>
> On Tue, Nov 22, 2011 at 12:03 PM, Amit Kumar <amitk...@yahoo-inc.com>
> wrote:
>
>
> Hey everyone,
> I’m trying to set up the DBpedia extraction framework, as I’m interested in
> getting structured data from already-downloaded Wikipedia dumps. As per my
> understanding, I need to work in the ‘dump’ directory of the codebase. I
> have tried to reverse engineer it (given Scala is new to me), but I need
> some help.
>
>
>    1. First of all, is there more detailed documentation somewhere
>    about setting up and running the pipeline? The one available on
>    dbpedia.org seems insufficient.
>    2. I understand that I need to create a config.properties file first,
>    where I set up the input/output locations, the list of extractors and the
>    languages. I tried working with the config.properties.default given in the
>    code. There seems to be a typo in the extractor list:
>    ‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor’
>    gives a ‘class not found’ error. I changed it to
>    ‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractor’. Is that ok?
>    3. I can’t find the documentation on how to set up the input directory.
>    Can someone tell me the details? From what I gather, the input directory
>    should contain a ‘commons’ directory plus a directory for each language
>    set in config.properties. All these directories must have a subdirectory
>    whose name is in YYYYMMDD format. Within that you save the XML files,
>    such as enwiki-20111111-pages-articles.xml. Am I right? Does the framework
>    work on any particular dump of Wikipedia? Also, what goes in the commons
>    branch?
>    4. I ran the framework by copying a sample dump,
>    http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2,
>    into both the en and commons branches, unzipping it and renaming it as
>    required. For now I’m working with the en language only. It works with the
>    default 19 extractors but starts failing if I include AbstractExtractor.
>    It seems the AbstractExtractor requires an instance of MediaWiki running
>    to parse MediaWiki syntax. From the file itself: “DBpedia-customized
>    MediaWiki instance is required.” Can someone shed some more light on
>    this? What customization is required? Where can I get one?
>
>
>
> Sorry if the questions are too basic and already answered somewhere. I
> have tried looking but couldn’t find anything myself.
> Also, another question: is there a reason for the delay between subsequent
> DBpedia releases? I was wondering, if the code is already there, why does
> it take 6 months between DBpedia releases? Is there a manual editorial
> process involved, or is it due to development/changes in the framework code
> which are collated in every release?
>
>
> Thanks and regards,
>
> Amit
> Tech Lead
> Cloud and Platform Group
> Yahoo!
>
>