Hi all, a bug in TableMapping caused these memory problems:
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/rev/77ba9bd157f8

For now I just commented them out; I'll try to actually fix them later.

Regards,
Christopher

On Fri, Dec 30, 2011 at 17:20, Max Jakob <max.ja...@gmail.com> wrote:
> Hi Amit,
>
> sorry for weighing in late on the subject, but I have a suspicion.
>
> The DBpedia parser that parses the MediaWiki markup is not great at
> parsing tables. The reason is that the focus previously was on
> extracting information from infoboxes.
>
> The MappingExtractor class is the only one that attempts to use table
> mappings. If there are mis-parsed table structures, this could lead to
> infinite loops.
>
> All three pages that you mentioned contain tables. There might be
> syntactic constructions that the parser can't cope with at the
> moment.
>
> If you are able to track down the bug, it would be tremendously
> helpful if you could fix it.
>
> Best regards,
> Max
>
> On Wed, Dec 14, 2011 at 08:40, Amit Kumar <amitk...@yahoo-inc.com> wrote:
>> Hi Pablo,
>> I have narrowed down the memory issue that I had been facing. After going
>> unsuccessfully through the whole enwiki dump, I ran the DEF on a smaller
>> dump. I picked
>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles27.xml-p029625017p033928886.bz2
>> After multiple runs and experiments, we found that there are three pages
>> in the dump where the DEF sort of gets stuck and the heap overshoots any
>> limit you give. The three pages are
>>
>> http://en.wikipedia.org/wiki/Mikoyan-Gurevich_MiG-21_variants
>> http://en.wikipedia.org/wiki/List_of_fastest_production_motorcycles
>> http://en.wikipedia.org/wiki/Chevrolet_small-block_engine_table
>>
>> If you skip these three pages (by skipping them in
>> dump/.../ExtractionJob.scala) the framework runs successfully. On further
>> research I found that it is only the MappingExtractor which is causing
>> the problem.
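The page-skipping workaround just described could be sketched roughly as follows. The object and method names here are my own invention, not the framework's API; the real change would be a filter of this shape inside dump/.../ExtractionJob.scala.

```scala
// Hedged sketch of the workaround from this thread: skip the three
// pages on which MappingExtractor runs away, until TableMapping is fixed.
object PageSkipList {
  // Titles of the three problematic pages mentioned above.
  private val skipTitles = Set(
    "Mikoyan-Gurevich MiG-21 variants",
    "List of fastest production motorcycles",
    "Chevrolet small-block engine table"
  )

  /** Returns true if the page should be passed on to the extractors. */
  def shouldExtract(title: String): Boolean = !skipTitles.contains(title)
}
```

Any page whose title is on the list would simply never reach the extractors, which matches the observation that the rest of the 1.5 million pages go through fine.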
>> Once you remove that from the config.properties file, everything works.
>>
>> So, from what we know, among the approximately 1.5 million pages in the
>> smaller dump, the MappingExtractor fails on these three pages, taking the
>> whole JVM with it. I'm attaching three XML files (1 wiki page each). Of
>> these, the DEF will only run on India.xml; for the other two it keeps
>> failing unless you remove the MappingExtractor. There is something about
>> these three pages that is not normal (there would be more in the complete
>> Wikipedia dump). From the source file it looks like the MappingExtractor
>> works on extracting data from infoboxes, and interestingly none of the
>> three pages has an infobox in it. Could this be the reason?
>>
>> Can someone please look into this? I'm wondering how you were able to
>> generate the 3.7 DBpedia dump. Did you skip the MappingExtractor? Or did
>> the problems in the pages get introduced after the 3.7 run? If that is
>> the case, we would need to fix this, as it would definitely fail during
>> the next release.
>>
>> Thanks and Regards
>> Amit
>>
>> On 12/1/11 4:47 PM, "Amit X Kumar" <amitk...@yahoo-inc.com> wrote:
>>
>> Hi Pablo,
>> I figured this out just after sending my email. I'm experimenting with
>> some values right now. I'll let you know if I get it to work. In the
>> meanwhile, if someone already has working values, it would be a big help.
>>
>> Plus, do you know anyone running the DEF on Hadoop?
>>
>> Thanks
>> Amit
>>
>> On 12/1/11 4:39 PM, "Pablo Mendes" <pablomen...@gmail.com> wrote:
>>
>> Hi Amit,
>>
>>> "I tried giving JVM options such as -Xmx to the 'mvn scala:run' command,
>>> but it seems that the mvn command spawns another process and fails to
>>> pass on the flags to the new one. If someone has been able to run the
>>> framework, could you please share the details."
>> The easiest way to get it working is probably to change the value in
>> dump/pom.xml here:
>>
>> <launcher>
>>   <id>Extract</id>
>>   <mainClass>org.dbpedia.extraction.dump.Extract</mainClass>
>>   <jvmArgs>
>>     <jvmArg>-Xmx1024m</jvmArg>
>>   </jvmArgs>
>> </launcher>
>>
>> Cheers,
>> Pablo
>>
>> On Thu, Dec 1, 2011 at 8:01 AM, Amit Kumar <amitk...@yahoo-inc.com> wrote:
>>
>> Hi Pablo,
>> Thanks for your valuable input. I got the MediaWiki thing working and am
>> able to run the abstract extractor as well.
>>
>> The extraction framework works well for a small sample dataset, e.g.
>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
>> which has around 6300 entries. But when I try to run the framework on the
>> full Wikipedia data (en, around 33 GB uncompressed) I get Java heap space
>> errors.
>>
>> --------------------------------------
>> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>>   at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>>   at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>>   at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:43)
>>   at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:48)
>>   at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:48)
>>   at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:34)
>>   at scala.collection.Iterator$class.foreach(Iterator.scala:652)
>>   at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:333)
>>   at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
>>   at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
>>   at org.dbpedia.extraction.dump.Extract$ExtractionThread.run(Extract.scala:34)
>> Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
>>   at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>>   at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>>   at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>   at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>   at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>>   at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>>   at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>>   at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:35)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>>   at scala.Option.foreach(Option.scala:198)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>>   at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>>   at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>   at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
>>
>> There are also several instances of GC overhead limit errors:
>> Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>   at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>>   at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>>   at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>   at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>   at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>>   at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>>   at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>>   at scala.Option.foreach(Option.scala:198)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>>   at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>>   at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>>   at scala.collection.immutable.List.foreach(List.scala:45)
>>   at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>>   at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>>   at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>   at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
>> SEVERE: Error reading pages. Shutting down...
>> java.lang.OutOfMemoryError: Java heap space
>>   at java.util.Arrays.copyOfRange(Arrays.java:3209)
>>   at java.lang.String.<init>(String.java:215)
>>   at java.lang.StringBuffer.toString(StringBuffer.java:585)
>>   at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getElementText(XMLStreamReaderImpl.java:859)
>>   at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:241)
>>   at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:203)
>>   at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:159)
>>   at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:107)
>>   at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:87)
>>   at org.dbpedia.extraction.sources.XMLSource$XMLFileSource.foreach(XMLSource.scala:40)
>>   at org.dbpedia.extraction.dump.ExtractionJob.run(ExtractionJob.scala:54)
>> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
>>
>> I'm trying to run this on both 32-bit and 64-bit machines (dev boxes),
>> but to no avail. I'm guessing the default JVM configurations are too low
>> for the DEF. It would be great if someone could tell me the minimum
>> memory requirement for the extraction framework. I tried giving JVM
>> options such as -Xmx to the 'mvn scala:run' command, but it seems that
>> the mvn command spawns another process and fails to pass on the flags to
>> the new one. If someone has been able to run the framework, could you
>> please share the details.
>>
>> Also, we are looking into running the framework over Hadoop. Has anyone
>> tried that yet? If yes, could you share your experience, and whether it
>> is really possible to run this on Hadoop without many changes and hacks.
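Amit's suspicion that mvn spawns a second JVM without the flags can be checked directly by logging the heap ceiling from inside the extraction process. A standalone sketch (the object name is my own; the one-liner could equally be dropped into Extract's main method):

```scala
// Prints the maximum heap the current JVM will use, so you can see
// whether an -Xmx setting (e.g. the <jvmArg> in dump/pom.xml) actually
// reached the forked extraction process.
object HeapCheck {
  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / (1024 * 1024)

  def main(args: Array[String]): Unit =
    println(s"Max heap available to this JVM: ${maxHeapMb} MB")
}
```

If the printed value matches the JVM default rather than the -Xmx you passed on the command line, the flag was consumed by the Maven JVM and never propagated, which is consistent with the behaviour described above.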
>> Thanks
>> Amit
>>
>> On 11/23/11 2:42 PM, "Pablo Mendes" <pablomen...@gmail.com> wrote:
>>
>> Hi Amit,
>> Thanks for your interest in DBpedia. Most of my effort has gone into
>> DBpedia Spotlight, but I can try to help with the DBpedia Extraction
>> Framework as well. Maybe the core developers can chip in if I
>> misrepresent something.
>>
>> 1) [more docs]
>>
>> I am unaware of any.
>>
>> 2) [typo in config]
>>
>> Seems ok.
>>
>> 3) ... Am I right? Does the framework work on any particular dump of
>> Wikipedia? Also, what goes in the commons branch?
>>
>> Yes, as far as I can tell, you're right. But there is no particular
>> dump; you just need to follow the convention for the directory
>> structure. The commons directory has a similar structure, see:
>>
>> wikipediaDump/commons/20110729/commonswiki-20110729-pages-articles.xml
>>
>> I think this file is only used by the image extractor and maybe a
>> couple of others. Maybe it should only be mandatory if the
>> corresponding extractors are included in the config, but it's likely
>> nobody got around to implementing that catch yet.
>>
>> 4) It seems the AbstractExtractor requires an instance of MediaWiki
>> running to parse MediaWiki syntax. ... Can someone shed some more
>> light on this? What customization is required? Where can I get one?
>>
>> The abstract extractor is used to render inline templates, as many
>> articles start with automatically generated content from templates. See:
>>
>> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/945c24bdc54c/abstractExtraction
>>
>> Also another question: Is there a reason for the delay between
>> subsequent DBpedia releases? I was wondering, if the code is already
>> there, why does it take 6 months between DBpedia releases?
Is there a manual editorial step involved, or is it due to
>> development/changes in the framework code which are collated in every
>> release?
>>
>> One reason might be that a lot of the value in DBpedia comes from
>> manually generated "homogenization" in mappings.dbpedia.org. That,
>> plus getting a stable version of the framework tested and run, would
>> probably explain the choice of periodicity.
>>
>> Best,
>> Pablo
>>
>> On Tue, Nov 22, 2011 at 12:03 PM, Amit Kumar <amitk...@yahoo-inc.com> wrote:
>>
>> Hey everyone,
>> I'm trying to set up the DBpedia extraction framework, as I'm
>> interested in getting structured data from already downloaded
>> Wikipedia dumps. As per my understanding, I need to work in the
>> 'dump' directory of the codebase. I have tried to reverse engineer it
>> (given Scala is new to me), but I need some help.
>>
>> First of all, is there more detailed documentation somewhere about
>> setting up and running the pipeline? The one available on dbpedia.org
>> seems insufficient.
>> I understand that I need to create a config.properties file first,
>> where I need to set up input/output locations, the list of extractors
>> and the languages. I tried working with the config.properties.default
>> given in the code. There seems to be a typo in the extractor list:
>> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor'
>> gives a 'class not found' error. I changed it to
>> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractor'. Is that ok?
>> I can't find documentation on how to set up the input directory. Can
>> someone tell me the details? From what I gather, the input directory
>> should contain a 'commons' directory plus a directory for each
>> language set in config.properties.
All these directories must have a subdirectory whose name is in
>> YYYYMMDD format. Within that you save the XML files, such as
>> enwiki-20111111-pages-articles.xml. Am I right? Does the framework
>> work on any particular dump of Wikipedia? Also, what goes in the
>> commons branch?
>> I ran the framework by copying a sample dump
>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
>> into both the en and commons branches, unzipping it and renaming it
>> as required. For now I'm working with the en language only. It works
>> with the default 19 extractors but starts failing if I include the
>> AbstractExtractor. It seems the AbstractExtractor requires an
>> instance of MediaWiki running to parse MediaWiki syntax. From the
>> file itself: "DBpedia-customized MediaWiki instance is required."
>> Can someone shed some more light on this? What customization is
>> required? Where can I get one?
>>
>> Sorry if the questions are too basic and already answered somewhere;
>> I have looked but couldn't find anything myself.
>> Also another question: Is there a reason for the delay between
>> subsequent DBpedia releases? I was wondering, if the code is already
>> there, why does it take 6 months between DBpedia releases? Is there
>> a manual editorial step involved, or is it due to development/changes
>> in the framework code which are collated in every release?
>>
>> Thanks and regards,
>>
>> Amit
>> Tech Lead
>> Cloud and Platform Group
>> Yahoo!
------------------------------------------------------------------------------
_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion