When using mvn scala:run, set the heap size via MAVEN_OPTS (e.g. MAVEN_OPTS=-Xmx...) rather than JAVA_OPTS, since MAVEN_OPTS is what the mvn launcher actually reads. Also, the dump comes as 27 smaller files rather than one big one; you can run the extraction over those as an alternative.
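For example, something like this (the heap size is just an illustration; the full English dump needs several GB, so give it as much as your machine allows):

  export MAVEN_OPTS="-Xmx4g"
  cd dump
  mvn scala:run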
--
@tommychheng
qwiki.com

On Wed, Nov 30, 2011 at 11:01 PM, Amit Kumar <amitk...@yahoo-inc.com> wrote:
>
> Hi Pablo,
> Thanks for your valuable input. I got the MediaWiki thing working and am
> able to run the abstract extractor as well.
>
> The extraction framework works well on a small sample dataset, e.g.
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2,
> which has around 6300 entries. But when I try to run the framework on the
> full Wikipedia data (en, around 33 GB uncompressed), I get Java heap space
> errors.
>
> --------------------------------------
> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>     at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
>     at java.lang.StringBuilder.<init>(StringBuilder.java:80)
>     at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:43)
>     at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:48)
>     at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:48)
>     at org.dbpedia.extraction.dump.Extract$ExtractionThread$$anonfun$run$1.apply(Extract.scala:34)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:652)
>     at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:333)
>     at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
>     at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:80)
>     at org.dbpedia.extraction.dump.Extract$ExtractionThread.run(Extract.scala:34)
>
> Exception in thread "Thread-6" java.lang.OutOfMemoryError: Java heap space
>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>     at scala.collection.immutable.List.foreach(List.scala:45)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>     at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>     at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>     at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:35)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>     at scala.collection.immutable.List.foreach(List.scala:45)
>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>     at scala.Option.foreach(Option.scala:198)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>     at scala.collection.immutable.List.foreach(List.scala:45)
>     at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>     at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>     at scala.collection.immutable.List.foreach(List.scala:45)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
>
> There are also several instances of "GC overhead limit exceeded" errors:
>
> Exception in thread "Thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:120)
>     at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:42)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>     at scala.collection.immutable.List.foreach(List.scala:45)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>     at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:128)
>     at scala.collection.immutable.List.$colon$colon$colon(List.scala:78)
>     at org.dbpedia.extraction.destinations.Graph.merge(Graph.scala:26)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>     at scala.collection.immutable.List.foreach(List.scala:45)
>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:39)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$writeType$1$1.apply(TableMapping.scala:37)
>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>     at scala.collection.immutable.List.foreach(List.scala:45)
>     at org.dbpedia.extraction.mappings.TableMapping.writeType$1(TableMapping.scala:37)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:73)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1$$anonfun$apply$4.apply(TableMapping.scala:64)
>     at scala.Option.foreach(Option.scala:198)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:64)
>     at org.dbpedia.extraction.mappings.TableMapping$$anonfun$extractTable$1.apply(TableMapping.scala:63)
>     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>     at scala.collection.immutable.List.foreach(List.scala:45)
>     at org.dbpedia.extraction.mappings.TableMapping.extractTable(TableMapping.scala:63)
>     at org.dbpedia.extraction.mappings.TableMapping.extract(TableMapping.scala:24)
>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>     at org.dbpedia.extraction.mappings.MappingExtractor$$anonfun$1.apply(MappingExtractor.scala:47)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
>
> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
> SEVERE: Error reading pages. Shutting down...
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOfRange(Arrays.java:3209)
>     at java.lang.String.<init>(String.java:215)
>     at java.lang.StringBuffer.toString(StringBuffer.java:585)
>     at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getElementText(XMLStreamReaderImpl.java:859)
>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:241)
>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:203)
>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:159)
>     at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:107)
>     at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:87)
>     at org.dbpedia.extraction.sources.XMLSource$XMLFileSource.foreach(XMLSource.scala:40)
>     at org.dbpedia.extraction.dump.ExtractionJob.run(ExtractionJob.scala:54)
> Nov 18, 2011 7:58:14 AM org.dbpedia.extraction.dump.ExtractionJob run
>
> I'm trying to run this on both 32-bit and 64-bit machines (dev boxes), but
> to no avail. I'm guessing the default JVM configurations are too low for
> the DEF (DBpedia Extraction Framework). It would be great if someone could
> tell me the minimum memory requirement for the extraction framework. I
> tried passing JVM options such as -Xmx to the 'mvn scala:run' command, but
> it seems that mvn spawns another process and fails to pass the flags on to
> it. If someone has been able to run the framework, could you please share
> the details?
>
> Also, we are looking into running the framework over Hadoop. Has anyone
> tried that yet? If yes, could you share your experience, and whether it is
> really possible to run this on Hadoop without many changes and hacks?
>
> Thanks
> Amit
>
> On 11/23/11 2:42 PM, "Pablo Mendes" <pablomen...@gmail.com> wrote:
>
> Hi Amit,
> Thanks for your interest in DBpedia. Most of my effort has gone into
> DBpedia Spotlight, but I can try to help with the DBpedia Extraction
> Framework as well. Maybe the core developers can chip in if I misrepresent
> somewhere.
>
> 1) [more docs]
>
> I am unaware.
>
> 2) [typo in config]
>
> Seems ok.
>
> 3) ... Am I right? Does the framework work on any particular dump of
> Wikipedia? Also what goes in the commons branch?
>
> Yes. As far as I can tell, you're right. But there is no particular dump.
> You just need to follow the convention for the directory structure. The
> commons directory has a similar structure, see:
>
> wikipediaDump/commons/20110729/commonswiki-20110729-pages-articles.xml
>
> I think this file is only used by the image extractor and maybe a couple
> of others. Maybe it should only be mandatory if the corresponding
> extractors are included in the config. But it's likely nobody got around
> to implementing that catch yet.
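>
> Spelled out, that convention gives a layout roughly like the following
> (the dates are only examples; use the dates of the dumps you actually
> downloaded):
>
>   wikipediaDump/
>     commons/
>       20110729/
>         commonswiki-20110729-pages-articles.xml
>     en/
>       20111111/
>         enwiki-20111111-pages-articles.xml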
> 4) It seems the AbstractExtractor requires a running MediaWiki instance
> to parse MediaWiki syntax. ... Can someone shed some more light on this?
> What customization is required? Where can I get one?
>
> The abstract extractor is used to render inline templates, as many
> articles start with automatically generated content from templates. See:
> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/945c24bdc54c/abstractExtraction
>
> Also another question: Is there a reason for the delay between subsequent
> DBpedia releases? I was wondering, if the code is already there, why does
> it take 6 months between releases? Is there a manual editorial step
> involved, or is it due to development/changes in the framework code which
> are collated in every release?
>
> One reason might be that a lot of the value in DBpedia comes from manually
> generated "homogenization" in mappings.dbpedia.org. That, plus getting a
> stable version of the framework tested and run, would probably explain the
> choice of periodicity.
>
> Best,
> Pablo
>
> On Tue, Nov 22, 2011 at 12:03 PM, Amit Kumar <amitk...@yahoo-inc.com> wrote:
>
> Hey everyone,
> I'm trying to set up the DBpedia extraction framework, as I'm interested
> in getting structured data from already downloaded Wikipedia dumps. As per
> my understanding, I need to work in the 'dump' directory of the codebase.
> I have tried to reverse engineer it (given that Scala is new for me), but
> I need some help.
>
> First of all, is there more detailed documentation somewhere about setting
> up and running the pipeline? The one available on dbpedia.org seems
> insufficient.
> I understand that I need to create a config.properties file first, where I
> set up the input/output locations, the list of extractors, and the
> languages. I tried working with the config.properties.default given in the
> code. There seems to be a typo in the extractor list:
> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor'
> gives a 'class not found' error. I changed it to
> 'org.dbpedia.extraction.mappings.InterLanguageLinksExtractor'. Is that ok?
> I can't find documentation on how to set up the input directory. Can
> someone tell me the details? From what I gather, the input directory
> should contain a 'commons' directory plus a directory for each language
> set in config.properties. All these directories must have a subdirectory
> whose name is in YYYYMMDD format. Within that, you save the XML files,
> such as enwiki-20111111-pages-articles.xml. Am I right? Does the framework
> work on any particular dump of Wikipedia? Also, what goes in the commons
> branch?
> I ran the framework by copying a sample dump,
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2,
> into both the en and commons branches, unzipping the files and renaming
> them as required. For now I'm working with the en language only. It works
> with the default 19 extractors, but starts failing if I include
> AbstractExtractor. It seems the AbstractExtractor requires a running
> MediaWiki instance to parse MediaWiki syntax. From the file itself:
> "DBpedia-customized MediaWiki instance is required." Can someone shed some
> more light on this? What customization is required? Where can I get one?
>
> Sorry if the questions are too basic and already answered somewhere; I
> have looked but couldn't find anything myself.
> Also another question: Is there a reason for the delay between subsequent
> DBpedia releases? I was wondering, if the code is already there, why does
> it take 6 months between releases? Is there a manual editorial step
> involved, or is it due to development/changes in the framework code which
> are collated in every release?
>
> Thanks and regards,
>
> Amit
> Tech Lead
> Cloud and Platform Group
> Yahoo!
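For anyone hitting the same setup questions: the relevant part of a working config might look roughly like this. It's only a sketch based on the config.properties.default discussed above; exact property names can differ between framework versions, and the paths and extractor list here are illustrative:

  # input directory containing the commons/ and en/ subdirectories
  dumpDir=/data/wikipediaDump
  # where the extracted triples are written
  outputDir=/data/output
  languages=en
  # note the corrected class name (no duplicated "Extractor" suffix)
  extractors=org.dbpedia.extraction.mappings.InterLanguageLinksExtractor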
--
@tommychheng
http://tommy.chheng.com