On 18 April 2013 16:08, Julien Plu <julien....@redaction-developpez.com> wrote:
> Hi Jona,
>
> The API responds correctly :-( At least I think the
> "SocketTimeoutException" occurs because an abstract doesn't exist, no?

I don't think so. I think it usually happens because generating the
abstract actually takes extremely long for some pages. I don't know
why. Maybe they happen to use some very complex templates. On the
other hand, I took a quick look at the source code of
http://fr.wikipedia.org/wiki/Prix_Ken_Domon and didn't see anything
suspicious.

By the way, an abstract always exists, though it may be empty. The
page content is not stored in the database; we send it to MediaWiki
for each page. That's much faster.

> (because this exception appeared many times during the extraction) but it's
> not blocking.

Yes, many other articles need a lot of time, but it usually
works on the second or third try.

You could also simply increase the number of retries (currently 3) or
the maximum time (currently 4000 ms) in AbstractExtractor.scala.
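To check whether a single page is really the problem, you can also reproduce by hand the POST that the extractor sends to api.php. A rough curl sketch, assuming a local MediaWiki mirror at http://localhost/mediawiki/ and the standard parse parameters - the URL, and possibly the parameters, need adjusting to your setup, and the wikitext placeholder has to be copied from the XML dump:

```shell
# Render one page's wikitext through the local MediaWiki API by hand.
# URL and parameters are assumptions - adjust them to your mirror.
curl -sf --max-time 10 'http://localhost/mediawiki/api.php' \
  --data-urlencode 'action=parse' \
  --data-urlencode 'format=xml' \
  --data-urlencode 'title=Prix Ken Domon' \
  --data-urlencode 'text=...wikitext copied from the frwiki XML dump...' \
  || echo "request failed or timed out"
```

If this call also hangs for a long time, the delay is on the MediaWiki side (probably template expansion), not in the extractor.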

>
> And I have some gzipped files in my dump directories with data inside, so
> the extraction worked until the error occurred.
>
> I'm rerunning the extraction, but with "extraction.default.properties" this
> time; we'll see if there is an improvement...
>
> And the machine is a virtual machine (from VirtualBox) with 2 GB of memory
> and 3 cores from my computer, so it's normal that it's slow. But I will try
> it on another machine, a real server machine.
>
> Best.
>
> Julien.
>
>
> 2013/4/18 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>
>> Hi Julien,
>>
>> That sucks. 21 hours and then it crashes. That's a bummer.
>>
>> I don't know what's going on. You could try calling api.php from the
>> command line using curl and see what happens. Maybe it actually takes
>> extremely long to render that article. Calling api.php is a bit
>> cumbersome though - I think you have to copy the wikitext for the
>> article from the XML dump and construct a POST request. It may be
>> simpler to hack together a little HTML page with a form for all the
>> data you need (page title and content, I think) which POSTs the data
>> to api.php. If you do that, let us know, I'd love to add such a test
>> page to our MediaWiki files in the repo.
>>
>> @all - Is there a simpler way to test the abstract extraction for a
>> single page? I don't remember.
>>
>> By the way, 21 hours for the French Wikipedia sounds pretty slow, if I
>> recall correctly. How many ms per page does the log file say? What
>> kind of machine do you have? I think on our reasonably but not
>> extremely fast machine with four cores it took something like 30 ms
>> per page. Are you sure you activated APC? That makes a huge
>> difference.
>>
>> Good luck,
>> JC
>>
>> On 18 April 2013 11:52, Julien Plu <julien....@redaction-developpez.com>
>> wrote:
>> > Hi,
>> >
>> > After around 21 hours of process the abstract extraction has been
>> > stopped by
>> > a "build failure" :
>> >
>> > avr. 18, 2013 10:33:44 AM org.dbpedia.extraction.mappings.AbstractExtractor$$anonfun$retrievePage$1 apply$mcVI$sp
>> > INFO: Error retrieving abstract of title=Prix Ken
>> > Domon;ns=0/Main/;language:wiki=fr,locale=fr. Retrying...
>> > java.net.SocketTimeoutException: Read timed out
>> >     at java.net.SocketInputStream.socketRead0(Native Method)
>> >     at java.net.SocketInputStream.read(SocketInputStream.java:150)
>> >     at java.net.SocketInputStream.read(SocketInputStream.java:121)
>> >     at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>> >     at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>> >     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>> >     at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:633)
>> >     at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:579)
>> >     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1322)
>> >     at org.dbpedia.extraction.mappings.AbstractExtractor$$anonfun$retrievePage$1.apply$mcVI$sp(AbstractExtractor.scala:124)
>> >     at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:78)
>> >     at org.dbpedia.extraction.mappings.AbstractExtractor.retrievePage(AbstractExtractor.scala:109)
>> >     at org.dbpedia.extraction.mappings.AbstractExtractor.extract(AbstractExtractor.scala:66)
>> >     at org.dbpedia.extraction.mappings.AbstractExtractor.extract(AbstractExtractor.scala:21)
>> >     at org.dbpedia.extraction.mappings.CompositeMapping$$anonfun$extract$1.apply(CompositeMapping.scala:13)
>> >     at org.dbpedia.extraction.mappings.CompositeMapping$$anonfun$extract$1.apply(CompositeMapping.scala:13)
>> >     at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:239)
>> >     at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:239)
>> >     at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
>> >     at scala.collection.immutable.List.foreach(List.scala:76)
>> >     at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:239)
>> >     at scala.collection.immutable.List.flatMap(List.scala:76)
>> >     at org.dbpedia.extraction.mappings.CompositeMapping.extract(CompositeMapping.scala:13)
>> >     at org.dbpedia.extraction.mappings.RootExtractor.apply(RootExtractor.scala:23)
>> >     at org.dbpedia.extraction.dump.extract.ExtractionJob$$anonfun$1.apply(ExtractionJob.scala:29)
>> >     at org.dbpedia.extraction.dump.extract.ExtractionJob$$anonfun$1.apply(ExtractionJob.scala:25)
>> >     at org.dbpedia.extraction.util.SimpleWorkers$$anonfun$apply$1$$anon$2.process(Workers.scala:23)
>> >     at org.dbpedia.extraction.util.Workers$$anonfun$1$$anon$1.run(Workers.scala:131)
>> >
>> > [INFO] ------------------------------------------------------------------------
>> > [INFO] BUILD FAILURE
>> > [INFO] ------------------------------------------------------------------------
>> > [INFO] Total time: 21:33:55.973s
>> > [INFO] Finished at: Thu Apr 18 10:35:37 CEST 2013
>> > [INFO] Final Memory: 10M/147M
>> > [INFO] ------------------------------------------------------------------------
>> > [ERROR] Failed to execute goal org.scala-tools:maven-scala-plugin:2.15.2:run (default-cli) on project dump: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 137(Exit value: 137) -> [Help 1]
>> > [ERROR]
>> > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
>> > [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> > [ERROR]
>> > [ERROR] For more information about the errors and possible solutions, please read the following articles:
>> > [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
>> >
>> > Does anyone know why this error happened? Not enough memory?
>> >
>> > Best.
>> >
>> > Julien.
>> >
>> >
>> > ------------------------------------------------------------------------------
>> > Precog is a next-generation analytics platform capable of advanced
>> > analytics on semi-structured data. The platform includes APIs for
>> > building
>> > apps and a phenomenal toolset for data science. Developers can use
>> > our toolset for easy data analysis & visualization. Get a free account!
>> > http://www2.precog.com/precogplatform/slashdotnewsletter
>> > _______________________________________________
>> > Dbpedia-discussion mailing list
>> > Dbpedia-discussion@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>> >
>
>
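P.S. about the "Exit value: 137" in the quoted Maven log: on Linux an exit code of 128 + n means the process was killed by signal n, so 137 is signal 9 (SIGKILL). In a 2 GB VM the most likely sender is the kernel's OOM killer, which fits the "not enough memory" guess. A quick sanity check (the exact dmesg wording varies by kernel):

```shell
# 137 = 128 + signal number, so the JVM was killed by signal 9 (SIGKILL).
echo $((137 - 128))   # prints 9
kill -l 9             # prints KILL
# If the kernel OOM killer did it, the kernel log will say so:
dmesg 2>/dev/null | grep -iE 'out of memory|killed process' || echo "no OOM record found"
```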

