Hi, can you maybe release this as a patch? I'd like to integrate this on GitHub, but I fear I might miss something if I try to pick out the changes by hand.
Thanks

David Reyes Samblas Martinez wrote:
> Sorry for the wait, Thomas.
> I was working on the broken pipe issue that stops the parser
> when it finds an error. I have applied a quick and dirty workaround
> using the try-catch technique, and now the process does not stop: it just
> skips the faulty article and keeps going :) It logs the faulty ones in
> a text file (title and position) for later forensics. My first
> guess is that this is not a UTF-8 encoding issue but rather an
> unexpected formatting tag that the PHP parser doesn't know how to deal with.
> I am currently parsing the German Wikipedia, which has more than 1.3 million articles:
>
> Count: 1043000
> Failing count: 2
>
> and it keeps going. I suppose we can sacrifice two articles to have
> one million available now :)
>
> As you requested, I uploaded my working compiled tools [1], but without
> any XML sources; the archive is about 113 MB. If you already have working
> tools on your system, you just have to replace
> host-tools/offline-renderer/ArticleParser.py with the one attached to this
> mail. Then you can forget about crying like a child whose ice cream has
> fallen on the floor when, after more than 24 hours of parsing and hundreds
> of thousands of articles processed, you see this ugly Python error
> backtrace, blah blah blah, and not your desired file :)
>
> By the way, faultyarticles.txt is saved in the same
> host-tools/offline-renderer directory. (I was too lazy to add a
> parameter to change that, so I hardcoded the file name.
> Yes... don't waste typing on correcting that bad habit, I know.)
>
> If you are curious about which articles in the German wiki are causing
> trouble in dewiki-latest-pages-articles.xml (date 2009-11-20):
>
> ~Storck Bicycle
> 832673
> ~Musculus serratus posterior inferior
> 857334
>
> Regards. I hope to upload the German Wikipedia on Sunday, so it
> will be available on Monday. Sorry for the wait, but my Asymmetric DSL
> is very asymmetric, and uploading 1.5-2 GB (expected file size) will take
> a bunch of hours.
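
To make sure I understand the workaround before I try to merge it: I read the description above as a skip-and-log loop roughly like the sketch below. This is only my guess, not the code from the attached ArticleParser.py. The names render_all, render_one and demo, the (title, position, text) tuple shape, and the two-line log format (modelled on the listing of the two failing articles) are all assumptions of mine.

```python
import io


def render_all(articles, render_one, fault_log):
    """Render every article; skip and log the ones that raise.

    articles   -- iterable of (title, position, text) tuples (hypothetical shape)
    render_one -- callable doing the real work; may raise on bad input
    fault_log  -- writable text file object for the skipped articles
    Returns (count, failing_count).
    """
    count = 0
    failing = 0
    for title, position, text in articles:
        count += 1
        try:
            render_one(title, text)
        except Exception:
            # Deliberately broad: one bad article must not abort a long run.
            failing += 1
            fault_log.write("~{0}\n{1}\n".format(title, position))
            fault_log.flush()  # keep the log usable if the run is killed later
    return count, failing


def demo():
    def fake_render(title, text):
        # Stand-in for the real renderer: fails on an "unexpected tag".
        if "<bad>" in text:
            raise ValueError("unexpected formatting tag in " + title)

    articles = [
        ("Alpha", 10, "plain text"),
        ("Beta", 20, "text with <bad> markup"),
        ("Gamma", 30, "more plain text"),
    ]
    log = io.StringIO()  # in a real run: open("faultyarticles.txt", "w")
    result = render_all(articles, fake_render, log)
    return result, log.getvalue()


if __name__ == "__main__":
    print(demo())
```

Catching the broad Exception is normally a bad habit, but it seems to match the "quick and dirty" intent here: losing two articles is cheaper than losing a whole day of parsing. A real patch would let me check whether the actual change does anything beyond this.
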
>
> For those who want to compile their own, go for it :) The
> Quickreference in the doc directory of the source is all you need to
> start working. Just remember that if you have a 64-bit system you
> will have to follow the 64-bit method to compile the tools.
>
> Regards
> [1] http://tuxbrain.org/downloads/wikireader/wikireaderbinaries20091127_dsamblas_modified_trycatch.tar.bz2
> David Reyes Samblas Martinez
> http://www.tuxbrain.com
> Open ultraportable & embedded solutions
> Openmoko, Openpandora, Arduino
> Hey, watch out!!! There's a linux in your pocket!!!
>
> 2009/11/27 Thomas HOCEDEZ <thomas.hoce...@free.fr>:
>> Thomas HOCEDEZ wrote:
>>>
>>> Hi David,
>>>
>>> Can you share your scripts & configs to do the same in French (and
>>> other languages)?
>>> Thanks
>>>
>>> Thomas
>>
>> As the mailing list seems to be broken (or users started hibernating for
>> winter...) I found out by myself how to compile things step by step.
>> I'm now rendering the French Wikipedia. As it started a few minutes ago,
>> the result will be available during the weekend (I hope).
>>
>> I'll also post how I managed to do it! (I'm at the office for now, and
>> I'm leaving...)
>>
>> Regards to you all!
>>
>> Thomas
>>
> _______________________________________________
> Openmoko community mailing list
> community@lists.openmoko.org
> http://lists.openmoko.org/mailman/listinfo/community