Don't be sorry, everyone is allowed a professional and a personal life too ;-)
As I'm not familiar with Python, and given the dead end I was in, retrying without parsing the faulty article was the solution I had chosen too, but I couldn't manage to do it. Thanks for showing me.

The French Wikipedia parsing crashed just after "Tintin and Milou" (I don't know if you know that comic strip, it might have a different name in Spanish; it's about a reporter leading investigations). I'll read the Wikipedia page to find what could be wrong with it. The faulty page is quite big and is a "discussion" page, not really fun to read.

I found what might be wrong: it contains the Spanish inverted question mark (¿, an upside-down ?). Other pages (so far) contain a lot of strange uppercase accented letters (uppercase S with an acute accent, E too...). If this can help...

With your help, a first (but incomplete) release will be generated this weekend! (My friends won't look at me as a silly guy reading the US Wikipedia in the closet.)

Good night, thanks again, my computer will do the rest (and I'll take a look at the problems tomorrow).

Cheers,
Thomas

On Friday 27 November 2009 at 23:13 +0100, David Reyes Samblas Martinez wrote:
> Sorry for the wait Thomas,
> I was working to solve the broken pipe issue that stops the parser
> when it finds an error.
> I have applied a quick and dirty workaround
> using the try-catch technique, and now the process does not stop: it just
> skips the faulty article and keeps going :) It logs the faulty ones in
> a text file (title and position) for later forensics, but my first
> guess is that this is not a UTF-8 encoding issue but rather an
> unexpected formatting tag the PHP parser doesn't know how to deal with.
> I am currently parsing the German Wikipedia, with more than 1.3 million articles:
>
> Count: 1043000
> Failing count: 2
>
> and it keeps going. I suppose we can sacrifice two articles to have
> one million available now :)
>
> As you requested, I uploaded my working compiled tools [1], but without
> any XML sources; it's about 113 MB. If you already have working tools on
> your system, you just have to replace
> host-tools/offline-renderer/ArticleParser.py with the one attached to this
> mail, and you can forget about crying like a child whose ice cream has
> fallen on the floor when, after more than 24 hours spent parsing hundreds
> of thousands of articles, you see an ugly Python error backtrace
> blablabla instead of your desired file :)
>
> By the way, faultyarticles.txt is saved in the same
> host-tools/offline-renderer directory (I'm too lazy to add a
> parameter to change that, so I hardcoded the name of the file;
> yes... don't waste typing on correcting that bad habit, I know).
>
> If you are curious which articles on the German wiki are causing trouble
> in dewiki-latest-pages-articles.xml (dated 2009-11-20):
>
> ~Storck Bicycle
> 832673
> ~Musculus serratus posterior inferior
> 857334
>
> Regards. I hope to upload the German Wikipedia on Sunday, so it
> will be available on Monday. Sorry for the wait, but my Asymmetric DSL
> is very asymmetric, and uploading 1.5-2 GB (expected file size) will take
> a bunch of hours.
>
> For those who want to compile their own, go for it :) The
> Quickreference in the doc directory of the source is all you need to
> start working. Just remember that if you have a 64-bit system you
> will have to follow the 64-bit method to compile the tools.
>
> Regards
> [1] http://tuxbrain.org/downloads/wikireader/wikireaderbinaries20091127_dsamblas_modified_trycatch.tar.bz2
> David Reyes Samblas Martinez
> http://www.tuxbrain.com
> Open ultraportable & embedded solutions
> Openmoko, Openpandora, Arduino
> Hey, watch out!!! There's a linux in your pocket!!!
>
> 2009/11/27 Thomas HOCEDEZ <thomas.hoce...@free.fr>:
> > Thomas HOCEDEZ wrote:
> >>
> >> Hi David,
> >>
> >> Can you share your scripts & configs to do the same in French (and other
> >> languages)?
> >> Thanks
> >>
> >> Thomas
> >
> > As the mailing list seems to be broken (or users started hibernating for
> > winter...), I found by myself the way to compile things step by step.
> > I'm now rendering the French Wikipedia. As it started a few minutes ago,
> > the result will be available during the weekend (I hope).
> >
> > I'll also post the way I managed to do so! (I'm at the office for now, and
> > I'm leaving...)
> >
> > Regards to you all!
> >
> > Thomas

_______________________________________________
Openmoko community mailing list
community@lists.openmoko.org
http://lists.openmoko.org/mailman/listinfo/community
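
For readers curious about the workaround David describes in the quoted message (skip the faulty article, record its title and position, keep going), here is a minimal Python sketch. It is not the actual ArticleParser.py change, which was sent as an attachment and does not appear in this thread. The names `render_all`, `render` and `articles` are hypothetical, and the `~title` / position log layout is only inferred from the two entries quoted above.

```python
import io


def render_all(articles, render, log_path="faultyarticles.txt"):
    """Render every article; skip and log the ones that raise.

    `articles` is assumed to be an iterable of (title, text) pairs and
    `render` any callable that raises on input it cannot handle.
    Returns (rendered_count, failing_count).
    """
    rendered = 0
    failing = 0
    with io.open(log_path, "a", encoding="utf-8") as log:
        for position, (title, text) in enumerate(articles):
            try:
                render(title, text)
            except Exception:
                # Do not let one bad article abort a run of many hours:
                # note its title and position, then move on.
                failing += 1
                log.write(u"~%s\n%d\n" % (title, position))
            else:
                rendered += 1
    return rendered, failing
```

Catching every `Exception` is deliberately blunt, in the same quick-and-dirty spirit as the original workaround; a later pass over the log file can then look at each skipped article individually.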