Sorry for the wait, Thomas. I was working on the broken pipe issue that stops the parser when it hits an error.

I have applied a quick and dirty workaround using a try-catch, so the process no longer stops: it just skips the faulty article and keeps going :) It logs the faulty ones (title and position) in a text file for later forensics. My first guess is that this is not a UTF-8 encoding issue but rather an unexpected formatting tag that the PHP parser doesn't know how to deal with.

I am currently parsing the German Wikipedia, which has more than 1.3 million articles:
Count: 1043000 Failing count: 2

and it keeps going. I suppose we can sacrifice two articles to have one million available now :)

As you requested, I have uploaded my working compiled tools [1], without any XML sources; the archive is about 113 MB. If you already have working tools on your system, you just have to replace host-tools/offline-renderer/ArticleParser.py with the one attached to this mail. Then you can forget about crying like a child whose ice cream has fallen on the floor when, after more than 24 hours and hundreds of thousands of articles parsed, you get an ugly Python error backtrace, blah blah blah, instead of your desired file :)

By the way, faultyarticles.txt is saved in the same host-tools/offline-renderer directory. (I'm too lazy to add a parameter to change that, so I hardcoded the file name. Yes... don't waste your typing on correcting that bad habit, I know.)

In case you are curious about which articles in the German wiki are causing trouble in dewiki-latest-pages-articles.xml (dated 2009-11-20):

~Storck Bicycle 832673
~Musculus serratus posterior inferior 857334

I hope to upload the German Wikipedia on Sunday, so it should be available on Monday. Sorry for the wait, but my asymmetric DSL is very asymmetric, and uploading 1.5-2 GB (the expected file size) will take a good many hours.

For those who want to compile their own, go for it :) The Quickreference in the doc directory of the source is all you need to get started. Just remember that on a 64-bit system you have to follow the 64-bit method to compile the tools.

Regards

[1] http://tuxbrain.org/downloads/wikireader/wikireaderbinaries20091127_dsamblas_modified_trycatch.tar.bz2

David Reyes Samblas Martinez
http://www.tuxbrain.com
Open ultraportable & embedded solutions
Openmoko, Openpandora, Arduino
Hey, watch out!!! There's a linux in your pocket!!!
2009/11/27 Thomas HOCEDEZ <thomas.hoce...@free.fr>:
> Thomas HOCEDEZ wrote:
>>
>> Hi David,
>>
>> Can you share your scripts & configs to do the same in French (and other
>> languages)?
>> Thanks
>>
>> Thomas
>>
>
> As the mailing list seems to be broken (or users started hibernating for
> winter...) I found by myself the way to compile things step by step.
> I'm for now rendering the French Wikipedia. As it started a few minutes ago,
> the result will be available during the weekend (I hope).
>
> I'll also post the way I managed to do so! (I'm at the office for now, and
> I'm leaving...)
>
> Regards to you all!
>
> Thomas
>
ArticleParser.py_dsamblas_modified_try_catch.tar.bz2
Description: BZip2 compressed data
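
For readers who want the idea without unpacking the attachment, the skip-and-log workaround described in the message can be sketched roughly as follows. This is not the actual ArticleParser.py change. It is a minimal illustration that assumes articles arrive as (title, text) pairs and that a failing article raises an exception. The names parse_article and parse_all are hypothetical; only the log file name faultyarticles.txt comes from the message.

```python
FAULTY_LOG = "faultyarticles.txt"  # file name taken from the message


def parse_article(title, text):
    """Hypothetical stand-in for the real per-article parser.

    It fails on a made-up marker so the skip logic can be exercised.
    """
    if "<badtag>" in text:
        raise ValueError("unexpected formatting tag")
    return text.upper()


def parse_all(articles, log_path=FAULTY_LOG, parse=parse_article):
    """Parse every article, skipping and logging the ones that raise.

    `articles` is an iterable of (title, text) pairs (an assumption).
    Returns the list of parsed results and the number of failures.
    """
    parsed = []
    failed = 0
    with open(log_path, "w", encoding="utf-8") as log:
        for position, (title, text) in enumerate(articles, start=1):
            try:
                parsed.append(parse(title, text))
            except Exception as exc:
                # Record title and position, then carry on with the next
                # article instead of aborting the whole run.
                failed += 1
                log.write(f"{title}\t{position}\t{exc!r}\n")
    return parsed, failed
```

Catching every Exception is deliberately blunt, in the same quick and dirty spirit as the workaround: a run that has been going for a day survives one bad article, at the cost of possibly hiding real bugs until the log is read. Passing log_path as a parameter is one way to avoid the hardcoded location.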
_______________________________________________
Openmoko community mailing list
community@lists.openmoko.org
http://lists.openmoko.org/mailman/listinfo/community