Hi
I have downloaded and unzipped the XML dump of Wikipedia (40+ GB). I want to use
Python and the SAX module (running under Windows 7) to carry out off-line
phrase-searches of Wikipedia and to return a count of the number of hits for
each search. Typical phrase-searches might be of the dog and
On Tuesday, January 28, 2014 5:15:32 PM UTC+5:30, Kevin Glover wrote:
Hi
I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to
use Python and the SAX module (running under Windows 7) to carry out off-line
phrase-searches of Wikipedia and to return a count of the number
Another point:
SAX is painful to use compared to full lxml (DOM),
but SAX is the only choice once files cross a certain size.
That's why the question above.
No matter what the choice of XML parser, I suspect you'll want to
convert it to some other form for processing.
Skip
--
Thanks for the comments, guys. The Wikipedia download is a single XML document,
43.1GB. Any further thoughts?
Kevin
--
https://mail.python.org/mailman/listinfo/python-list
hi,
On 01/29/14 00:31, Kevin Glover wrote:
Thanks for the comments, guys. The Wikipedia download is a single XML
document, 43.1GB. Any further thoughts?
in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
be your only option.
hth,
burak
--
On 28/01/2014 9:45 PM, kevinglove...@gmail.com wrote:
I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX
module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a
count of the number of hits for each
On Wednesday, January 29, 2014 4:17:47 AM UTC+5:30, Burak Arslan wrote:
hi,
On 01/29/14 00:31, Kevin Glover wrote:
Thanks for the comments, guys. The Wikipedia download is a single XML
document, 43.1GB. Any further thoughts?
in that case, http://lxml.de/tutorial.html#event-driven-parsing