Wikipedia XML Dump

2014-01-28 Thread kevingloveruk
Hi

I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use 
Python and the SAX module (running under Windows 7) to carry out off-line 
phrase-searches of Wikipedia and to return a count of the number of hits for 
each search. Typical phrase-searches might be "of the dog" and "dog's".

I have some limited prior programming experience (from many years ago) and I am 
currently learning Python from a course of YouTube tutorials. Before I get much 
further, I wanted to ask:

Is what I am trying to do actually feasible?

Are there any example programs or code snippets that would help me?

Any advice or guidance would be gratefully received.

Best regards,
Kevin Glover


Re: Wikipedia XML Dump

2014-01-28 Thread Rustom Mody
On Tuesday, January 28, 2014 5:15:32 PM UTC+5:30, Kevin Glover wrote:
 Hi

 I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to 
 use Python and the SAX module (running under Windows 7) to carry out off-line 
 phrase-searches of Wikipedia and to return a count of the number of hits for 
 each search. Typical phrase-searches might be "of the dog" and "dog's".

 I have some limited prior programming experience (from many years ago) and I 
 am currently learning Python from a course of YouTube tutorials. Before I get 
 much further, I wanted to ask:

 Is what I am trying to do actually feasible?

Can't really visualize what you've got...
When you 'download' Wikipedia, what do you get?
One 40GB file?
A zillion files?
Some other database format?

Another point:
sax is painful to use compared to full lxml (dom)
But then sax is the only choice when files cross a certain size
That's why the above question.

Also you may want to explore nltk
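
For a rough feel of the sax route, here's a minimal sketch of a streaming
phrase counter. It assumes the article bodies sit in <text> elements (as in
the usual pages-articles dump); the file name and the phrases are just
placeholders:

import xml.sax

class PhraseCounter(xml.sax.ContentHandler):
    def __init__(self, phrases):
        xml.sax.ContentHandler.__init__(self)
        self.phrases = [p.lower() for p in phrases]
        self.counts = dict.fromkeys(self.phrases, 0)
        self.in_text = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == 'text':            # one <text> element per article
            self.in_text = True
            self.chunks = []

    def characters(self, content):
        if self.in_text:
            self.chunks.append(content)

    def endElement(self, name):
        if name == 'text':
            page = ''.join(self.chunks).lower()
            for p in self.phrases:
                self.counts[p] += page.count(p)
            self.in_text = False

handler = PhraseCounter(["of the dog", "dog's"])
xml.sax.parse('enwiki-pages-articles.xml', handler)  # placeholder file name
print(handler.counts)

Memory stays flat because only one article's text is held at a time, but a
full pass over 40+ GB will still take a while for every search.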


Re: Wikipedia XML Dump

2014-01-28 Thread Skip Montanaro
 Another point:
 sax is painful to use compared to full lxml (dom)
 But then sax is the only choice when files cross a certain size
 That's why the above question.

No matter what the choice of XML parser, I suspect you'll want to
convert it to some other form for processing.

Skip


Re: Wikipedia XML Dump

2014-01-28 Thread Kevin Glover
Thanks for the comments, guys. The Wikipedia download is a single XML document, 
43.1GB. Any further thoughts?

Kevin


Re: Wikipedia XML Dump

2014-01-28 Thread Burak Arslan
hi,

On 01/29/14 00:31, Kevin Glover wrote:
 Thanks for the comments, guys. The Wikipedia download is a single XML 
 document, 43.1GB. Any further thoughts?



in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
be your only option.
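
Roughly, the iterparse route from that tutorial might look like this for the
phrase-count job (the namespace URI is a guess, so check the xmlns on the
<mediawiki> root of your dump; the file name and phrases are placeholders):

from lxml import etree

NS = '{http://www.mediawiki.org/xml/export-0.8/}'   # guessed; check your dump
phrases = ["of the dog", "dog's"]
counts = dict.fromkeys(phrases, 0)

context = etree.iterparse('enwiki-pages-articles.xml',   # placeholder name
                          events=('end',), tag=NS + 'text')
for event, elem in context:
    page = (elem.text or '').lower()
    for p in phrases:
        counts[p] += page.count(p)
    # Throw away what has already been parsed, or the whole tree piles up in RAM.
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

print(counts)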

hth,
burak


Re: Wikipedia XML Dump

2014-01-28 Thread alex23

On 28/01/2014 9:45 PM, kevinglove...@gmail.com wrote:

 I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to
 use Python and the SAX module (running under Windows 7) to carry out off-line
 phrase-searches of Wikipedia and to return a count of the number of hits for
 each search. Typical phrase-searches might be "of the dog" and "dog's".

 I have some limited prior programming experience (from many years ago) and I am
 currently learning Python from a course of YouTube tutorials. Before I get much
 further, I wanted to ask:

 Is what I am trying to do actually feasible?


Rather than parsing through 40GB+ every time you need to do a search, 
you should get better performance using an XML database which will allow 
you to do queries directly on the xml data.


http://basex.org/ is one such db, and comes with a Python API:

http://docs.basex.org/wiki/Clients
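
To give an idea of what that looks like from Python, here's a very rough
sketch with the BaseX client from the Clients page above. It assumes a BaseX
server is running with the default credentials and that the dump has already
been loaded into a database called 'wiki'; the module path and exact API may
differ between client versions:

from BaseXClient import BaseXClient   # client module from the page above

session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')
try:
    # XQuery: count the <text> elements containing the phrase, i.e.
    # article-level hits rather than total occurrences.
    q = session.query(
        "count(db:open('wiki')//*:text[contains(lower-case(.), 'of the dog')])")
    print(q.execute())
    q.close()
finally:
    session.close()

The up-front cost is loading 40+ GB into the database once; after that each
search is a query instead of a full pass over the file.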



Re: Wikipedia XML Dump

2014-01-28 Thread Rustom Mody
On Wednesday, January 29, 2014 4:17:47 AM UTC+5:30, Burak Arslan wrote:
 hi,

 On 01/29/14 00:31, Kevin Glover wrote:
  Thanks for the comments, guys. The Wikipedia download is a single XML 
  document, 43.1GB. Any further thoughts?

 in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
 be your only option.

Further thoughts?? Just a combo of what Burak and Skip said:
I'd explore a thin veneer of event-driven lxml to get from the 40 GB monolithic xml
to something (more) digestible to nltk.
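
Something along those lines, perhaps. The sketch below assumes a hypothetical
page_texts() generator that yields plain article text (e.g. built on the
iterparse snippet earlier in the thread) and that nltk's 'punkt' tokenizer
data has been downloaded:

from nltk.tokenize import word_tokenize   # needs nltk.download('punkt')
from nltk.util import ngrams

def phrase_hits(page_texts, phrase):
    """Count token-level matches of `phrase` across an iterable of page texts."""
    target = tuple(word_tokenize(phrase.lower()))
    hits = 0
    for text in page_texts:
        tokens = word_tokenize(text.lower())
        hits += sum(1 for gram in ngrams(tokens, len(target)) if gram == target)
    return hits

Matching on tokens rather than raw substrings means a search for "of the dog"
won't also count "of the dogs" or "of the doghouse".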