Re: Wikipedia XML Dump
On Tuesday, January 28, 2014 5:15:32 PM UTC+5:30, Kevin Glover wrote: Hi, I have downloaded and unzipped the XML dump of Wikipedia (40+ GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase searches might be "of the dog" and "dog's". I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask: is what I am trying to do actually feasible?

Can't really visualize what you've got... When you 'download' Wikipedia, what do you get? One 40 GB file? A zillion files? Some other database format?

Another point: SAX is painful to use compared to full lxml (DOM), but SAX is the only choice when files cross a certain size. That's why the question above. Also, you may want to explore nltk.
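For reference, a minimal sketch of the kind of SAX phrase counter Kevin describes. It assumes the article text sits in <text> elements (as in the MediaWiki export schema), does a plain case-insensitive substring count, and the filename is only a guess:

    import xml.sax

    class PhraseCounter(xml.sax.ContentHandler):
        """Count occurrences of a phrase inside <text> elements."""
        def __init__(self, phrase):
            xml.sax.ContentHandler.__init__(self)
            self.phrase = phrase.lower()
            self.in_text = False
            self.chunks = []
            self.count = 0

        def startElement(self, name, attrs):
            if name == "text":
                self.in_text = True
                self.chunks = []

        def characters(self, content):
            # characters() may be called several times per text node
            if self.in_text:
                self.chunks.append(content)

        def endElement(self, name):
            if name == "text":
                self.in_text = False
                self.count += "".join(self.chunks).lower().count(self.phrase)

    handler = PhraseCounter("of the dog")
    xml.sax.parse("enwiki-pages-articles.xml", handler)  # filename is assumed
    print(handler.count)

Streaming like this keeps memory flat, but every search re-reads the whole 40+ GB file, which is why the later replies suggest converting or indexing the data first.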
Re: Wikipedia XML Dump
Another point: SAX is painful to use compared to full lxml (DOM), but SAX is the only choice when files cross a certain size. That's why the question above.

No matter what the choice of XML parser, I suspect you'll want to convert it to some other form for processing.

Skip
Re: Wikipedia XML Dump
Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1 GB. Any further thoughts?

Kevin
Re: Wikipedia XML Dump
Hi,

On 01/29/14 00:31, Kevin Glover wrote: Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1 GB. Any further thoughts?

In that case, http://lxml.de/tutorial.html#event-driven-parsing seems to be your only option.

hth,
Burak
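To make that concrete, here is a sketch of the event-driven style that tutorial describes: iterparse on the page elements, clearing each one as it is finished so the 43 GB file never has to fit in memory. The namespace URI and filename are assumptions (they vary between dump versions; check the <mediawiki> root element of your file):

    from lxml import etree

    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # assumed; check your dump
    count = 0
    for event, elem in etree.iterparse("enwiki-pages-articles.xml",
                                       tag=NS + "page"):
        text = elem.findtext("{0}revision/{0}text".format(NS)) or ""
        count += text.lower().count("of the dog")
        elem.clear()                     # drop this page's subtree
        while elem.getprevious() is not None:
            del elem.getparent()[0]      # and the pages already processed
    print(count)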
Re: Wikipedia XML Dump
On 28/01/2014 9:45 PM, kevinglove...@gmail.com wrote: I have downloaded and unzipped the XML dump of Wikipedia (40+ GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase searches might be "of the dog" and "dog's". I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask: is what I am trying to do actually feasible?

Rather than parsing through 40+ GB every time you need to do a search, you should get better performance using an XML database, which will allow you to run queries directly on the XML data. http://basex.org/ is one such DB, and it comes with a Python API: http://docs.basex.org/wiki/Clients
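Roughly what a query would look like with the bundled Python client (BaseXClient.py from the docs above), assuming the dump has already been imported into a database and the default localhost/admin settings apply:

    from BaseXClient import BaseXClient

    # Connection details are the BaseX defaults; adjust for your install.
    session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')
    try:
        # Count pages whose text contains the phrase; the XQuery runs inside
        # BaseX against its index instead of re-parsing 43 GB in Python.
        query = session.query("count(//text[contains(., 'of the dog')])")
        print(query.execute())
        query.close()
    finally:
        session.close()

Note that this counts matching pages rather than individual hits; counting every occurrence takes a slightly more involved XQuery, but the point stands: the parse cost is paid once at import time.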
Re: Wikipedia XML Dump
On Wednesday, January 29, 2014 4:17:47 AM UTC+5:30, Burak Arslan wrote: hi, On 01/29/14 00:31, Kevin Glover wrote: Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1 GB. Any further thoughts? In that case, http://lxml.de/tutorial.html#event-driven-parsing seems to be your only option.

Further thoughts?? Just a combo of what Burak and Skip said: I'd explore a thin veneer of event-driven lxml to get from the 40 GB monolithic XML to something (more) digestible to nltk.
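One possible shape for that combination, sketched under the same assumptions as above (namespace URI, filename), using nltk purely for tokenisation so that a search for "dog's" matches whole tokens rather than raw substrings. nltk's punkt tokeniser data has to be downloaded once:

    from lxml import etree
    import nltk  # run nltk.download('punkt') once, for word_tokenize

    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # assumed; check your dump

    def page_texts(path):
        """Stream the wikitext of each page without keeping the tree in memory."""
        for event, elem in etree.iterparse(path, tag=NS + "page"):
            yield elem.findtext("{0}revision/{0}text".format(NS)) or ""
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    def count_phrase(text, phrase_tokens):
        tokens = [t.lower() for t in nltk.word_tokenize(text)]
        n = len(phrase_tokens)
        return sum(1 for i in range(len(tokens) - n + 1)
                   if tokens[i:i + n] == phrase_tokens)

    phrase = [t.lower() for t in nltk.word_tokenize("dog's")]
    total = sum(count_phrase(text, phrase)
                for text in page_texts("enwiki-pages-articles.xml"))
    print(total)

Tokenising every page this way is slow for a one-off count, but once the text has been pulled out of the XML it can be saved in a more digestible form, as Skip suggested, and re-searched cheaply.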