On Tuesday, January 28, 2014 5:15:32 PM UTC+5:30, Kevin Glover wrote:
> Hi
> I have downloaded and unzipped the XML dump of Wikipedia (40+ GB). I want to
> use Python and the SAX module (running under Windows 7) to carry out off-line
> phrase searches of Wikipedia and to return a count of the number of hits for
> each search. Typical phrase searches might be "of the dog" and "dog's".
>
> I have some limited prior programming experience (from many years ago) and I
> am currently learning Python from a course of YouTube tutorials. Before I get
> much further, I wanted to ask:
>
> Is what I am trying to do actually feasible?

Can't really visualize what you've got... When you 'download' Wikipedia, what
do you get? One 40 GB file? A zillion files? Some other database format?

Another point: SAX is painful to use compared to full lxml (DOM-style). But
then SAX is the only choice when files cross a certain size. That's why the
above question.

Also you may want to explore NLTK.
--
https://mail.python.org/mailman/listinfo/python-list
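For what it's worth, a minimal sketch of the SAX approach with the stdlib's
xml.sax. It assumes the standard MediaWiki dump layout, where article wikitext
sits inside `<text>` elements; the phrase matching here is plain literal
substring counting, nothing cleverer. The demo parses a tiny inline sample;
for the real 40+ GB dump you'd point `xml.sax.parse()` at the file instead
(the dump filename below is hypothetical).

```python
import re
import xml.sax


class PhraseCounter(xml.sax.ContentHandler):
    """Stream XML and count literal phrase occurrences inside <text> elements."""

    def __init__(self, phrases):
        super().__init__()
        self.phrases = phrases
        self.counts = {p: 0 for p in phrases}
        self._in_text = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "text":          # MediaWiki dumps keep wikitext in <text>
            self._in_text = True
            self._buf = []

    def characters(self, content):
        # SAX may deliver one element's text in several chunks, so buffer it
        if self._in_text:
            self._buf.append(content)

    def endElement(self, name):
        if name == "text":
            self._in_text = False
            page = "".join(self._buf)
            for p in self.phrases:
                self.counts[p] += len(re.findall(re.escape(p), page))


# Tiny inline demo. For the real dump you'd do something like:
#   xml.sax.parse("enwiki-latest-pages-articles.xml", handler)  # hypothetical name
sample = b"""<mediawiki><page><text>The tail of the dog wagged.
A dog's life. Another dog's bone.</text></page></mediawiki>"""

handler = PhraseCounter(["of the dog", "dog's"])
xml.sax.parseString(sample, handler)
print(handler.counts)   # {'of the dog': 1, "dog's": 2}
```

Since the parser never holds more than one page's text in memory at a time,
this works regardless of whether the dump is one 40 GB file or many smaller
ones; you'd just call `xml.sax.parse()` once per file and keep the same
handler.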