On Wednesday, January 29, 2014 4:17:47 AM UTC+5:30, Burak Arslan wrote:
> hi,
> On 01/29/14 00:31, Kevin Glover wrote:
> > Thanks for the comments, guys. The Wikipedia download is a single XML
> > document, 43.1GB. Any further thoughts?
>
> in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
> be your only option.

Further thoughts? Just a combination of what Burak and Skip said: I'd explore a thin veneer of event-driven lxml to get from the 40 GB monolithic XML to something (more) digestible to nltk.
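
For what it's worth, here is a minimal sketch of what such a veneer might look like. It assumes the dump wraps each article in a <page> element containing <title> and <text> children (check this against your file), and iter_pages and local_name are hypothetical helper names of mine, not part of lxml. The fallback import is only there so the sketch also runs where lxml is not installed; the stdlib parser cannot drop already-processed siblings, so its memory use will creep up on a file this size.

```python
import io

try:
    from lxml import etree  # the library suggested in this thread
except ImportError:
    # Assumption: stdlib fallback, which offers a compatible iterparse().
    import xml.etree.ElementTree as etree


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) per <page> without building the whole tree."""
    for _, elem in etree.iterparse(source, events=("end",)):
        if not isinstance(elem.tag, str) or local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for node in elem.iter():
            if not isinstance(node.tag, str):
                continue  # skip comments / processing instructions
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text or ""
        yield title, text
        # Free the finished page so memory stays roughly flat.
        elem.clear()
        if hasattr(elem, "getprevious"):  # lxml only
            while elem.getprevious() is not None:
                del elem.getparent()[0]


# Tiny made-up sample standing in for the real dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
<siteinfo><sitename>Wikipedia</sitename></siteinfo>
<page><title>A</title><revision><text>alpha</text></revision></page>
<page><title>B</title><revision><text>beta</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

For the real thing you would pass the dump's filename (or an open binary file) instead of the BytesIO object, and hand each text to nltk, or write it out one article per file, inside the loop rather than collecting everything into a list.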