Please use reply-to-all: I'm not in front of a keyboard at the moment. Others on the mailing list should be able to help.

On Jun 30, 2015 6:13 PM, "Joshua Valdez" <jd...@case.edu> wrote:
> Hi Danny,
>
> So I tried the code snippet you pointed me to, and I'm not getting any
> output.
>
> I tried playing around with the code, and when I ran
>
>     doc = etree.iterparse(wiki)
>     for _, node in doc:
>         print node
>
> I got output like:
>
>     <Element '{http://www.mediawiki.org/xml/export-0.10/}sitename' at 0x100602410>
>     <Element '{http://www.mediawiki.org/xml/export-0.10/}dbname' at 0x1006024d0>
>     <Element '{http://www.mediawiki.org/xml/export-0.10/}base' at 0x100602590>
>     <Element '{http://www.mediawiki.org/xml/export-0.10/}generator' at 0x100602710>
>     <Element '{http://www.mediawiki.org/xml/export-0.10/}case' at 0x100602750>
>     <Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at 0x1006027d0>
>     <Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at 0x100602810>
>     <Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at 0x100602850>
>     <Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at 0x100602890>
>     <Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at 0x1006028d0>
>
> so the .tag attribute includes the whole namespace URI in the string.
> Do you know why this happens and how I can get around it?
>
> Joshua Valdez
> Computational Linguist : Cognitive Scientist
>
> (440)-231-0479
> jd...@case.edu | j...@uw.edu | jo...@armsandanchors.com
> http://www.linkedin.com/in/valdezjoshua/
>
> On Tue, Jun 30, 2015 at 7:27 PM, Danny Yoo <d...@hashcollision.org> wrote:
>
>> On Tue, Jun 30, 2015 at 8:10 AM, Joshua Valdez <jd...@case.edu> wrote:
>> > So I wrote this script to go over a large wiki XML dump and pull out
>> > the pages I want. However, every time I run it the kernel displays
>> > 'Killed'. I'm assuming this is a memory issue after reading around,
>> > but I'm not sure where the memory problem is in my script, or whether
>> > there are any tricks to reduce the virtual memory usage.
>>
>> Yes. Unfortunately, this is a common problem with representing a
>> potentially large stream of data as a single XML document. The
>> straightforward approach of reading the whole XML document into
>> memory at once doesn't work when files get large.
>>
>> We can work around this by using a parser that knows how to read
>> chunks of the document progressively, in a streaming or "pulling"
>> approach. I don't think Beautiful Soup knows how to do this, but if
>> you're working with XML, there are other libraries with a similar
>> feel that can work in a streaming way.
>>
>> There was a thread about this roughly a year ago with good
>> references, the "XML Parsing from XML" thread:
>>
>>     https://mail.python.org/pipermail/tutor/2014-May/101227.html
>>
>> Stefan Behnel's contribution to that thread is probably the most
>> helpful for seeing example code:
>>
>>     https://mail.python.org/pipermail/tutor/2014-May/101270.html
>>
>> I think you'll probably want to use xml.etree.cElementTree; I expect
>> the code for your situation will look something like this (untested,
>> though!):
>>
>> ###############################
>> from xml.etree.cElementTree import iterparse, tostring
>>
>> ## ... later in your code, something like this ...
>>
>> doc = iterparse(wiki)
>> for _, node in doc:
>>     if node.tag == "page":
>>         title = node.find("title").text
>>         if title in page_titles:
>>             print tostring(node)
>>         node.clear()
>> ###############################
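One wrinkle with the snippet above, and the reason Joshua's follow-up at the top of this message shows no output: the MediaWiki export declares a default namespace, so ElementTree reports every tag in "{namespace-URI}tag" form, and the comparison node.tag == "page" never matches. A minimal, untested variation that compares against the namespace-qualified names instead, reusing the wiki and page_titles variables assumed from the original script:

###############################
from xml.etree.cElementTree import iterparse, tostring

# Tags arrive namespace-qualified, e.g.
# '{http://www.mediawiki.org/xml/export-0.10/}page' rather than plain
# 'page', so build the qualified names once and compare against those.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

doc = iterparse(wiki)
for _, node in doc:
    if node.tag == NS + "page":
        title = node.find(NS + "title").text
        if title in page_titles:
            print tostring(node)
        node.clear()
###############################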
>> Also, don't use "del" unless you know what you're doing. It's not
>> especially helpful in your scenario, and it clutters up the code.
>>
>> Let us know if this works out or if you're having difficulty; I'm
>> sure folks would be happy to help out.
>>
>> Good luck!
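A closing note on memory: node.clear() empties each element, but the cleared elements themselves stay attached to the document root, so a very large dump can still grow without bound. A common iterparse idiom, sketched here untested and with the same assumed wiki and page_titles, is to grab the root element from the first "start" event and clear it as you go:

###############################
from xml.etree.cElementTree import iterparse, tostring

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Ask for "start" events as well, so the first event hands us the root
# element before any of its children have been parsed.
context = iter(iterparse(wiki, events=("start", "end")))
_, root = next(context)

for event, node in context:
    if event == "end" and node.tag == NS + "page":
        title = node.find(NS + "title").text
        if title in page_titles:
            print tostring(node)
        # Clearing the root drops references to every already-processed
        # <page>, keeping memory use roughly flat across the whole dump.
        root.clear()
###############################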