On Thu, Jul 2, 2015 at 9:57 AM, Joshua Valdez <jd...@case.edu> wrote:

> Hi, so I figured out my problem with this code and it's working great,
> but it's still taking a very long time to process... I was wondering if
> there was a way to do this with just regular expressions instead of
> parsing the text with lxml...
Be careful: there are assumptions here that may not be true. To be
clear, regular expressions are not magic. Just because something uses
regular expressions does not make it fast, nor are regular expressions
appropriate for parsing tree-structured content. For a humorous
discussion of this, see:

  http://blog.codinghorror.com/parsing-html-the-cthulhu-way/

> the idea would be to identify a <page> tag and then move to the next
> line of a file to see if there is a match between the title text and
> the pages in my pages file.

This makes another assumption about the input that isn't necessarily
true. Just because you see tags and content on separate lines now
doesn't mean that this won't change in the future. XML tree structure
does not depend on newlines. Don't try parsing XML files line by line.

> So again, my pages are just an array like: [Anarchism, Abrahamic
> Mythology, ...] I'm a little confused as to how to even start this.
> My initial idea was something like this, but I'm not sure how to
> execute it:
>
> wiki --> XML file
> page_titles -> array of strings corresponding to titles
>
> tag = r'(<page>)'
> wiki = wiki.readlines()
>
> for line in wiki:
>     page = re.search(tag, line)
>     if page:
>         ...(I'm not sure what to do here)
>
> is it possible to look ahead in a loop to discover other lines and
> then backtrack? I think this may be the solution, but again I'm not
> sure how I would execute such a command structure...

You should probably abandon this line of thinking. From your initial
problem description, the approach you have now should be computationally
*linear* in the size of the input, so maybe your program is fine.

The fact that your program was exhausting your computer's entire memory
in your initial attempt suggests that your input file is large. But how
large? It is much more likely that your program is slow simply because
your input is honking huge. To support this hypothesis, we need more
knowledge about the input size. How large is your input file?
How large is your collection of page_titles?

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor