So I wrote this script to go over a large wiki XML dump and pull out the pages I want. However, every time I run it the kernel displays 'Killed'. After reading around I'm assuming this is a memory issue, but I'm not sure where the memory problem is in my script or whether there are any tricks to reduce the virtual memory usage. Here is my code (I assume the problem is in the BeautifulSoup portion, as the pages file is pretty small):
from bs4 import BeautifulSoup
import sys

pages_file = open('pages_file.txt', 'r')

#Preprocessing
pages = pages_file.readlines()
pages = map(lambda s: s.strip(), pages)

page_titles = []
for item in pages:
    item = ''.join([i for i in item if not i.isdigit()])
    item = ''.join([i for i in item if ord(i) < 126 and ord(i) > 31])
    item = item.replace(" ", "")
    item = item.replace("_", " ")
    page_titles.append(item)
#####################################

with open(sys.argv[1], 'r') as wiki:
    soup = BeautifulSoup(wiki)
    wiki.closed

wiki_page = soup.find_all("page")

del soup

for item in wiki_page:
    title = item.title.get_text()
    if title in page_titles:
        print item
        del title
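While reading around I came across streaming XML parsing with xml.etree.ElementTree.iterparse as a way to avoid loading the whole dump into memory at once. Below is a rough, untested sketch of what I think that would look like for my case; the namespace URI is a guess (it would have to match the xmlns attribute on the dump's <mediawiki> root element). Does this look like the right direction?

import sys
import xml.etree.ElementTree as ET

# Build the set of wanted titles (same cleanup as above, but using a set
# so membership tests are fast).
page_titles = set()
with open('pages_file.txt', 'r') as pages_file:
    for line in pages_file:
        item = ''.join(c for c in line.strip() if not c.isdigit())
        item = ''.join(c for c in item if 31 < ord(c) < 126)
        item = item.replace(" ", "").replace("_", " ")
        page_titles.add(item)

# Assumption: the dump declares a namespace like this on its <mediawiki>
# root element; the exact URI needs to be copied from the actual file.
NS = '{http://www.mediawiki.org/xml/export-0.10/}'

# iterparse streams the file element by element instead of building the
# whole tree in memory the way BeautifulSoup does.
for event, elem in ET.iterparse(sys.argv[1], events=('end',)):
    if elem.tag == NS + 'page':
        title = elem.findtext(NS + 'title')
        if title in page_titles:
            print(title)
        # Throw away the page's children once we're done with them so the
        # parsed elements don't pile up in memory.
        elem.clear()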