On Sep 9, 2:04 am, Terry Reedy <tjre...@udel.edu> wrote:
> On 9/8/2011 9:09 PM, papu wrote:
>
> > Hello, I have a data file (un-structed messy file) from which I have
> > to scrub specific list of words (delete words).
>
> > Here is what I am doing but with no result:
>
> > infile = "messy_data_file.txt"
> > outfile = "cleaned_file.txt"
>
> > delete_list = ["word_1","word_2"....,"word_n"]
> > new_file = []
> > fin=open(infile,"")
> > fout = open(outfile,"w+")
> > for line in fin:
> >     for word in delete_list:
> >         line.replace(word, "")
> >     fout.write(line)
> > fin.close()
> > fout.close()
>
> If you have very many words (and you will need all possible forms of
> each word if you do exact matches), The following (untested and
> incomplete) should run faster.
>
> delete_set = {"word_1","word_2"....,"word_n"}
> ...
> for line in fin:
>     for word in line.split()
>         if word not in delete_set:
>             fout.write(word) # also write space and nl.
>
> Depending on what your file is like, you might be better with
> re.split('(\W+)', line). An example from the manual:
> >>> re.split('(\W+)', '...words, words...')
> ['', '...', 'words', ', ', 'words', '...', '']
>
> so all non-word separator sequences are preserved and written back out
> (as they will not match delete set).
>
> --
> Terry Jan Reedy
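
For what it's worth, the original loop writes the lines back unchanged because strings are immutable: line.replace(word, "") returns a new string and the result is thrown away (and open(infile, "") is not a valid mode). Below is a rough sketch of how the re.split suggestion above might be filled in. The names scrub_line and scrub_file are made up for illustration, and the two-word delete set is only a placeholder, so treat it as a starting point rather than a finished tool.

```python
import re

def scrub_line(line, delete_set):
    """Return line with every token found in delete_set removed."""
    # The capturing group makes re.split keep the separators in the result,
    # so punctuation, spaces and the newline are written back as they were.
    tokens = re.split(r'(\W+)', line)
    return ''.join(tok for tok in tokens if tok not in delete_set)

def scrub_file(infile, outfile, delete_set):
    """Copy infile to outfile, dropping the words in delete_set."""
    with open(infile) as fin, open(outfile, 'w') as fout:
        for line in fin:
            fout.write(scrub_line(line, delete_set))

if __name__ == '__main__':
    delete_set = {"word_1", "word_2"}  # placeholder words, not the real list
    print(repr(scrub_line("keep word_1 this, word_2 too\n", delete_set)))
```

Note that the separators on both sides of a deleted word survive, so removed words leave doubled spaces behind. If that matters, the whitespace could be collapsed in a second pass.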
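re.sub is handy too:

    import re

    delete_list = ('the', 'rain', 'in', 'spain')
    # Group the alternatives so the \b word boundaries apply to every
    # word in the list, not only to the first and last one.
    regex = re.compile(r'\b(?:' + '|'.join(delete_list) + r')\b')

    infile = 'messy'
    with open(infile, 'r') as f:
        for l in f:
            # Strip the newline so print does not double it.
            print(regex.sub('', l.rstrip('\n')))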
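
If the word list comes from somewhere less controlled, the re.sub idea might be made a bit more defensive along these lines. This is only a sketch: build_scrubber and its ignore_case parameter are hypothetical names, and it assumes every word starts and ends with a word character (otherwise \b will not behave as intended).

```python
import re

def build_scrubber(words, ignore_case=False):
    """Compile one regex that matches any of the words as a whole word."""
    # re.escape guards against words containing regex metacharacters.
    # Longest alternatives go first so they are tried before shorter ones.
    alternatives = sorted((re.escape(w) for w in words), key=len, reverse=True)
    flags = re.IGNORECASE if ignore_case else 0
    # (?:...) groups the alternation so \b applies to each alternative.
    return re.compile(r'\b(?:' + '|'.join(alternatives) + r')\b', flags)

if __name__ == '__main__':
    regex = build_scrubber(('the', 'rain', 'in', 'spain'))
    print(regex.sub('', 'the rain in spain stays mainly in the plain'))
```

Because of the word boundaries, "in" is removed as a word but left alone inside "mainly" and "plain". The spaces around removed words are kept, so the output has runs of blanks where the words used to be.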