Re: prefix search on a large file
I did the test the way you suggested, and it didn't take long to spot the silly mistakes I had made. Thank you. The following is the result of the test with timeit(number=1), using a fresh copy of the list for every iteration:

    0.331462860107
    0.19401717186
    0.186257839203
    0.0762069225311

I tried my recursive function to fix up my big messed-up list. It stops immediately with 'RuntimeError: maximum recursion depth exceeded'.

I hope this trial and error is getting me somewhere good... anyway, thank you.

On 10/13/06, Peter Otten [EMAIL PROTECTED] wrote:
> js wrote:
>> By eliminating list cloning, my function got much faster than before.
>> I really appreciate you, John.
>>
>>     def prefixdel_recursively2(alist):
>>         if len(alist) < 2:
>>             return alist
>>         first = alist.pop(0)
>>         unneeded = [no for no, line in enumerate(alist)
>>                     if line.startswith(first)]
>>         adjust = 0
>>         for i in unneeded:
>>             del alist[i+adjust]
>>             adjust -= 1
>>         return [first] + prefixdel_recursively(alist)
>>
>>     process                  stime
>>     prefixdel_stupidly     : 11.9247150421
>>     prefixdel_recursively  : 14.6975700855
>>     prefixdel_recursively2 : 0.408113956451
>>     prefixdel_by_john      : 7.60227012634
>
> Those are suspicious results. Time it again with number=1, or a fresh
> copy of the data for every iteration.
>
> I also have my doubts whether sorting by length is a good idea. To take
> it to the extreme: what if your data file contains an empty line?
>
> Peter

--
http://mail.python.org/mailman/listinfo/python-list
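[Editor's note: Peter's point about fresh copies can be demonstrated directly. The sketch below is my own illustration, not code from the thread (Python 3; all names are mine). It times a deliberately destructive function two ways: sharing one list across runs, and rebuilding a copy inside the timed statement.]

```python
import timeit

def consume(alist):
    # Deliberately destructive: empties its argument in place,
    # just as prefixdel_recursively2 pops items off its argument.
    while alist:
        alist.pop()

data = list(range(10000))
# Misleading: only the first run does real work; the remaining 19 runs
# see an already-empty list, so the total looks spectacularly fast.
shared = timeit.timeit(lambda: consume(data), number=20)

data = list(range(10000))
# Fair: each run gets a fresh copy. The copying cost is included in the
# measurement, but every run now does the full amount of work.
fair = timeit.timeit(lambda: consume(list(data)), number=20)

print(shared, fair)
```

This is exactly why prefixdel_recursively2 looked ~30x faster than the others in the shared-data timing above: after the first iteration it was handed an already-emptied list.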
prefix search on a large file
Hello, list.

I have lists of sentences in text files that I use to filter out some data. I managed the lists so badly that now they've become, literally, a mess. Let's say a list has the sentences below:

    1. Python has been an important part of Google since the beginning,
       and remains so as the system grows and evolves.
    2. Python has been an important part of Google
    3. important part of Google

As you see, sentence 2 is a prefix of sentence 1, so I don't need to have sentence 1 on the list. (For some reason, it's no problem to have sentence 3. Only a sentence that shares the same prefix is the one I want to remove.)

So I decided to clean up the list. I tried to do this in a simple brute-force manner, like:

    sorted_list = sorted(file('thelist'), key=len)
    for line in sorted_list[:]
        unneeded = [line2 for line2 in sorted_list[:] if line2.startswith(line)]
        sorted_list = list(set(sorted_list) - (unneeded))

This is so slow that it's not much help, because each list is big (more than 100M bytes, about 3 million lines) and I have more than 100 lists.

I'm not familiar with algorithms/data structures and large-scale data processing, so any advice, suggestions and recommendations will be appreciated. Thank you in advance.
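[Editor's note: not the OP's code, but for comparison, the standard approach can be sketched in a few lines of Python (names are my own). Sort the lines lexicographically rather than by length; then any line that extends a shorter line in the list appears immediately after it, so one linear pass suffices: O(n log n) for the sort instead of the O(n^2) all-pairs scan.]

```python
def drop_prefixed(lines):
    """Keep only lines that do not extend another line in the input.

    A line is dropped when some other (shorter) line is a prefix of it,
    i.e. the shorter member of each prefix chain survives -- which is
    what the OP asked for.
    """
    kept = []
    for line in sorted(lines):
        # After a lexicographic sort, every line lying between a prefix p
        # and an extension of p itself starts with p, so the most recently
        # kept line is the only candidate prefix we need to check.
        if kept and line.startswith(kept[-1]):
            continue
        kept.append(line)
    return kept

print(drop_prefixed(['foo bar baz', 'foo bar', 'foo', 'food', 'xyzzy']))
# -> ['foo', 'xyzzy']
```

One pass over 3 million sorted lines is cheap; the sort dominates, and `sort(1)` on the raw files would work just as well for that step.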
Re: prefix search on a large file
js wrote:
> Hello, list. I have a list of sentences in text files that I use to
> filter out some data. I managed the list so badly that now it's become
> literally a mess. Let's say the list has the sentences below:
>
>     1. Python has been an important part of Google since the beginning,
>        and remains so as the system grows and evolves.
>     2. Python has been an important part of Google
>     3. important part of Google
>
> As you see sentence 2 is a subset of sentence 1 so I don't need to have
> sentence 1 on the list.

Are you 100% sure that you want to throw away the *longer* sentence?

> (For some reason, it's no problem to have sentence 3. Only a sentence
> that has the same prefix part is the one I want to remove.)
>
> So I decided to clean up the list. I tried to do this simple
> brute-force manner, like

Please don't waste time and bandwidth with "like"; post the exact code that you executed.

>     sorted_list = sorted(file('thelist'), key=len)
>     for line in sorted_list[:]   # won't compile, missing a colon at end
>         unneeded = [line2 for line2 in sorted_list[:]
>                     if line2.startswith(line)]
>         sorted_list = list(set(sorted_list) - (unneeded))  # can't do set - list

Presuming you mean `- set(unneeded)`, but that then produces an empty list (see later). Naming the output sorted_list is misleading advertising. With sorted_list[:] in an inner loop, it's no wonder that it runs slowly.

Test your code on small samples of data and ensure that it is working correctly before you run it on huge amounts of data. Be careful of end cases; note that in my solutions below, the first or last item needs special handling.

> This is so slow and not so helpful because the list is so big (more
> than 100M bytes and has about 3 million lines) and I have more than
> 100 lists.
Here's one way of doing it, tested to the extent shown:

C:\junk>type prefixdel.py

    from pprint import pprint as pp
    data = [
        'foo bar baz',
        'foo bar',
        'foo',
        'food',
        'food',    # duplicate
        'xyzzy',
        'plugh',
        'xyzzy and plugh are magic',
        '',
        ]

    # OP's code as posted (doesn't compile; see my earlier comments):
    #   sorted_list = sorted(file('thelist'), key=len)
    #   for line in sorted_list[:]
    #       unneeded = [line2 for line2 in sorted_list[:] if line2.startswith(line)]
    #       sorted_list = list(set(sorted_list) - set(unneeded))

    slist = sorted(data, key=len)
    print "OP trial 1 input"; pp(slist)
    for line in slist[:]:
        unneeded = [line2 for line2 in slist[:] if line2.startswith(line)]
        slist = list(set(slist) - set(unneeded))
    print "OP trial 1 output"; pp(slist)

    ilist = sorted(data)
    print "sorted input"; pp(ilist)

    olist = []
    for i in xrange(len(ilist)-1, 0, -1):
        if ilist[i].startswith(ilist[i-1]):
            continue
        olist.append(ilist[i])
    olist.append(ilist[0])
    print "output (keeping shorter)"; pp(olist)

    olist = []
    for i in xrange(len(ilist)-1):
        if ilist[i+1].startswith(ilist[i]):
            continue
        olist.append(ilist[i])
    olist.append(ilist[-1])
    print "output (keeping longer)"; pp(olist)

C:\junk>prefixdel.py

    OP trial 1 input
    ['foo', 'food', 'food', '', 'xyzzy', 'plugh', 'foo bar', 'foo bar baz',
     'xyzzy and plugh are magic']
    OP trial 1 output
    []
    sorted input
    ['foo', 'foo bar', 'foo bar baz', 'food', 'food', 'plugh', 'xyzzy',
     'xyzzy and plugh are magic', '']
    output (keeping shorter)
    ['', 'xyzzy', 'plugh', 'food', 'foo']
    output (keeping longer)
    ['foo bar baz', 'food', 'plugh', 'xyzzy and plugh are magic', '']

HTH,
John
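[Editor's note: John's second loop, the keep-the-longer variant, repackaged as a standalone Python 3 function for anyone who wants the opposite policy; this is my own wrapping, not code from the thread.]

```python
def keep_longest(lines):
    """Drop every line that some other line extends, keeping the longest
    member of each prefix chain (and collapsing exact duplicates)."""
    ilist = sorted(lines)
    olist = []
    for i in range(len(ilist) - 1):
        if ilist[i + 1].startswith(ilist[i]):
            continue  # a longer extension follows; drop this line
        olist.append(ilist[i])
    if ilist:
        olist.append(ilist[-1])  # last line has nothing after it to extend it
    return olist

print(keep_longest(['foo bar baz', 'foo bar', 'foo', 'food', 'xyzzy']))
# -> ['foo bar baz', 'food', 'xyzzy']
```

Comparing only adjacent pairs is sufficient because, after a lexicographic sort, the immediate successor of any line that has extensions is itself one of those extensions.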
Re: prefix search on a large file
Thank you for the quick reply.

Here's the exact code I executed (including your code):

    #!/usr/bin/env python
    from pprint import pprint as pp

    data = [
        'foo bar baz',
        'foo bar',
        'foo',
        'food',
        'food',    # duplicate
        'xyzzy',
        'plugh',
        'xyzzy and plugh are magic',
        '',
        ]

    data_sorted_by_len = sorted(data, key=len)
    data_sorted_by_asc = sorted(data)
    print "OP trial 1 input"; pp(data)

    def prefixdel_stupidly(alist):
        for line in alist[:]:
            unneeded = [no for no, line2 in enumerate(alist[1:])
                        if line2.startswith(line) and line != line2]
            adjust = 1
            for i in unneeded:
                del alist[i+adjust]
                adjust -= 1
        return alist

    def prefixdel_recursively(alist):
        if len(alist) < 2:
            return alist
        unneeded = [no for no, line in enumerate(alist[1:])
                    if line.startswith(alist[0])]
        adjust = 1
        for i in unneeded:
            del alist[i+adjust]
            adjust -= 1
        return [alist[0]] + prefixdel_recursively(alist[1:])

    def prefixdel_by_john(alist):
        olist = []
        for i in xrange(len(alist)-1, 0, -1):
            if alist[i].startswith(alist[i-1]):
                continue
            olist.append(alist[i])
        olist.append(alist[0])
        return olist

    if __name__ == '__main__':
        from timeit import Timer
        print sorted(prefixdel_stupidly(data_sorted_by_len[:]))
        print sorted(prefixdel_recursively(data_sorted_by_len[:]))
        print sorted(prefixdel_by_john(data_sorted_by_asc[:]))

        t = Timer("prefixdel_stupidly(data_sorted_by_len)",
                  "from __main__ import prefixdel_stupidly, data_sorted_by_len")
        print t.timeit(number=10)
        t = Timer("prefixdel_recursively(data_sorted_by_len)",
                  "from __main__ import prefixdel_recursively, data_sorted_by_len")
        print t.timeit(number=10)
        t = Timer("prefixdel_by_john(data_sorted_by_asc)",
                  "from __main__ import prefixdel_by_john, data_sorted_by_asc")
        print t.timeit(number=10)

The output is the following:

    $ python2.5 myprefixdel.py
    OP trial 1 input
    ['foo bar baz', 'foo bar', 'foo', 'food', 'food', 'xyzzy', 'plugh',
     'xyzzy and plugh are magic', '']
    ['foo', 'plugh', 'xyzzy', '']
    ['foo', 'plugh', 'xyzzy', '']
    ['foo', 'food', 'plugh', 'xyzzy', '']
    1.17837095261
    1.21182584763
    0.737352132797

Your code is much faster than the ones I wrote, but the result is a little bit different. ('food' should be removed, because 'foo' is in the list.)

As you pointed out, list[:] must be a big evil, but without duplicating the list the problem becomes much harder for me to solve. Any practical solution, anyone?
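[Editor's note: one practical answer to the "without duplicating the list" question, sketched by the editor rather than taken from the thread (Python 3; names are mine). The trap is not copying per se but deleting from the middle of a list in a loop: each `del alist[i]` shifts the whole tail, which is itself O(n). Instead, overwrite survivors at the front of the same list and truncate once at the end.]

```python
def drop_prefixed_inplace(alist):
    """Filter alist in place: sort it, then compact the survivors to the
    front and truncate.  A line survives unless a shorter line already
    kept is a prefix of it.  No full copies, no per-item deletions."""
    alist.sort()
    n = 0  # number of survivors written so far
    for line in alist:
        # n never exceeds the read position, so overwriting alist[n]
        # cannot clobber an element we have yet to read.
        if n == 0 or not line.startswith(alist[n - 1]):
            alist[n] = line
            n += 1
    del alist[n:]  # one O(n) truncation instead of many O(n) deletions
    return alist

data = ['foo bar baz', 'foo bar', 'foo', 'food', 'xyzzy']
drop_prefixed_inplace(data)
print(data)  # -> ['foo', 'xyzzy']
```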
Re: prefix search on a large file
By eliminating list cloning, my function got much faster than before. I really appreciate you, John.

    def prefixdel_recursively2(alist):
        if len(alist) < 2:
            return alist
        first = alist.pop(0)
        unneeded = [no for no, line in enumerate(alist)
                    if line.startswith(first)]
        adjust = 0
        for i in unneeded:
            del alist[i+adjust]
            adjust -= 1
        return [first] + prefixdel_recursively(alist)

    process                  stime
    prefixdel_stupidly     : 11.9247150421
    prefixdel_recursively  : 14.6975700855
    prefixdel_recursively2 : 0.408113956451
    prefixdel_by_john      : 7.60227012634
Re: prefix search on a large file
js wrote:
> By eliminating list cloning, my function got much faster than before.
> I really appreciate you, John.
>
>     def prefixdel_recursively2(alist):
>         if len(alist) < 2:
>             return alist
>         first = alist.pop(0)
>         unneeded = [no for no, line in enumerate(alist)
>                     if line.startswith(first)]
>         adjust = 0
>         for i in unneeded:
>             del alist[i+adjust]
>             adjust -= 1
>         return [first] + prefixdel_recursively(alist)
>
>     process                  stime
>     prefixdel_stupidly     : 11.9247150421
>     prefixdel_recursively  : 14.6975700855
>     prefixdel_recursively2 : 0.408113956451
>     prefixdel_by_john      : 7.60227012634

Those are suspicious results. Time it again with number=1, or a fresh copy of the data for every iteration.

I also have my doubts whether sorting by length is a good idea. To take it to the extreme: what if your data file contains an empty line?

Peter
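[Editor's note: Peter's empty-line warning can be checked in two lines. Every string reports the empty string as a prefix, so if the file contains one blank line, any keep-the-shortest prefix filter collapses the entire list to that blank line. The snippet below is the editor's illustration, not code from the thread.]

```python
lines = ['', 'foo', 'foo bar']

# str.startswith('') is True for every string, including '' itself:
assert all(s.startswith('') for s in lines)

# So a keep-the-shortest prefix filter reduces everything to ['']:
kept = []
for line in sorted(lines):
    if not (kept and line.startswith(kept[-1])):
        kept.append(line)
print(kept)  # -> ['']
```

Whether that is a bug or the technically correct answer depends on the data; stripping blank lines before filtering is probably what the OP wants.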
Re: prefix search on a large file
Try:

    def leaders(sorted_list):
        held = None
        for phrase in held:
            if held is None or not phrase.startswith(held):
                held = row
                yield held

    print list(leaders(sorted(data)))

--Scott David Daniels
[EMAIL PROTECTED]
Re: prefix search on a large file
>     def leaders(sorted_list):
>         held = None
>         for phrase in held:

Ummm... that "for phrase in None:" doesn't do a whole lot other than give a traceback. :*)

I suspect you mean "for phrase in sorted_list:", no?

-tkc
Re: prefix search on a large file
Tim Chase wrote:
>>     def leaders(sorted_list):
>>         held = None
>>         for phrase in held:
> ...
> I suspect you mean "for phrase in sorted_list:", no?

Ooops -- renaming code before posting should get its own test. You are absolutely correct.

--Scott David Daniels
[EMAIL PROTECTED]
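[Editor's note: for completeness, here is the generator with Tim's correction folded in, assuming `row` in the original was likewise meant to be `phrase`. Written and tested by the editor; Python 3.]

```python
def leaders(sorted_list):
    """Yield each line of a sorted iterable that does not extend the
    previously yielded line, i.e. the shortest member of each prefix chain."""
    held = None
    for phrase in sorted_list:  # Tim's fix: iterate the argument, not `held`
        if held is None or not phrase.startswith(held):
            held = phrase       # `row` in the original, presumably `phrase`
            yield held

data = ['foo bar baz', 'foo bar', 'foo', 'food', 'xyzzy', 'plugh']
print(list(leaders(sorted(data))))  # -> ['foo', 'plugh', 'xyzzy']
```

Being a generator, this streams: combined with reading the file lazily, it never needs the whole 100 MB list in memory at once (beyond the sort).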