js wrote:
> Hello, list.
>
> I have a list of sentences in text files that I use to filter out some data.
> I managed the list so badly that now it's become literally a mess.
>
> Let's say the list has the sentences below:
>
> 1. "Python has been an important part of Google since the beginning,
> and remains so as the system grows and evolves. "
>
> 2. "Python has been an important part of Google"
>
> 3. "important part of Google"
>
> As you see, sentence 2 is a subset of sentence 1,
> so I don't need to have sentence 1 on the list.
Are you 100% sure that you want to throw away the *longer* sentence?

> (For some reason, it's no problem to have sentence 3.
> Only a sentence that has the "same prefix part" is the one I want to remove)
>
> So I decided to clean up the list.
>
> I tried to do this in a simple brute-force manner, like

Please don't waste time and bandwidth with "like"; post the exact code
that you executed.

> ---------------------------------------------------------------------------
> sorted_list = sorted(file('thelist'), key=len)
> for line in sorted_list[:]   # won't compile; missing a colon at the end
>     unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
>     sorted_list = list(set(sorted_list) - (unneeded))   # can't do set - list

Presuming you mean - set(unneeded), but that then produces an empty
list (see later). Naming the output "sorted_list" is misleading
advertising. With sorted_list[:] in an inner loop, it's no wonder that
it runs slowly.

Test your code on small samples of data and ensure that it is working
correctly before you run it on huge amounts of data. Be careful of end
cases; note that in my solutions below, the first or last item needs
special handling.

> ....
> ---------------------------------------------------------------------------
>
> This is so slow and not so helpful because the list is
> so big (more than 100M bytes, about 3 million lines)
> and I have more than 100 lists.
Here's one way of doing it, tested to the extent shown:

C:\junk>type prefixdel.py
from pprint import pprint as pp

data = [
    'foo bar baz',
    'foo bar',
    'foo',
    'food',
    'food', # duplicate
    'xyzzy',
    'plugh',
    'xyzzy and plugh are magic',
    'zzzz',
    ]

"""
> sorted_list = sorted(file('thelist'), key=len)
> for line in sorted_list[:]
>     unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
>     sorted_list = list(set(sorted_list) - set(unneeded))
"""
slist = sorted(data, key=len)
print "OP trial 1 input"; pp(slist)
for line in slist[:]:
    unneeded = [line2 for line2 in slist[:] if line2.startswith(line)]
    slist = list(set(slist) - set(unneeded))
print "OP trial 1 output"; pp(slist)

ilist = sorted(data)
print "sorted input"; pp(ilist)

olist = []
for i in xrange(len(ilist)-1, 0, -1):
    if ilist[i].startswith(ilist[i-1]):
        continue
    olist.append(ilist[i])
olist.append(ilist[0])
print "output (keeping shorter)"; pp(olist)

olist = []
for i in xrange(len(ilist)-1):
    if ilist[i+1].startswith(ilist[i]):
        continue
    olist.append(ilist[i])
olist.append(ilist[-1])
print "output (keeping longer)"; pp(olist)

C:\junk>prefixdel.py
OP trial 1 input
['foo',
 'food',
 'food',
 'zzzz',
 'xyzzy',
 'plugh',
 'foo bar',
 'foo bar baz',
 'xyzzy and plugh are magic']
OP trial 1 output
[]
sorted input
['foo',
 'foo bar',
 'foo bar baz',
 'food',
 'food',
 'plugh',
 'xyzzy',
 'xyzzy and plugh are magic',
 'zzzz']
output (keeping shorter)
['zzzz', 'xyzzy', 'plugh', 'food', 'foo']
output (keeping longer)
['foo bar baz', 'food', 'plugh', 'xyzzy and plugh are magic', 'zzzz']

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list
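[Editor's note] The part of the approach above worth keeping for lists of the OP's size is the adjacent-comparison idea: after a lexicographic sort, any line that is a prefix of another sorts immediately before it, so one linear pass suffices instead of the O(n^2) brute force. Below is a minimal Python 3 sketch of both variants; the function name `drop_prefixed` is mine, not from the thread. It shares the same caveat John mentions: only *adjacent* prefix pairs are caught, so in the sample data both 'foo' and 'food' survive the "keep shorter" pass, because 'foo bar' and 'foo bar baz' sort between them.

```python
def drop_prefixed(lines, keep="shorter"):
    """Remove lines related by a prefix to an adjacent sorted neighbour.

    After lexicographic sorting, a line that is a prefix of another
    appears immediately before it, so a single linear pass suffices.
    keep="shorter" drops a line that extends its predecessor;
    keep="longer" drops a line that its successor extends.
    """
    slist = sorted(set(lines))  # set() also removes exact duplicates
    out = []
    for i, line in enumerate(slist):
        if keep == "shorter":
            # drop this line if it extends the previous (shorter) one
            if i > 0 and line.startswith(slist[i - 1]):
                continue
        else:
            # drop this line if the next line extends it
            if i + 1 < len(slist) and slist[i + 1].startswith(line):
                continue
        out.append(line)
    return out
```

On John's sample data this reproduces his two outputs (modulo ordering, since this version iterates forward in sorted order).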