Re: prefix search on a large file

2006-10-13 Thread js
I did the test the way you suggested. It didn't take long to realize
the stupid mistakes I'd made. Thank you.

The following is the result of the test with timeit(number=1),
using a fresh copy of the list for every iteration:

0.331462860107
0.19401717186
0.186257839203
0.0762069225311

I tried my recursive function to fix up my big messed-up list.
It stops immediately with 'RuntimeError: maximum recursion depth exceeded'.
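(A non-recursive rewrite sidesteps the recursion limit entirely; here is a sketch in modern Python of the same shortest-first idea, with an illustrative function name that is not from the thread:)

```python
def prefixdel_iteratively(alist):
    # Same idea as prefixdel_recursively2, but with a loop instead of
    # recursion, so a multi-million-line list cannot hit the recursion limit.
    # Assumes alist is sorted shortest-first, as elsewhere in the thread.
    result = []
    while alist:
        first = alist.pop(0)
        result.append(first)
        # drop every remaining line that merely extends the kept prefix
        alist = [line for line in alist if not line.startswith(first)]
    return result

print(prefixdel_iteratively(sorted(['foo bar', 'food', 'foo', 'xyzzy'], key=len)))
# prints ['foo', 'xyzzy']
```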

I hope this trial and error is getting me somewhere good...

anyway, thank you.

On 10/13/06, Peter Otten [EMAIL PROTECTED] wrote:
 js  wrote:

  By eliminating list cloning, my function got much faster than before.
  I really appreciate you, John.
 
  def prefixdel_recursively2(alist):
      if len(alist) < 2:
          return alist
 
      first = alist.pop(0)
      unneeded = [no for no, line in enumerate(alist) if
                  line.startswith(first)]
      adjust = 0
      for i in unneeded:
          del alist[i+adjust]
          adjust -= 1
 
      return [first] + prefixdel_recursively(alist)
 
 
  process stime
  prefixdel_stupidly : 11.9247150421
  prefixdel_recursively   : 14.6975700855
  prefixdel_recursively2 : 0.408113956451
  prefixdel_by_john: 7.60227012634

 Those are suspicious results. Time it again with number=1, or a fresh copy
 of the data for every iteration.

 I also have my doubts whether sorting by length is a good idea. To take it
 to the extreme: what if your data file contains an empty line?

 Peter
 --
 http://mail.python.org/mailman/listinfo/python-list

-- 
http://mail.python.org/mailman/listinfo/python-list


prefix search on a large file

2006-10-12 Thread js
 Hello, list.

I have a list of sentences in text files that I use to filter out some data.
I managed the list so badly that now it has literally become a mess.

Let's say the list has a sentence below

1. Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. 

2. Python has been an important part of Google

3. important part of Google

As you can see, sentence 2 is a prefix of sentence 1,
so I don't need to have sentence 1 on the list.
(For some reason it's no problem to have sentence 3;
only a sentence that shares the same prefix is one I want to remove.)

So I decided to clean up the list.

I tried to do this simple brute-force manner,  like

---
sorted_list = sorted(file('thelist'), key=len)
for line in sorted_list[:]
  unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
  sorted_list = list(set(sorted_list) - (unneeded))

---

This is so slow and not so helpful, because each list is
so big (more than 100M bytes, about 3 million lines)
and I have more than 100 lists.

I'm not familiar with algorithms/data structures or large-scale data processing,
so any advice, suggestions and recommendations will be appreciated.

Thank you in advance.
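(One approach that scales to inputs this size: sorting lexicographically instead of by length puts every line immediately after its prefixes, so a single linear pass over the sorted list suffices. A sketch in modern Python; the function name is illustrative, not from the list:)

```python
def remove_prefixed_lines(lines):
    # After a lexicographic sort, every line appears directly after its
    # prefixes, so one O(n) pass can compare each line against the most
    # recently kept (shorter) line. Total cost is the O(n log n) sort.
    kept = []
    for line in sorted(lines):
        if kept and line.startswith(kept[-1]):
            continue  # a shorter kept line is a prefix of this one
        kept.append(line)
    return kept
```

Note that an empty line is a prefix of every string, so if the data can contain one it will swallow the whole list; filter empty lines out first.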
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: prefix search on a large file

2006-10-12 Thread John Machin

js  wrote:
 Hello, list.

  I have a list of sentence in text files that I use to filter-out some data.
 I managed the list so badly that now it's become literally a mess.

 Let's say the list has a sentence below

 1. Python has been an important part of Google since the beginning,
 and remains so as the system grows and evolves. 

 2. Python has been an important part of Google

 3. important part of Google

 As you see sentence 2 is a subset of sentence 1
 so I don't need to have sentence 1 on the list.

Are you 100% sure that you want to throw away the *longer* sentence?

 (For some reason, it's no problem to have sentence 3.
 Only sentence that has the same prefix part is the one I want to remove)

 So I decided to clean up the list.

 I tried to do this simple brute-force manner,  like

Please don't waste time and bandwidth with "like"; post the exact code
that you executed.


 ---
 sorted_list = sorted(file('thelist'), key=len)
 for line in sorted_list[:]
# won't compile, missing a colon at end
   unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
   sorted_list = list(set(sorted_list) - (unneeded))
#  can't do set - list
# Presuming you mean - set(unneeded), but then produces an empty list
(see later).
# Naming the output sorted_list is misleading advertising.
# With sorted_list[:] in an inner loop, it's no wonder that it runs
slowly.
# Test your code on small samples of data and ensure that it is working
correctly before you run it on huge amounts of data.
# Be careful of end cases; note that in my solutions below, the first
or last item needs special handling.

 
 ---

 This is so slow and not so helpful because the list is
 so big(more than 100M bytes and has about 3 million lines)
 and I have more than 100 lists.

Here's one way of doing it, tested to the extent shown:

C:\junk>type prefixdel.py
from pprint import pprint as pp
data = [
'foo bar baz',
'foo bar',
'foo',
'food',
'food', # duplicate
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'',
]


 sorted_list = sorted(file('thelist'), key=len)
 for line in sorted_list[:]
   unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
   sorted_list = list(set(sorted_list) - set(unneeded))

slist = sorted(data, key=len)
print "OP trial 1 input"; pp(slist)
for line in slist[:]:
    unneeded = [line2 for line2 in slist[:] if line2.startswith(line)]
    slist = list(set(slist) - set(unneeded))
print "OP trial 1 output"; pp(slist)

ilist = sorted(data)
print "sorted input"; pp(ilist)

olist = []
for i in xrange(len(ilist)-1, 0, -1):
    if ilist[i].startswith(ilist[i-1]):
        continue
    olist.append(ilist[i])
olist.append(ilist[0])
print "output (keeping shorter)"; pp(olist)

olist = []
for i in xrange(len(ilist)-1):
    if ilist[i+1].startswith(ilist[i]):
        continue
    olist.append(ilist[i])
olist.append(ilist[-1])
print "output (keeping longer)"; pp(olist)


C:\junk>prefixdel.py
OP trial 1 input
['foo',
 'food',
 'food',
 '',
 'xyzzy',
 'plugh',
 'foo bar',
 'foo bar baz',
 'xyzzy and plugh are magic']
OP trial 1 output
[]
sorted input
['foo',
 'foo bar',
 'foo bar baz',
 'food',
 'food',
 'plugh',
 'xyzzy',
 'xyzzy and plugh are magic',
 '']
output (keeping shorter)
['', 'xyzzy', 'plugh', 'food', 'foo']
output (keeping longer)
['foo bar baz', 'food', 'plugh', 'xyzzy and plugh are magic', '']

HTH,
John

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: prefix search on a large file

2006-10-12 Thread js
Thank you for the quick reply.

Here's the exact code I executed (including your code):

#!/usr/bin/env python
from pprint import pprint as pp

data = [
   'foo bar baz',
   'foo bar',
   'foo',
   'food',
   'food', # duplicate
   'xyzzy',
   'plugh',
   'xyzzy and plugh are magic',
   '',
   ]

data_sorted_by_len = sorted(data, key=len)
data_sorted_by_asc = sorted(data)
print "OP trial 1 input"; pp(data)

def prefixdel_stupidly(alist):
    for line in alist[:]:
        unneeded = [no for no, line2 in enumerate(alist[1:]) if
                    line2.startswith(line) and line != line2]
        adjust = 1
        for i in unneeded:
            del alist[i+adjust]
            adjust -= 1
    return alist

def prefixdel_recursively(alist):
    if len(alist) < 2:
        return alist

    unneeded = [no for no, line in enumerate(alist[1:]) if
                line.startswith(alist[0])]
    adjust = 1
    for i in unneeded:
        del alist[i+adjust]
        adjust -= 1

    return [alist[0]] + prefixdel_recursively(alist[1:])

def prefixdel_by_john(alist):
    olist = []
    for i in xrange(len(alist)-1, 0, -1):
        if alist[i].startswith(alist[i-1]):
            continue
        olist.append(alist[i])
    olist.append(alist[0])
    return olist

if __name__ == '__main__':
    from timeit import Timer
    print sorted(prefixdel_stupidly(data_sorted_by_len[:]))
    print sorted(prefixdel_recursively(data_sorted_by_len[:]))
    print sorted(prefixdel_by_john(data_sorted_by_asc[:]))

    t = Timer("prefixdel_stupidly(data_sorted_by_len)",
              "from __main__ import prefixdel_stupidly, data_sorted_by_len")
    print t.timeit(number=10)
    t = Timer("prefixdel_recursively(data_sorted_by_len)",
              "from __main__ import prefixdel_recursively, data_sorted_by_len")
    print t.timeit(number=10)
    t = Timer("prefixdel_by_john(data_sorted_by_asc)",
              "from __main__ import prefixdel_by_john, data_sorted_by_asc")
    print t.timeit(number=10)

The output is the following:

$ python2.5 myprefixdel.py
OP trial 1 input
['foo bar baz',
 'foo bar',
 'foo',
 'food',
 'food',
 'xyzzy',
 'plugh',
 'xyzzy and plugh are magic',
 '']
['foo', 'plugh', 'xyzzy', '']
['foo', 'plugh', 'xyzzy', '']
['foo', 'food', 'plugh', 'xyzzy', '']
1.17837095261
1.21182584763
0.737352132797

Your code is much faster than the ones I wrote,
but the result is a little bit different. ('food' should be removed,
because 'foo' is in the list.)

As you pointed out, list[:] must be a big evil, but without duplicating
the list the problem becomes much harder for me to solve.

Any practical solution, anyone?
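(One way to avoid both the alist[:] copies and the index-adjusted del loop is a slice assignment, which rebuilds the list's contents in place. A sketch with an illustrative name, not from the thread:)

```python
def drop_lines_with_prefix(alist, prefix):
    # Slice assignment replaces the list's contents in place, so callers
    # holding a reference see the change, and there is no per-item del
    # bookkeeping (the old index-adjustment trick) to get wrong.
    alist[:] = [line for line in alist if not line.startswith(prefix)]

lines = ['food', 'foo bar', 'plugh']
drop_lines_with_prefix(lines, 'foo')
print(lines)  # prints ['plugh']
```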
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: prefix search on a large file

2006-10-12 Thread js
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recursively2(alist):
    if len(alist) < 2:
        return alist

    first = alist.pop(0)
    unneeded = [no for no, line in enumerate(alist) if line.startswith(first)]
    adjust = 0
    for i in unneeded:
        del alist[i+adjust]
        adjust -= 1

    return [first] + prefixdel_recursively(alist)


process stime
prefixdel_stupidly : 11.9247150421
prefixdel_recursively   : 14.6975700855
prefixdel_recursively2 : 0.408113956451
prefixdel_by_john: 7.60227012634
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: prefix search on a large file

2006-10-12 Thread Peter Otten
js  wrote:

 By eliminating list cloning, my function got much faster than before.
 I really appreciate you, John.
 
 def prefixdel_recursively2(alist):
     if len(alist) < 2:
         return alist
 
     first = alist.pop(0)
     unneeded = [no for no, line in enumerate(alist) if
                 line.startswith(first)]
     adjust = 0
     for i in unneeded:
         del alist[i+adjust]
         adjust -= 1
 
     return [first] + prefixdel_recursively(alist)
 
 
 process stime
 prefixdel_stupidly : 11.9247150421
 prefixdel_recursively   : 14.6975700855
 prefixdel_recursively2 : 0.408113956451
 prefixdel_by_john: 7.60227012634

Those are suspicious results. Time it again with number=1, or a fresh copy
of the data for every iteration.

I also have my doubts whether sorting by length is a good idea. To take it
to the extreme: what if your data file contains an empty line?
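(The empty-line case is easy to demonstrate: every string starts with the empty string, so with a shortest-first sort '' is kept first and every other line looks like a mere extension of it. A quick illustration, not from the original post:)

```python
data = ['', 'foo', 'foo bar']
shortest_first = sorted(data, key=len)   # '' sorts to the front
assert shortest_first[0] == ''
# every remaining line "starts with" the kept empty prefix ...
assert all(line.startswith('') for line in shortest_first[1:])
# ... so a prefix-removal pass keyed on length would delete them all
```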

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: prefix search on a large file

2006-10-12 Thread Scott David Daniels

Try:

 def leaders(sorted_list):
     held = None
     for phrase in held:
         if held is None or not phrase.startswith(held):
             held = row
             yield held

 print list(leaders(sorted(data)))

--Scott David Daniels
[EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: prefix search on a large file

2006-10-12 Thread Tim Chase
  def leaders(sorted_list):
      held = None
      for phrase in held:

Ummm...that

for phrase in None:

doesn't do a whole lot other than give a traceback. :*)

I suspect you mean

for phrase in sorted_list:

no?

-tkc


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: prefix search on a large file

2006-10-12 Thread Scott David Daniels
Tim Chase wrote:
  def leaders(sorted_list):
      held = None
      for phrase in held:
 ... I suspect you mean
 for phrase in sorted_list:
 no?

Ooops -- renaming code before posting should get its own test.
You are absolutely correct.
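(With both renames applied, the generator presumably intended looks like this; it is the same neighbour-comparison idea as elsewhere in the thread, and the usual empty-line caveat still applies:)

```python
def leaders(sorted_list):
    # Yield each line that does not extend the most recently kept line.
    # Assumes lexicographically sorted input.
    held = None
    for phrase in sorted_list:
        if held is None or not phrase.startswith(held):
            held = phrase
            yield held

print(list(leaders(sorted(['foo bar', 'foo', 'food', 'plugh']))))
# prints ['foo', 'plugh']
```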


--Scott David Daniels
[EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list