Re: [Tutor] Searching through large number of string items

2008-04-10 Thread Dinesh B Vadhia
The 10,000 string items are sorted.

The way the autocomplete works is that when a user enters a char eg. 'f', the 
'f' is sent to the server and returns strings with the char 'f'.  You can limit 
the number of items sent back to the browser (say, limit to between 15 and 
100).  The string items containing 'f' are displayed.  The user can then enter 
another char eg. 'a' to make 'fa'.  The autocomplete plugin will search the 
cache to find all items containing 'fa' but may need to go back to the server 
to collect others.  And, so on.  Equally, the user could backspace the 'f' and 
enter 'k'.  The 'k' will be sent to the server to find strings containing 'k', 
and so on.

One way to solve this is with linear search which as you rightly pointed out 
has horrible performance (and it has!).  I'll try the binary search and let you 
know.  I'll also look at the trie structure.

An alternative is to create an in-memory SQLite database of the string items.  
Any thoughts on that?

Dinesh


- Original Message - 
From: Kent Johnson 
To: Dinesh B Vadhia 
Cc: tutor@python.org 
Sent: Thursday, April 10, 2008 5:20 AM
Subject: Re: [Tutor] List comprehensions


Dinesh B Vadhia wrote:
 Kent
  
 I'm using a Javascript autocomplete plugin for an online web 
 application/service.  Each time a user inputs a character, the character 
 is sent to the backend Python program which searches for the character 
 in a list of 10,000 string items.  Once it finds the character, the 
 backend will return that string and N other adjacent string items where 
 N can vary from 20 to 150.  Each string item is sent back to the JS in 
 separate print statements.  Hence, the for loop.

Ok, this sounds a little closer to a real spec. What kind of search are 
you doing? Do you really just search for individual characters or are 
you looking for the entire string entered so far as a prefix? Is the 
list of 10,000 items sorted? Can it be?

You need to look at your real problem and find an appropriate data 
structure, rather than showing us what you think is the solution and 
asking how to make it faster.

For example, if what you have a sorted list of strings and you want to 
find the first string that starts with a given prefix and return the N 
adjacent strings, you could use the bisect module to do a binary search 
rather than a linear search. Binary search of 10,000 items will take 
13-14 comparisons to find the correct location. Your linear search will 
take an average of 5,000 comparisons.

You might also want to use a trie structure though I'm not sure if that 
will let you find adjacent items.
http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/
http://jtauber.com/blog/2005/02/10/updated_python_trie_implementation/

 I haven't done any profiling yet as we are still building the system but 
 it seemed sensible that replacing the for loop with a built-in would 
 help.  Maybe not?

Not. An algorithm with poor big O performance should be *replaced*, 
not optimized.

Kent

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Searching through large number of string items

2008-04-10 Thread Kent Johnson
Dinesh B Vadhia wrote:
 The 10,000 string items are sorted.
  
 The way the autocomplete works is that when a user enters a char eg. 
 'f', the 'f' is sent to the server and returns strings with the char 
 'f'. 

If it is all strings containing 'f' (not all strings starting with 'f') 
then the binary search will not work. A database might work better for that.

You can get all strings containing some substring x with
[ item for item in list if x in item ]

Of course that is back to linear search. You mentioned before that you 
want to also show adjacent items? I don't know how to do that with a 
database either.

Kent
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Searching through large number of string items

2008-04-10 Thread Dinesh B Vadhia
Ignore the 'adjacent items' remark.   The rest is correct ie. looking for all 
strings containing a substring x.


- Original Message - 
From: Kent Johnson 
To: Dinesh B Vadhia 
Cc: tutor@python.org 
Sent: Thursday, April 10, 2008 6:32 AM
Subject: Re: [Tutor] Searching through large number of string items


Dinesh B Vadhia wrote:
 The 10,000 string items are sorted.
  
 The way the autocomplete works is that when a user enters a char eg. 
 'f', the 'f' is sent to the server and returns strings with the char 
 'f'. 

If it is all strings containing 'f' (not all strings starting with 'f') 
then the binary search will not work. A database might work better for that.

You can get all strings containing some substring x with
[ item for item in list if x in item ]

Of course that is back to linear search. You mentioned before that you 
want to also show adjacent items? I don't know how to do that with a 
database either.

Kent
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Searching through large number of string items

2008-04-10 Thread Greg Graham
One idea has to do with the fact that there are only 26 (assuming Latin
alphabet) possible first letters, so I would try splitting up the list
of 10,000 into 26 lists in a dictionary indexed by the first letter.
Just doing that is a big reduction of your search space. That way you
won't be doing the same search every time for a particular first letter.
It might even be worthwhile to split each of those into 26 sublists
based on the second letter. Now you've chopped up your 10,000 words into
676 lists, each of which might be small enough to send to the client
without further searching. (Too bad you won't have an even distribution
across all letters. Then each list would only have 15 words in it.)

 

You could also try using SQLite. I'm using right now in a Django
application, and I'm very happy with the setup and performance,
especially for read operations. With Django, I'm using their ORM, which
is quite nice, so I'm not doing any SQL directly. I think there can be
problems with SQLite when you attempt concurrent writes, but you
wouldn't have that.

 

It's hard to predict which would perform better, a tailor made domain
specific solution written in Python, or a general purpose in-memory
database written in C. I would start with which ever direction you are
most comfortable, and if you can't get satisfactory performance, try the
other route.

 

Greg

 

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Dinesh B Vadhia
Sent: Thursday, April 10, 2008 8:13 AM
To: tutor@python.org
Subject: Re: [Tutor] Searching through large number of string items

 

The 10,000 string items are sorted.

 

The way the autocomplete works is that when a user enters a char eg.
'f', the 'f' is sent to the server and returns strings with the char
'f'.  You can limit the number of items sent back to the browser (say,
limit to between 15 and 100).  The string items containing 'f' are
displayed.  The user can then enter another char eg. 'a' to make 'fa'.
The autocomplete plugin will search the cache to find all items
containing 'fa' but may need to go back to the server to collect others.
And, so on.  Equally, the user could backspace the 'f' and enter 'k'.
The 'k' will be sent to the server to find strings containing 'k', and
so on.

 

One way to solve this is with linear search which as you rightly pointed
out has horrible performance (and it has!).  I'll try the binary search
and let you know.  I'll also look at the trie structure.

 

An alternative is to create an in-memory SQLite database of the string
items.  Any thoughts on that?

 

Dinesh

 

 

- Original Message - 

From: Kent Johnson mailto:[EMAIL PROTECTED]  

To: Dinesh B Vadhia mailto:[EMAIL PROTECTED]  

Cc: tutor@python.org 

Sent: Thursday, April 10, 2008 5:20 AM

Subject: Re: [Tutor] List comprehensions

 

Dinesh B Vadhia wrote:
 Kent
  
 I'm using a Javascript autocomplete plugin for an online web 
 application/service.  Each time a user inputs a character, the
character 
 is sent to the backend Python program which searches for the character

 in a list of 10,000 string items.  Once it finds the character, the 
 backend will return that string and N other adjacent string items
where 
 N can vary from 20 to 150.  Each string item is sent back to the JS in

 separate print statements.  Hence, the for loop.

Ok, this sounds a little closer to a real spec. What kind of search are 
you doing? Do you really just search for individual characters or are 
you looking for the entire string entered so far as a prefix? Is the 
list of 10,000 items sorted? Can it be?

You need to look at your real problem and find an appropriate data 
structure, rather than showing us what you think is the solution and 
asking how to make it faster.

For example, if what you have a sorted list of strings and you want to 
find the first string that starts with a given prefix and return the N 
adjacent strings, you could use the bisect module to do a binary search 
rather than a linear search. Binary search of 10,000 items will take 
13-14 comparisons to find the correct location. Your linear search will 
take an average of 5,000 comparisons.

You might also want to use a trie structure though I'm not sure if that 
will let you find adjacent items.
http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/
http://jtauber.com/blog/2005/02/10/updated_python_trie_implementation/

 I haven't done any profiling yet as we are still building the system
but 
 it seemed sensible that replacing the for loop with a built-in would 
 help.  Maybe not?

Not. An algorithm with poor big O performance should be *replaced*, 
not optimized.

Kent

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Searching through large number of string items

2008-04-10 Thread Kent Johnson
Dinesh B Vadhia wrote:
 Ignore the 'adjacent items' remark.   The rest is correct ie. looking 
 for all strings containing a substring x.

Perhaps this would help:
http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/

A SubstringDict that maps each string to itself would do exactly what 
you want. I have no idea what the performance would be...

Kent
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor