Re: searching algorithm
On 2007-05-11, Terry Reedy <[EMAIL PROTECTED]> wrote: > > "Neil Cerutti" <[EMAIL PROTECTED]> wrote in message > news:[EMAIL PROTECTED] >| Every node is a tuple of its letter, a list of its children, and >| a list of its words. So the two 'pelin' nodes would be (with 'e' >| referenced in the 'h' node): >| >| ('h', [('e', [], ['pelin'])], ['pelin']) >| >| That would in turn be "stored" in the t, n, i and s nodes. > [snip] > > At the outer level, I would use a list in order to build the > structure in pieces, one for each letter, and then add them in. > > At the top level, the letters do not need explicit storage. > The first subtree is for words starting with 'a', etc. In > other words, the position of each subtree indicates its > starting letter. For most letters, this can be carried on > another level. Here's a proof of concept prototype of the dictionary searching algorithm. The dictionary I found to test with has roughly 5,000 words, and I didn't attempt any optimizations at all. I'm guessing it can't be sped up very much, even though I wrote the simplest possible implementation. Lookup in the worst case is O(N), where N is the number of letters in your prefix. First, a sample of input/output. lookup: pe people: la gente[Noun] pepper: pepe[Noun] peppercorn: grano di pepe[Noun] peppermint: menta peperita[Noun] peppery: pepato, acre, pungente[Adjective] pepsin: pepsina[Noun] permutation: permutazione[Noun] permutations: permutazioni[Noun] permute: permutare[Verb] perorate: perorare perpendicular: perpendicolare[Adjective] perpendicularly: perpendicolarmente[Adverb] perpendiculars: perpendicolari[Noun] perpetrate: perpetrare[Verb] perpetrated: perpetrato[Verb] perpetual: perpetuo[Adjective] petard: petardo[Noun] petroleum: petrolio[Noun] Now the source. # Each node is a tuple of (letter, node list, word list). A word # list is just a list of strings. root = ('', [], []) def insert(node, key, value): if len(key) == 0: node[2].append(value) else: first = key[0] rest = key[1:] for n in node[1]: if n[0] == first: insert(n, rest, value) return node[1].append((first, [], [])) insert(node, key, value) def lookup(node, key): if len(key) == 0: return node else: first = key[0] rest = key[1:] for v in node[1]: if v[0] == first: return lookup(v, rest) return None def word_list(node, word): for v in node[2]: print '%s: %s' % (word, v) for n in node[1]: word_list(n, word+n[0]) # Italian.txt from http://www.june29.com/IDP/ source = open('Italian.txt', 'r') # Build tree for line in source: line = line.strip() if line[0] == '#': continue key, value = line.split('\t') insert(root, key, value) # Look up a prefix x = raw_input("lookup: ") tree = lookup(root, x) if tree: word_list(tree, x) else: print "Not found." -- Neil Cerutti -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
On May 10, 1:26 pm, Gigs_ <[EMAIL PROTECTED]> wrote: > Hi all! > > I have text file (english-croatian dictionary) with words in it in > alphabetical > order. > This file contains 17 words in this format: > english word: croatian word Let's assume it's okay to have all the data in memory. In my experience the very fastest way to do what you want is to store the strings in a sorted list and use the binary search library module bisect. I once compared this with doing something similar with tries and it was much faster. It's also the most simple way to do it, which is nice too :). -- Aaron Watters === never eat anything bigger than your head -- kliban -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
On May 11, 2007, at 3:50 AM, Michael Bentley wrote: > > Here's an idea: use a rats' nest of dictionaries and do all the > lookup work up front when you build the rats' nest. Maybe something > like this: ... Oops! This is better :-) #! /usr/bin/env python import pprint dictionary = """absinth:pelin absinthe:pelin absolute:apsolutan absolute:apsolutni kod absolute:apsolutno absolute:čist absolute:nesumnjiv absolute:potpun absolute:savrsen absolute coordinates:apsolutne koordinate absolute frequency:apsolutna učestalost absolute gap:apsolutni jaz absolute line spacing:apsolutni međurazmak linija absolute majority:apsolutna većina absolute pointing device:apsolutni pokazivački uređaj absolute quantity:apsolutni udio absolute value:apsolutna vrijednost absolute zero:apsolutna nula absolutely:apsolutno absolutely:bezuvjetno absolutely:nezavisno absolutely:potpuno absolutely:samostalno absolutely:sasvim absolution:odrjesenje absolution:oprostaj absolutism:apsolutizam absolve:odrijesiti absolve:osloboditi absorb:absorbirati absorb:apsorbirati absorb:crpsti""" lookup = {'words':{}, 'letters':{}} for translation in dictionary.split('\n'): english, croatian = translation.split(':') if english in lookup['words']: lookup['words'][english].append(croatian) else: lookup['words'][english] = [english, croatian] for position, letter in enumerate(english): if position == 0: youAreHere = lookup['letters'] if letter not in youAreHere: youAreHere[letter] = {'words':[]} if lookup['words'][english] not in youAreHere[letter]['words']: youAreHere[letter]['words'].append(lookup['words'][english]) youAreHere = youAreHere[letter] def tryit(partial): youAreHere = lookup['letters'] for letter in partial: youAreHere = youAreHere[letter] return youAreHere['words'] if __name__ == '__main__': pprint.pprint(tryit('abs')) print '=' * 50 pprint.pprint(tryit('absorb')) -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
Michael Bentley <[EMAIL PROTECTED]> wrote: > > > > Call me dense, but how does one do this in Python - which doesn't have > > pointers? Dictionaries with dictionaries within dictionaries... (with > > each letter as the key and the its children as values) is going to be > > extremely space inefficient, right? > > Isn't *everything* in python essentially a pointer? Dictionaries > with dictionaries within dictionaries... My gut feeling (which means > I have not measured it, so I don't actually know) is that it would > not be space inefficient. Perhaps someone who knows more about this > will speak up? Dicts are hash tables, and therefore, for performance, always keep some "extra" space (so that the table won't be too full). Alex -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
On 2007-05-11, ciju <[EMAIL PROTECTED]> wrote: > By the way, both data structures could be implemented as tuple > in python, for I suppose, if only lookup is needed tuple gives > better performance than list. I used a list instead of a tuple where I thought a list would be convenient while building the data structure. But you could convert everything to tuples in the end, it's true. -- Neil Cerutti -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
On May 10, 2007, at 12:26 PM, Gigs_ wrote: > Hi all! > > I have text file (english-croatian dictionary) with words in it in > alphabetical > order. > This file contains 17 words in this format: > english word: croatian word > > I want to make instant search for my gui > Instant search, i mean that my program search words and show words > to user as > user type letters. > yes, it needs to be fast > > Can someone give me some points (approaches) how to make this > Should i make indexes and go with file.seek > > or should breake dictionary in peaces and put words that start with > a in one and > with b in another... > ? > > So if i have this words > ... > > if user type: "abs" program should list all words above in english > and in croatian > if user type: "absorb" than program should list last 3 words in > english and in > croatian > > > > > > any help would be appreciate! > my apologies for bad english Here's an idea: use a rats' nest of dictionaries and do all the lookup work up front when you build the rats' nest. Maybe something like this: #! /usr/bin/env python import pprint dictionary = """absinth:pelin absinthe:pelin absolute:apsolutan absolute:apsolutni kod absolute:apsolutno absolute:čist absolute:nesumnjiv absolute:potpun absolute:savrsen absolute coordinates:apsolutne koordinate absolute frequency:apsolutna učestalost absolute gap:apsolutni jaz absolute line spacing:apsolutni međurazmak linija absolute majority:apsolutna većina absolute pointing device:apsolutni pokazivački uređaj absolute quantity:apsolutni udio absolute value:apsolutna vrijednost absolute zero:apsolutna nula absolutely:apsolutno absolutely:bezuvjetno absolutely:nezavisno absolutely:potpuno absolutely:samostalno absolutely:sasvim absolution:odrjesenje absolution:oprostaj absolutism:apsolutizam absolve:odrijesiti absolve:osloboditi absorb:absorbirati absorb:apsorbirati absorb:crpsti""" lookup = {'words':{}, 'letters':{}} for translation in dictionary.split('\n'): english, croatian = translation.split(':') if english in lookup['words']: lookup['words'][english].append(croatian) else: lookup['words'][english] = [croatian] for position, letter in enumerate(english): if position == 0: youAreHere = lookup['letters'] if letter not in youAreHere: youAreHere[letter] = {'words':[]} youAreHere[letter]['words'].append(lookup['words'][english]) youAreHere = youAreHere[letter] def tryit(partial): youAreHere = lookup['letters'] for letter in partial: youAreHere = youAreHere[letter] return youAreHere['words'] if __name__ == '__main__': pprint.pprint(tryit('abso')) Hope this helps, Michael --- The Rules of Optimization are simple. Rule 1: Don't do it. Rule 2 (for experts only): Don't do it yet. -- Michael A. Jackson , "Principles of Program Design", 1975. -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
> > Call me dense, but how does one do this in Python - which doesn't have > pointers? Dictionaries with dictionaries within dictionaries... (with > each letter as the key and the its children as values) is going to be > extremely space inefficient, right? Isn't *everything* in python essentially a pointer? Dictionaries with dictionaries within dictionaries... My gut feeling (which means I have not measured it, so I don't actually know) is that it would not be space inefficient. Perhaps someone who knows more about this will speak up? -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
On May 11, 3:12 am, Neil Cerutti <[EMAIL PROTECTED]> wrote: > On 2007-05-10, Gordon Airporte <[EMAIL PROTECTED]> wrote: > > > > > > >> For the above (abrideged) dictionary, you would generate (use a > >> fixed-width "programmers" font so the tree looks good): > > >> a > >> | > >> b > >> | > >> s > >> / \ > >>i o > >> / / \ > >> n l r > >> / / \ \ > >>t u v b->(absorbirati, crpisti) > >> /| | > >> (pelin)<-h t e->(odrije?iti, osloboditi) > >> | | > >> (pelin)<-e e->(apsolutan, apsolutni kod) > > >> As the user enter letters, you just march down the tree, printing > >> all the words held in leaf nodes held in the current node. > > > Call me dense, but how does one do this in Python - which > > doesn't have pointers? Dictionaries with dictionaries within > > dictionaries... (with each letter as the key and the its > > children as values) is going to be extremely space inefficient, > > right? > > Unfortunately, I don't know the space tradeoffs in Python > offhand. Lists and tuples make excellent trees. > > The above might be stored as follows: > > Every node is a tuple of its letter, a list of its children, and > a list of its words. So the two 'pelin' nodes would be (with 'e' > referenced in the 'h' node): > > ('h', [('e', [], ['pelin'])], ['pelin']) > > That would in turn be "stored" in the t, n, i and s nodes. > > ('s', > [('i', > [('n', >[('t', > [('h', [('e', [], ['pelin'])], ['pelin']) > [])] > [])] >[]), ('o' trie (thanks Terry) omitted for my sanity)]) > > It's a lot harder to write by hand than it would be to use. > > My intuition says it shouldn't be terribly hard on resources for > for a 180K dictionary, but I could be wrong. I'm too lazy to > measure. ;) > > If it does turn out to be unreasonably huge, then you'd fall back > on solving the problem completely with a binary search returning > a range (I'm not sure of the name), which would be more expensive > at run time, but might be fast enough, and would use a minimal > amount of 'resources'. > > -- > Neil Cerutti The problem that I see here is that each time user types a letter, all the leaf nodes in the subtree would have to be traversed again. This computation was already done for all the letters user typed before the last one. My suggestion would be to have two data structures. The first one similar to the tree mentioned above, but leaf nodes need not have Croatian translations, and each node should also have total number of leafs in both left siblings and right siblings of that node. The second data structure would be list of English:Croatian word pairs. As the user types a new letter, we just have to remove(not show), the number of leaf elements in the left and/or right siblings, from left and/or right of the second data structure. So we just have to keep track of the node we r currently at(in the tree) and the start, end index for the second data structure(basically the list to be shown to the user). By the way, both data structures could be implemented as tuple in python, for I suppose, if only lookup is needed tuple gives better performance than list. ciju -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
"Neil Cerutti" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] | Every node is a tuple of its letter, a list of its children, and | a list of its words. So the two 'pelin' nodes would be (with 'e' | referenced in the 'h' node): | | ('h', [('e', [], ['pelin'])], ['pelin']) | | That would in turn be "stored" in the t, n, i and s nodes. [snip] At the outer level, I would use a list in order to build the structure in pieces, one for each letter, and then add them in. At the top level, the letters do not need explicit storage. The first subtree is for words starting with 'a', etc. In other words, the position of each subtree indicates its starting letter. For most letters, this can be carried on another level. tjr -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
On 2007-05-10, Gordon Airporte <[EMAIL PROTECTED]> wrote: > >> For the above (abrideged) dictionary, you would generate (use a >> fixed-width "programmers" font so the tree looks good): >> >> a >> | >> b >> | >> s >> / \ >>i o >> / / \ >> n l r >> / / \ \ >>t u v b->(absorbirati, crpisti) >> /| | >> (pelin)<-h t e->(odrije?iti, osloboditi) >> | | >> (pelin)<-e e->(apsolutan, apsolutni kod) >> >> As the user enter letters, you just march down the tree, printing >> all the words held in leaf nodes held in the current node. > > Call me dense, but how does one do this in Python - which > doesn't have pointers? Dictionaries with dictionaries within > dictionaries... (with each letter as the key and the its > children as values) is going to be extremely space inefficient, > right? Unfortunately, I don't know the space tradeoffs in Python offhand. Lists and tuples make excellent trees. The above might be stored as follows: Every node is a tuple of its letter, a list of its children, and a list of its words. So the two 'pelin' nodes would be (with 'e' referenced in the 'h' node): ('h', [('e', [], ['pelin'])], ['pelin']) That would in turn be "stored" in the t, n, i and s nodes. ('s', [('i', [('n', [('t', [('h', [('e', [], ['pelin'])], ['pelin']) [])] [])] []), ('o' trie (thanks Terry) omitted for my sanity)]) It's a lot harder to write by hand than it would be to use. My intuition says it shouldn't be terribly hard on resources for for a 180K dictionary, but I could be wrong. I'm too lazy to measure. ;) If it does turn out to be unreasonably huge, then you'd fall back on solving the problem completely with a binary search returning a range (I'm not sure of the name), which would be more expensive at run time, but might be fast enough, and would use a minimal amount of 'resources'. -- Neil Cerutti -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
Gordon Airporte wrote: > >> For the above (abrideged) dictionary, you would generate (use a >> fixed-width "programmers" font so the tree looks good): >> >> a >> | >> b >> | >> s >> / \ >>i o >> / / \ >> n l r >> / / \ \ >>t u v b->(absorbirati, crpisti) >> /| | >> (pelin)<-h t e->(odrije?iti, osloboditi) >> | | >> (pelin)<-e e->(apsolutan, apsolutni kod) >> >> As the user enter letters, you just march down the tree, printing >> all the words held in leaf nodes held in the current node. >> > > Call me dense, but how does one do this in Python - which doesn't have > pointers? Dictionaries with dictionaries within dictionaries... (with > each letter as the key and the its children as values) is going to be > extremely space inefficient, right? How would this be significantly more space inefficient than "pointers"? The implementation of the dict would, in fact, itself be pointers, where each key (same memory requirement if using pointers) is mapped to a pointer to a dict node or a pointer to a leaf, same as with a more "low level" construction. A printout of the graph as a dict might look a little ugly, though. I could be wrong, however. James -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
> For the above (abrideged) dictionary, you would generate (use a > fixed-width "programmers" font so the tree looks good): > > a > | > b > | > s > / \ >i o > / / \ > n l r > / / \ \ >t u v b->(absorbirati, crpisti) > /| | > (pelin)<-h t e->(odrije?iti, osloboditi) > | | > (pelin)<-e e->(apsolutan, apsolutni kod) > > As the user enter letters, you just march down the tree, printing > all the words held in leaf nodes held in the current node. > Call me dense, but how does one do this in Python - which doesn't have pointers? Dictionaries with dictionaries within dictionaries... (with each letter as the key and the its children as values) is going to be extremely space inefficient, right? -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
"Neil Cerutti" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] | On 2007-05-10, Gigs_ <[EMAIL PROTECTED]> wrote: | > if user type: "abs" program should list all words above in | > english and in croatian if user type: "absorb" than program | > should list last 3 words in english and in croatian | | A solution that solves the problem with a data structure might be | a multi-tree. Specific computer science terms are prefix tree or trie (from reTRIEval). http://en.wikipedia.org/wiki/Trie gives an introduction. | Each node points at a set of following letters, and a set of | croatian translations, either of which might be empty, making the | node a leaf. | | For the above (abrideged) dictionary, you would generate (use a | fixed-width "programmers" font so the tree looks good): | | a | | | b | | | s |/ \ | i o | / / \ | n l r |/ / \ \ | t u v b->(absorbirati, crpisti) | /| | | (pelin)<-h t e->(odrije?iti, osloboditi) | | | | (pelin)<-e e->(apsolutan, apsolutni kod) | | As the user enter letters, you just march down the tree, printing | all the words held in leaf nodes held in the current node. tjr | -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
On 2007-05-10, Gigs_ <[EMAIL PROTECTED]> wrote: > Hi all! > > I have text file (english-croatian dictionary) with words in it > in alphabetical order. This file contains 17 words in this > format: english word: croatian word > > I want to make instant search for my gui Instant search, i mean > that my program search words and show words to user as user > type letters. yes, it needs to be fast > > Can someone give me some points (approaches) how to make this > Should i make indexes and go with file.seek > > or should breake dictionary in peaces and put words that start > with a in one and with b in another... > > So if i have this words > [abridged dictionary below] > absinth:pelin > absinthe:pelin > absolute:apsolutan > absolute:apsolutni kod > absolve:odrije?iti > absolve:osloboditi > absorb:absorbirati > absorb:apsorbirati > absorb:crpsti > > if user type: "abs" program should list all words above in > english and in croatian if user type: "absorb" than program > should list last 3 words in english and in croatian A solution that solves the problem with a data structure might be a multi-tree. Each node points at a set of following letters, and a set of croatian translations, either of which might be empty, making the node a leaf. For the above (abrideged) dictionary, you would generate (use a fixed-width "programmers" font so the tree looks good): a | b | s / \ i o / / \ n l r / / \ \ t u v b->(absorbirati, crpisti) /| | (pelin)<-h t e->(odrije?iti, osloboditi) | | (pelin)<-e e->(apsolutan, apsolutni kod) As the user enter letters, you just march down the tree, printing all the words held in leaf nodes held in the current node. -- Neil Cerutti We shall reach greater and greater platitudes of achievement. --Richard J. Daley -- http://mail.python.org/mailman/listinfo/python-list
Re: searching algorithm
I have made a script that search anagram based on the ODS file ( call OSW in english, Official Scrabble Words) it load a file that contain 369085 words (one word per line) I create a dictionnary and store word into using the length of the word as key example : mydict[2] contain a list of word with length = 2 first I select the correct dict entry and in a second time I scan this list searching correct word my file contains 369085 and it's pretty fast Seb -- http://mail.python.org/mailman/listinfo/python-list
searching algorithm
Hi all! I have text file (english-croatian dictionary) with words in it in alphabetical order. This file contains 17 words in this format: english word: croatian word I want to make instant search for my gui Instant search, i mean that my program search words and show words to user as user type letters. yes, it needs to be fast Can someone give me some points (approaches) how to make this Should i make indexes and go with file.seek or should breake dictionary in peaces and put words that start with a in one and with b in another... ? So if i have this words absinth:pelin absinthe:pelin absolute:apsolutan absolute:apsolutni kod absolute:apsolutno absolute:čist absolute:nesumnjiv absolute:potpun absolute:savršen absolute coordinates:apsolutne koordinate absolute frequency:apsolutna učestalost absolute gap:apsolutni jaz absolute line spacing:apsolutni međurazmak linija absolute majority:apsolutna većina absolute pointing device:apsolutni pokazivački uređaj absolute quantity:apsolutni udio absolute value:apsolutna vrijednost absolute zero:apsolutna nula absolutely:apsolutno absolutely:bezuvjetno absolutely:nezavisno absolutely:potpuno absolutely:samostalno absolutely:sasvim absolution:odrješenje absolution:oproštaj absolutism:apsolutizam absolve:odriješiti absolve:osloboditi absorb:absorbirati absorb:apsorbirati absorb:crpsti if user type: "abs" program should list all words above in english and in croatian if user type: "absorb" than program should list last 3 words in english and in croatian any help would be appreciate! my apologies for bad english -- http://mail.python.org/mailman/listinfo/python-list