Re: [Tutor] Bigrams and nested dictionaries

Orri Ganel Mon, 03 Apr 2006 22:02:01 -0700

My only comment is that this considers spaces and punctuation (like parentheses, brackets, etc.), too, which I assume you don't want seeing as how that has little to do with natural languages. My suggestion would be to remove the any punctuation or whitespace keys from the dictionary after you've built the dictionary. Unless, of course, you're interested in what letter words start with, in which case you could somehow merge them into a single key?

Cheers,
Orri

On 4/4/06, Michael Broe <[EMAIL PROTECTED]> wrote:

Well coming up with this has made me really love Python. I worked on
this with my online pythonpenpal Kyle, and here is what we came up
with. Thanks to all for input so far.

My first idea was to use a C-type indexing for-loop, to grab a two-
element sequence [i, i+1]:

dict = {}
for i in range(len(t) - 1):
        if not dict.has_key(t[i]):
                dict[t[i]] = {}
        if not dict[t[i]].has_key(t[i+1]):
                dict[t[i]][t[i+1]] = 1
        else:
                dict[t[i]][t[i+1]] += 1

Which works, but. Kyle had an alternative take, with no indexing, and
after we worked on this strategy it seemed very Pythonesque, and ran
almost twice as fast.

----

#!/usr/local/bin/python

import sys
file = open(sys.argv[1], 'rb').read()

# We imagine a 2-byte 'window' moving over the text from left to right
#
#          +-------+
# L  o  n  | d   o |  n  .  M  i  c  h  a  e  l  m  a  s      t  e
r  m ...
#          +-------+
#
# At any given point, we call the leftmost byte visible in the window
'L', and the
# rightmost byte 'R'.
#
#          +-----------+
# L  o  n  | L=d   R=o |  n  .  M  i  c  h  a  e  l  m  a  s      t
e  r  m ...
#          +-----------+
#
# When the program begins, the first byte is preloaded into L, and we
position R
# at the second byte of the file.
#

dict = {}

L = file[0]
for R in file[1:]:      # move right edge of window across the file
        if not L in dict:
                dict[L] = {}

        if not R in dict[L]:
                dict[L][R] = 1
        else:
                dict[L][R] += 1

        L = R           # move character in R over to L

# that's it. here's a printout strategy:

for entry in dict:
        print entry, ':', sum(dict[entry].values())
        print dict[entry]
        print

----

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor

--
Email: singingxduck AT gmail DOT com
AIM: singingxduck
Programming Python for the fun of it.

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor

Re: [Tutor] Bigrams and nested dictionaries

Reply via email to