Emad Nawfal (عماد نوفل) wrote:
Thank you so much Steve,
I followed your advice about calculating on the fly and it really rang a bell. Now I have this script. It's faster and does not give me the nasty memory error message the first one sometimes did:
# Chi-squared collocation discovery
# Important definitions first. Let's suppose that we
# are trying to find whether "powerful computers" is a collocation
# N = The number of all bigrams in the corpus
# O11 = how many times the bigram "powerful computers" occurs in the corpus
# O22 = the number of bigrams not having either word in our collocation = N - O11
# O12 = The number of bigrams whose second word is our second word
# but whose first word is not "powerful"
# O21 = The number of bigrams whose first word is our first word, but whose second word
# is different from our second word
###########################################################
print """
*************************************************
*         Welcome to the Collocationer          *
*                                               *
*************************************************
"""
# Let's first get the text and turn into bigrams
#tested_collocate = raw_input("Enter the bigram you think is a collocation\n")
#word1 = tested_collocate.split()[0]
#word2 = tested_collocate.split()[1]
word1 = 'United'
word2 = 'States'
infile = file("1.txt")
# initialize the counters
N = 0
O11= 0
O22 = 0
O12 = 0
O21 = 0
for line in infile:
    length = len(line.split()) # a variable to hold the length of each line

    if len(line.split()) <=1:
        continue
    for word in line.split():
        N+=1
    for i,v in enumerate(line.split()):
        if i< length-1:
            if word1 == v and word2 == line.split()[i+1]:
                O11 +=1
    for i,v in enumerate(line.split()):
        if i < length -1:
            if word1 != v and word2 != line.split()[i+1]:
                O22+=1
    for i,v in enumerate(line.split()):
        if i< length-1:
            if word1 != v and word2 == line.split()[i+1]:
                O12+=1
    for i,v in enumerate(line.split()):
        if i< length-1:
            if word1 == v and word2 != line.split()[i+1]:
                O21+=1
chi2 = (N * ((O11 * O22 - O12 * O21) ** 2))/ float((O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22))
print "Chi-Squared = ", chi2
if chi2 > 3.841:
    print "These two words form a collocation"
else:
    print "These two words do not form a collocation"
I'd like to jump in here and offer a few refinements that make the code simpler and more "Pythonic". In the background I'm also researching how to use dictionaries to make things even better. Some guidelines:
- use initial lower case for variable and function names, upper case for classes
- don't repeat calculations: do them once and save the result in a variable
- don't repeat loops: you can put the calculations for o11, o12, o21 and o22 all under one for loop
- obtain one word at a time as rightWord and then save it as leftWord

# your initial code goes here, up to but not including
# for line in infile:

line = infile.readline().split() # get the first line so we can get the first word
leftWord = line[0] # assumes the first line is not blank
line = line[1:] # drop the first word
n = 1 # count the first word
o11 = o12 = o21 = o22 = 0
while True:
    n += len(line) # count words
    for rightWord in line:
        if word1 == leftWord and word2 == rightWord:
            o11 += 1
        elif word1 != leftWord and word2 != rightWord:
            o22 += 1
        elif word1 != leftWord and word2 == rightWord:
            o12 += 1
        else: # no need to test
            o21 += 1
        leftWord = rightWord
    rawLine = infile.readline()
    if not rawLine: # readline() returns '' only at end of file
        break
    line = rawLine.split() # a blank line gives [], which the for loop skips

# rest of your code follows starting with
# chi2 = ...
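
For the chi2 step itself, here is a minimal sketch of a helper that takes the four counts and guards against a zero denominator (which happens if, say, word1 never occurs in the file). The name chiSquared and the choice to return 0.0 in that case are my assumptions, not part of the original script. It also uses the sum of the four counts as n, which is the number of bigrams actually counted, where the original script uses the word count N.

```python
def chiSquared(o11, o12, o21, o22):
    """Chi-squared statistic for a 2x2 table of bigram counts."""
    n = o11 + o12 + o21 + o22  # total number of bigrams counted
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        # a row or column of the table is empty, so the test is undefined;
        # returning 0.0 here is an assumption made for this sketch
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / float(denominator)

# made-up counts, just to show the call
chi2 = chiSquared(10, 20, 30, 40)
isCollocation = chi2 > 3.841  # same threshold as the original script
```

With the counters from the loop you would call it as chiSquared(o11, o12, o21, o22) and compare the result with 3.841 as before.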

# If you want to get even "sexier" you could create an array of counters
# counters = [[0,0],[0,0]]
# where the elements left to right represent o22, o12, o21 and o11
# taking advantage of the fact that False == 0 and True == 1:
    for rightWord in line:
        counters[word1 == leftWord][word2 == rightWord] += 1
        leftWord = rightWord
    # the code that reads the next line stays as it was
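
Put together as something you can run on its own, the counter idea might look like the sketch below. countBigrams is a hypothetical name of mine. It takes any iterable of lines (an open file works), so a blank line is simply skipped, and it pairs the last word of one line with the first word of the next. The sample lines are made up.

```python
def countBigrams(lines, word1, word2):
    """Return a 2x2 table of bigram counts for (word1, word2).

    counters[a][b]: a is 1 when the left word is word1,
    b is 1 when the right word is word2. So counters[1][1] is o11,
    counters[1][0] is o21, counters[0][1] is o12 and counters[0][0] is o22.
    """
    counters = [[0, 0], [0, 0]]
    leftWord = None  # there is no left neighbour before the first word
    for line in lines:
        for rightWord in line.split():
            if leftWord is not None:
                # True and False index the lists as 1 and 0
                counters[word1 == leftWord][word2 == rightWord] += 1
            leftWord = rightWord
    return counters

# made-up sample text; with a real file pass the open file object instead
sample = ["the United States and the", "", "United Kingdom signed"]
counters = countBigrams(sample, "United", "States")
o22, o12 = counters[0]
o21, o11 = counters[1]
```

Unpacking the two rows at the end gives back the four names the chi2 line expects.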




--
Bob Gailer
Chapel Hill NC 919-636-4239

