Emad Nawfal (عماد نوفل) wrote:
Thank you so much Steve,
I followed your advice about calculating on the fly and it really rang
a bell. Now I have this script. It's faster and does not give me the
nasty memory error message the first one sometimes did:
# Chi-squared collocation discovery
# Important definitions first. Let's suppose that we
# are trying to find whether "powerful computers" is a collocation
# N = The number of all bigrams in the corpus
# O11 = how many times the bigram "powerful computers" occurs in the corpus
# O22 = the number of bigrams whose first word is not "powerful" and
#       whose second word is not "computers"
# O12 = The number of bigrams whose second word is our second word
#       but whose first word is not "powerful"
# O21 = The number of bigrams whose first word is our first word, but
#       whose second word is different from our second word
###########################################################
print """
*************************************************
* Welcome to the Collocationer                  *
*                                               *
*                                               *
*************************************************
"""
# Let's first get the text and turn it into bigrams
#tested_collocate = raw_input("Enter the bigram you think is a collocation\n")
#word1 = tested_collocate.split()[0]
#word2 = tested_collocate.split()[1]
word1 = 'United'
word2 = 'States'
infile = open("1.txt") # open() is the preferred spelling of file()
# initialize the counters
N = 0
O11 = 0
O22 = 0
O12 = 0
O21 = 0
for line in infile:
    length = len(line.split()) # a variable to hold the length of each line
    if len(line.split()) <=1:
        continue
    for word in line.split():
        N+=1
    for i,v in enumerate(line.split()):
        if i< length-1:
            if word1 == v and word2 == line.split()[i+1]:
                O11 +=1
    for i,v in enumerate(line.split()):
        if i < length -1:
            if word1 != v and word2 != line.split()[i+1]:
                O22+=1
    for i,v in enumerate(line.split()):
        if i< length-1:
            if word1 != v and word2 == line.split()[i+1]:
                O12+=1
    for i,v in enumerate(line.split()):
        if i< length-1:
            if word1 == v and word2 != line.split()[i+1]:
                O21+=1
chi2 = (N * ((O11 * O22 - O12 * O21) ** 2))/ float((O11 + O12) * (O11
    + O21) * (O12 + O22) * (O21 + O22))
print "Chi-Squared = ", chi2
if chi2 > 3.841: # critical value for p = 0.05 with 1 degree of freedom
    print "These two words form a collocation"
else:
    print "These two words do not form a collocation"
I'd like to jump in here and offer a few refinements that make the code
simpler and more "Pythonic". In the background I'm also researching how
to use dictionaries to make things even better. Some guidelines:
- use initial lower case for variable and function names, upper case
for classes
- don't repeat calculations - do them once and save the result in a
variable
- don't repeat loops - you can put the calculations for o11, o12, o21 and
o22 all under one for loop
- obtain one word at a time as rightWord and, once it has been counted,
save it as leftWord for the next pass
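To illustrate "don't repeat calculations", here is a minimal sketch that splits a line once and pairs each word with its right-hand neighbour. The helper name bigrams is hypothetical, not something from your script, and the snippet is written so it also runs under Python 3:

```python
def bigrams(line):
    """Return the adjacent word pairs of one line as a list of tuples."""
    words = line.split()  # split once and reuse the result
    return list(zip(words, words[1:]))  # (left, right) neighbours


pairs = bigrams("the United States and the United Nations")
# counting one specific bigram is then a single call
print(pairs.count(("United", "States")))
```

A line with fewer than two words simply yields an empty list, so no separate length check is needed.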
# your initial code goes here, up to but not including
# for line in infile:
line = infile.readline().split() # get the first line so we can get the first word
leftWord = line[0]
line = line[1:] # drop the first word
n = 1 # count the first word
o11 = o12 = o21 = o22 = 0
while line: # stops at end of file, and also at the first blank line
    n += len(line) # count words
    for rightWord in line:
        if word1 == leftWord and word2 == rightWord:
            o11 += 1
        elif word1 != leftWord and word2 != rightWord:
            o22 += 1
        elif word1 != leftWord and word2 == rightWord:
            o12 += 1
        else: # no need to test
            o21 += 1
        leftWord = rightWord
    line = infile.readline().split()
# rest of your code follows starting with
# chi2 = ... (using the lower-case names n, o11, o12, o21, o22)
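The same single-pass idea can be sketched as a function, which makes it easy to try without a file. The name count_bigrams and the decision to accept any iterable of lines are my assumptions for illustration. Unlike the readline loop, this variant skips blank lines instead of stopping at them, and it runs under Python 3 as well:

```python
def count_bigrams(lines, word1, word2):
    """Return (n, o11, o12, o21, o22) for the bigram (word1, word2).

    `lines` is any iterable of strings, for example an open file.
    Bigrams may span line boundaries, as in the loop above.
    """
    n = o11 = o12 = o21 = o22 = 0
    leftWord = None  # no left neighbour until the first word is seen
    for line in lines:
        for rightWord in line.split():  # a blank line yields no words
            n += 1  # count words
            if leftWord is not None:
                if word1 == leftWord and word2 == rightWord:
                    o11 += 1
                elif word1 != leftWord and word2 != rightWord:
                    o22 += 1
                elif word1 != leftWord and word2 == rightWord:
                    o12 += 1
                else:  # first word matches, second does not
                    o21 += 1
            leftWord = rightWord
    return n, o11, o12, o21, o22


sample = ["the United States and", "", "the United Nations met in the States"]
print(count_bigrams(sample, "United", "States"))
```

With a real corpus you would pass the open file object in place of the sample list.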
# If you want to get even "sexier" you could create a nested list of counters
counters = [[0, 0], [0, 0]]
# where the elements, read left to right, represent o22, o12, o21 and o11,
# taking advantage of the fact that False == 0 and True == 1:
while line:
    n += len(line) # count words
    for rightWord in line:
        counters[word1 == leftWord][word2 == rightWord] += 1
        leftWord = rightWord
    line = infile.readline().split()
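To connect the counters back to the chi-squared formula, here is a hedged sketch. The function names tally and chi_squared and the zero-denominator guard are my assumptions. The formula is the one from your script, except that n is taken as the sum of the four cells (the number of bigrams) rather than the word count:

```python
def tally(words, word1, word2):
    """Count bigrams into counters[word1 == left][word2 == right]."""
    counters = [[0, 0], [0, 0]]
    leftWord = None
    for rightWord in words:
        if leftWord is not None:
            # bool is a subclass of int, so False/True index as 0/1
            counters[word1 == leftWord][word2 == rightWord] += 1
        leftWord = rightWord
    return counters


def chi_squared(counters):
    """Chi-squared for a 2x2 table laid out as [[o22, o12], [o21, o11]]."""
    (o22, o12), (o21, o11) = counters
    n = o11 + o12 + o21 + o22  # total number of bigrams
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        return 0.0  # assumption: treat a degenerate table as "no evidence"
    return n * (o11 * o22 - o12 * o21) ** 2 / float(denominator)


words = "the United States and the United Nations met in the States".split()
table = tally(words, "United", "States")
print(table)
print(chi_squared(table))
```

For this tiny sample the table is [[7, 1], [1, 1]] and the statistic is 1.40625, which is below the 3.841 threshold, so the pair would not be reported as a collocation.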
--
Bob Gailer
Chapel Hill NC