Re: NEED HELP-process words in a text file

Tim Chase Sat, 18 Jun 2011 17:13:45 -0700

On 06/18/2011 06:21 PM, Cathy James wrote:

     freq = [] #empty dict to accumulate words and word length

While you say you create an empty dict, using "[]" creates an empty *list*, not a dict. Either your comment is wrong or your code is wrong. :) Given your usage, I presume you want a dict, not a list.
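To illustrate the difference, here is a minimal Python 3 sketch (the key "spam" and the variable names are made up for the example). A dict accepts a word as a key, while a list only accepts integer positions:

```python
freq_list = []   # a list: indexed by integer position
freq_dict = {}   # a dict: maps hashable keys (such as words) to values

freq_dict["spam"] = 1          # works: stores a count under the key "spam"

try:
    freq_list["spam"] = 1      # fails: list indices must be integers
except TypeError as exc:
    error_message = str(exc)
    print("list rejected the string key:", error_message)
```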

     for line in filename:
         punc = string.punctuation + string.whitespace #use Python's built-in punctuation and whitespace

Since you don't change "punc" in your loop, you'd get better performance by hoisting this outside of the loop so it's only evaluated once. Not that it should matter *that* greatly, but it's just a bad-code-smell.

         for i, word in enumerate (line.replace (punc, "").lower().split()):

.replace() doesn't operate on sets of characters, but rather strings. So unless your line contains the exact text in "punc" (unlikely), that replacement is a NOP. There are a couple ways to go about removing unwanted characters:

- make a set of those characters and produce a resulting string from things not in that set:


 punc_set = set(punc)
 line = ''.join(c for c in line if c not in punc_set)

- use a regexp to strip them out...something like

  punc_re = re.compile("[" + re.escape(punc) + "]")
  ...
  line = punc_re.sub('', line)

- use string translations. I'm not as familiar with these, but the following seemed to work for me, abusing the 2nd "deletechars" parameter for your particular use-case:


  line = line.translate(None, punc)

I don't see .translate(None) documented anywhere. My random effort seemed to work in 2.6, but fails in 2.5 and prior. YMMV.
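For anyone reading this on Python 3: the two-argument .translate(None, punc) form no longer exists there, and the usual replacement is a table built with str.maketrans("", "", punc). The following is only a sketch comparing the three approaches above. The sample line is made up, and it assumes only string.punctuation is removed, since deleting whitespace as well would fuse the words together before .split() gets a chance to separate them:

```python
import re
import string

# Assumption: only punctuation is removed here. Deleting string.whitespace
# as well would fuse the words together before .split() could separate them.
punc = string.punctuation
line = "Hello, world! It's a test."   # made-up sample line

# .replace() looks for the whole string `punc`, which is not in the line,
# so nothing changes.
assert line.replace(punc, "") == line

# 1. Set membership plus a generator expression.
punc_set = set(punc)
via_set = "".join(c for c in line if c not in punc_set)

# 2. A regular expression character class.
punc_re = re.compile("[" + re.escape(punc) + "]")
via_re = punc_re.sub("", line)

# 3. A translation table whose third argument lists characters to delete.
delete_table = str.maketrans("", "", punc)
via_translate = line.translate(delete_table)

print(via_set)
print(via_set == via_re == via_translate)
```

All three produce the same string for this input, so which one to use is mostly a matter of taste.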

             if word in freq:
                 freq[word] +=1 #increment current count if word already in dict

             else:
                 freq[word] = 0 #if punctuation encountered, frequency=0 word length = 0

Again, your 2nd comment disagrees with your code. As an aside, if you're using 2.5 or greater, I'd use collections.defaultdict(int) as the accumulator:


  freq = collections.defaultdict(int)
  ...
  freq[word] += 1
  # no need to check presence
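As a small self-contained sketch of that accumulator (the sample sentence is made up): a missing key is created with the value int(), i.e. 0, and then incremented, so the first occurrence of a word counts as 1. The quoted code, by contrast, sets the count to 0 on first sight, which appears to leave every word one short.

```python
import collections

words = "the cat saw the other cat".split()   # made-up sample data

freq = collections.defaultdict(int)
for word in words:
    # A missing key is created with int() == 0, then incremented.
    freq[word] += 1

print(dict(freq))
```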

         for word in freq.items():
             print("Length /t"+"Count/n"+ freq[word],+'/t' + len(word))#print word count and length of word separated by a tab


Where to begin:

- Your escapes are using "/" instead of "\" for <tab> and <newline>, which I expect will mess up the formatting.

- You're also labeling them "Length/Count" but printing "count/length".


- You're iterating over freq.items(), which yields (word, count) pairs rather than bare words, so that should be written as

  for word, count in freq.items():

or

  for word in freq:

- Additionally, adding the bits together makes it somewhat hard to understand.


I'd use something like

  for word, count in freq.items():
    print("Word \tLength \tCount\n%s \t%i \t%i" % (
      word, len(word), count))
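Pulling the suggestions in this message together, here is one possible end-to-end sketch for Python 3. It is not the only way to do it. The function name count_words and the sample lines are made up for illustration, it assumes only punctuation should be stripped (so whitespace is left for .split() to work on), and it prints the header once before the loop rather than once per word:

```python
import collections
import string

def count_words(lines):
    """Return a mapping of lowercased word -> occurrence count."""
    # Built once, outside the loop. Assumption: only punctuation is deleted,
    # so that .split() can still separate words on whitespace.
    delete_punc = str.maketrans("", "", string.punctuation)
    freq = collections.defaultdict(int)
    for line in lines:
        for word in line.translate(delete_punc).lower().split():
            freq[word] += 1
    return freq

sample = ["The cat sat.", "The dog sat, too!"]   # made-up input lines
freq = count_words(sample)

print("Word\tLength\tCount")          # header printed once
for word, count in sorted(freq.items()):
    print("%s\t%i\t%i" % (word, len(word), count))
```

Since an open text file iterates line by line, the same function should also accept a file object in place of the sample list.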

-tkc


--
http://mail.python.org/mailman/listinfo/python-list
