On 1/22/2010 9:58 PM, Chris Jones wrote:
On Fri, Jan 22, 2010 at 08:46:35PM EST, Terry Reedy wrote:

Do you mean I should just read the file one character at a time?

Whoops, my misdirection (you can .read(1), but this is s  l   o   w).
I meant to suggest processing it a char at a time.

1. If not too big,

for c in open(x, 'rb').read():  # left .read() off
# 'b' will get bytes, though ord(c) is the same for ascii chars, bytes or unicode

2. If too big for that,

for line in open(x):
  for c in line:    # or left off this part
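
A minimal sketch of option 2, written for Python 3. The helper name iter_chars, the latin-1 default (chosen so every byte maps to exactly one char) and the throwaway sample file are assumptions for illustration, not part of the original suggestion.

```python
import os
import tempfile

def iter_chars(path, encoding="latin-1"):
    """Yield the characters of a file one at a time, reading line by line."""
    with open(path, encoding=encoding) as f:
        for line in f:          # only one line is held in memory at a time
            for c in line:
                yield c

# Self-check on a small temporary file (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as tmp:
    tmp.write("int x;\nx++;\n")
chars = list(iter_chars(tmp.name))
os.remove(tmp.name)
```

Because iter_chars is a generator, the caller can count or filter characters without ever holding the whole file in memory.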


To only count ascii chars, as should be the case for C code,

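achars = [0]*95                  # one slot per printable ascii char, 32..126
for c in open('xxx').read():
   i = ord(c) - 32
   if 0 <= i < 95:               # skip control chars and anything non-ascii
     achars[i] += 1

for i, n in enumerate(achars):
   print(chr(i + 32), n)         # add the offset back to recover the char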

or sum subsets as desired.
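
As a rough illustration of summing subsets, here is a Python 3 sketch that assumes one slot per printable ASCII code 32..126 (95 slots, index 0 being the space). The names count_printable and subset_total and the sample line of C are hypothetical.

```python
import string

def count_printable(text):
    """Return a 95-slot list of counts for ASCII codes 32..126."""
    counts = [0] * 95
    for c in text:
        i = ord(c) - 32
        if 0 <= i < 95:          # ignore control chars and non-ASCII
            counts[i] += 1
    return counts

def subset_total(counts, chars):
    """Sum the counts of every character in `chars` (printable ASCII only)."""
    return sum(counts[ord(c) - 32] for c in chars)

sample = "for (i = 0; i < 10; i++) { a[i] = 0; }\n"
counts = count_printable(sample)
digits = subset_total(counts, string.digits)
brackets = subset_total(counts, "{}[]()")
```

The same subset_total call works for any group of interest, for example string.ascii_letters or string.punctuation.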

Thanks much for the snippet, let me play with it and see if I can come
up with a Unicode/utf-8 version, since while I'm at it I might as well
write something a bit more general than C code.

Since utf-8 is backward-compatible with 7-bit ASCII, this shouldn't be
a problem.

For any extended ascii, use a larger array (256 slots, one per byte value)
without decoding (until print, if need be). For unicode, pass an encoding
to open() and 'for c in line' will yield unicode chars. Then use *one*
dict or defaultdict instead of an array. I think something like

from collections import defaultdict
d = defaultdict(int)
...
    d[c] += 1 # if c is new, d[c] defaults to int() == 0
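
Filled out, the idea might look like the following Python 3 sketch. The function name count_chars and the sample text are assumptions; io.StringIO merely stands in for a file opened with open(path, encoding='utf-8').

```python
import io
from collections import defaultdict

def count_chars(lines):
    """Count every character found in an iterable of text lines."""
    d = defaultdict(int)
    for line in lines:
        for c in line:
            d[c] += 1            # a new c starts from int() == 0
    return d

# Stand-in for: open(path, encoding='utf-8')
sample = io.StringIO("na\u00efve caf\u00e9\nna\u00efve\n")
counts = count_chars(sample)
```

A single dict keyed by character needs no size chosen up front, which is why it suits Unicode better than a fixed-length list.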

Terry Jan Reedy



--
http://mail.python.org/mailman/listinfo/python-list
