> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
A few ideas:

1) the shell way:

  bash$ sort file.in | uniq | wc -l

This doesn't strip whitespace... a little sed magic would strip the
whitespace off for you:

  bash$ sed 'regexp' file.in | sort | uniq | wc -l

where 'regexp' is something like this atrocity

  's/^[[:space:]]*\(\([[:space:]]*[^[:space:]][^[:space:]]*\)*\)[[:space:]]*$/\1/'

(If your sed supports "\s" and "\S" for "whitespace" and
"non-whitespace", it makes the expression a lot less hairy:

  's/^\s*\(\(\s*\S\S*\)*\)\s*$/\1/'

and, IMHO, a little easier to read. There might be a nice/concise
perl one-liner for this too.)

2) use a Python set:

  s = set()
  for line in open("file.in"):
      s.add(line.strip())
  print(len(s))

3) a compact version of #2:

  print(len(set(line.strip() for line in open("file.in"))))

or, if stripping the lines isn't a concern, it can just be

  print(len(set(open("file.in"))))

The set takes care of ensuring that no duplicates get entered.
Depending on how many distinct lines you *expect*, this could become
cumbersome, as every unique line has to be held in memory. A
stream-oriented solution is kinder on system resources, but requires
that the input be sorted first; a rough sketch of that approach
follows below.

-tkc
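Something like the following (untested) sketch is what I have in mind
for the stream-oriented version, assuming the already-sorted input
arrives on stdin; the script name count_distinct.py and the helper
name are just made up for illustration:

  import sys

  def count_distinct_sorted(lines):
      # Count distinct lines in an already-sorted stream.  Only the
      # previously seen line is kept in memory, so this stays cheap
      # even for very large inputs -- but it relies on duplicates
      # being adjacent, i.e. on the input having been sorted first.
      #
      # Note: if leading/trailing whitespace should be ignored, strip
      # it *before* the sort (e.g. with the sed trick above);
      # stripping it here could leave equal lines non-adjacent.
      count = 0
      prev = None
      for line in lines:
          line = line.rstrip("\n")   # drop only the newline
          if line != prev:           # first occurrence in sorted order
              count += 1
              prev = line
      return count

  if __name__ == "__main__":
      # usage:  sort file.in | python count_distinct.py
      print(count_distinct_sorted(sys.stdin))

If the newline/whitespace details don't matter, itertools.groupby does
the same bookkeeping and the whole loop collapses to

  sum(1 for _ in itertools.groupby(sys.stdin))

since groupby lumps consecutive equal lines into one group apiece.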