On Fri, 17 Dec 2004 14:22:34 +0000, rumours say that [EMAIL PROTECTED]
might have written:

sf:

>sf wrote:
>> The point is that when you have 100,000s of records, this grep becomes
>> really slow?
>
>There are performance bugs with current versions of grep
>and multibyte characters that are only getting addressed now.
>To work around these do `export LANG=C` first.

You should also use the -F flag that Pádraig suggests, since you don't
have regular expressions in the B file.
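
For instance (made-up two-line A and one-line B): plain -f treats each
line of B as a regular expression, so a dot matches any character, while
-F takes the lines literally (and is faster):

$ printf 'abc\na.c\n' > A
$ printf 'a.c\n' > B
$ grep -vf B A     # "a.c" as a regex also matches abc, so nothing is left
$ grep -Fvf B A    # fixed strings: only the literal a.c is filtered out
abc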

>In my experience grep is not scalable since it's O(n^2).
>See below (note A and B are randomized versions of
>/usr/share/dict/words (and therefore worst case for the
>sort method)).
>
>$ wc -l A B
>   45427 A
>   45427 B
>
>$ export LANG=C
>
>$ time grep -Fvf B A
>real    0m0.437s
>
>$ time sort A B B | uniq -u
>real    0m0.262s
>
>$ rpm -q grep coreutils
>grep-2.5.1-16.1
>coreutils-4.5.3-19

sf, you'd better do your own benchmarks (there is quick sample code in
other posts of mine and Pádraig's) on your machine, since on my test
machine the numbers are the reverse of Pádraig's (grep takes half the
time).
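
If you want something quick to time on your own data, a sketch along
these lines should do (it assumes a word list at /usr/share/dict/words
and shuf to randomize it; shuf only ships with newer coreutils, and any
other shuffler will do just as well):

$ export LANG=C                    # avoid the multibyte slowdown noted above
$ shuf /usr/share/dict/words > A
$ shuf /usr/share/dict/words > B
$ time grep -Fvf B A > /dev/null
$ time sort A B B | uniq -u > /dev/null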

package versions (on SuSE 9.1 64-bit):

$ rpm -q grep coreutils
grep-2.5.1-427
coreutils-5.2.1-21

language:
$ echo $LANG
en_US.UTF-8

Caution: both solutions are interchangeable only as long as you don't
have duplicate lines in the A file.  If you do, use the grep version.
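
The reason: writing B twice in "sort A B B" makes every line of B occur
at least twice, so uniq -u (print only lines occurring exactly once)
drops them all; a line of A survives only if it is absent from B *and*
unduplicated in A.  A toy demonstration:

$ printf 'foo\nfoo\nbar\n' > A
$ printf 'bar\n' > B
$ grep -Fvf B A            # both copies of foo survive
foo
foo
$ sort A B B | uniq -u     # foo sorts next to itself, so uniq -u eats it
$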
-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...