On Fri, 17 Dec 2004 14:22:34 +0000, rumours say that [EMAIL PROTECTED] might have written:
sf:

>sf wrote:
>> The point is that when you have 100,000s of records, this grep
>> becomes really slow?
>
>There are performance bugs with current versions of grep
>and multibyte characters that are only getting addressed now.
>To work around these do `export LANG=C` first.

You also should use the -F flag that Pádraig suggests, since you don't
have regular expressions in the B file.

>In my experience grep is not scalable since it's O(n^2).
>See below (note A and B are randomized versions of
>/usr/share/dict/words (and therefore worst case for the
>sort method)).
>
>$ wc -l A B
>  45427 A
>  45427 B
>
>$ export LANG=C
>
>$ time grep -Fvf B A
>real    0m0.437s
>
>$ time sort A B B | uniq -u
>real    0m0.262s
>
>$ rpm -q grep coreutils
>grep-2.5.1-16.1
>coreutils-4.5.3-19

sf, you'd better run your own benchmarks on your own machine (there is
quick sample code in other posts of mine and Pádraig's; a rough sketch
also follows below), since on my test machine the numbers are reversed
relative to Pádraig's: grep takes half the time of the sort pipeline.

Package versions (on SuSE 9.1 64-bit):

$ rpm -q grep coreutils
grep-2.5.1-427
coreutils-5.2.1-21

Language:

$ echo $LANG
en_US.UTF-8

Caution: the two solutions are interchangeable only as long as you
don't have duplicate lines in the A file. If you do, use the grep
version; a tiny illustration of why follows below.
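With a duplicated line in A, the sort method sees that line more than
once and `uniq -u' silently throws it away. A made-up two-word example
(file names A and B as above):

$ printf 'foo\nfoo\nbar\n' > A    # 'foo' appears twice in A
$ printf 'bar\n' > B
$ grep -Fvf B A                   # keeps both copies of 'foo'
foo
foo
$ sort A B B | uniq -u            # 'foo' occurs twice, so -u drops it
$                                 # (no output; 'foo' is wrongly eaten)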
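And since the actual sample code lives in other posts, here is roughly
what such a benchmark looks like; the awk shuffle is just one way I'd
randomize the word list (old coreutils has no dedicated shuffle tool),
so substitute your own randomizer if you prefer:

$ awk 'BEGIN{srand()}{print rand()"\t"$0}' /usr/share/dict/words | sort -n | cut -f2- > A
$ awk 'BEGIN{srand()}{print rand()"\t"$0}' /usr/share/dict/words | sort -n | cut -f2- > B
$ export LANG=C                          # avoid grep's multibyte slow path
$ time grep -Fvf B A > /dev/null         # grep method
$ time sort A B B | uniq -u > /dev/null  # sort method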
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...