Re: python vs. grep
Ricardo Aráoz <[EMAIL PROTECTED]> writes:

> The easy/simple (too easy/simple?) way I see out of it is to read THE
> WHOLE file into memory and don't worry. But what if the file is too

The easiest and simplest approach is often the best with Python. Reading
in the whole file is rarely too heavy, and you omit the Python object
overhead entirely - all the code executes in the fast C extensions. If
the file is too big, you might want to look up mmap:

http://effbot.org/librarybook/mmap.htm

--
http://mail.python.org/mailman/listinfo/python-list
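[Editor's note: the mmap route the post points at can be sketched roughly like
this in today's Python (a minimal sketch; `count_matches` and the test pattern
are invented for illustration). `re` scans the mapped file as a single buffer
while the OS pages it in lazily, so the whole file never has to sit in
Python's heap at once. Note the pattern must be a bytes pattern, e.g.
`rb"python"`, since the file is mapped in binary mode.]

```python
import mmap
import re

def count_matches(path, pattern):
    """Count regex matches in a file without reading it into memory.

    The mmap object behaves like a read-only bytes buffer, so the
    compiled pattern can scan the entire file in one pass while the
    OS pages data in on demand.
    """
    pat = re.compile(pattern)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return sum(1 for _ in pat.finditer(mm))
```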
Re: python vs. grep
Ville M. Vainio wrote:

> Ricardo Aráoz <[EMAIL PROTECTED]> writes:
>
>> The easy/simple (too easy/simple?) way I see out of it is to read THE
>> WHOLE file into memory and don't worry. But what if the file is too
>
> The easiest and simplest approach is often the best with Python.

Keep forgetting that!

> If the file is too big, you might want to look up mmap:
>
> http://effbot.org/librarybook/mmap.htm

Thanks!
Re: python vs. grep
Ville Vainio wrote:

> On May 8, 8:11 pm, Ricardo Aráoz <[EMAIL PROTECTED]> wrote:
>
>> All these examples assume your regular expression will not span
>> multiple lines, but this can easily be the case. How would you
>> process the file with regular expressions that span multiple lines?
>
> re.findall/finditer, as I said earlier.

Hi, sorry it took so long to answer. Too much work.

findall/finditer do not address the issue, they merely find ALL the
matches in a STRING. But if you keep reading the file a line at a time
(as most examples given in this thread do) then you are STILL in trouble
when a regular expression spans multiple lines.

The easy/simple (too easy/simple?) way I see out of it is to read THE
WHOLE file into memory and not worry. But what if the file is too heavy?
So I was wondering if there is any other way out of it. Does grep read
the whole file into memory? Does it ONLY process a line at a time?
Re: python vs. grep
On Tue, 13 May 2008 00:03:08 +1000, Ricardo Aráoz <[EMAIL PROTECTED]> wrote:

> The easy/simple (too easy/simple?) way I see out of it is to read THE
> WHOLE file into memory and not worry. But what if the file is too
> heavy? So I was wondering if there is any other way out of it. Does
> grep read the whole file into memory? Does it ONLY process a line at a
> time?

Standard grep can only match a line at a time. Are you thinking of sed,
which has a sliding window? See
http://www.gnu.org/software/sed/manual/sed.html, Section 4.13

--
Kam-Hung Soh <http://kamhungsoh.com/blog> Software Salariman
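[Editor's note: for multi-line patterns on files too big to slurp, one middle
ground between grep's line-at-a-time model and reading everything is to scan
overlapping chunks, similar in spirit to sed's sliding window. This is a
sketch, not from the thread; it assumes no single match is longer than
`overlap` bytes, and the function name is invented.]

```python
import re

def iter_matches(path, pattern, chunk_size=1 << 20, overlap=4096):
    """Find multi-line regex matches in a large file, chunk by chunk.

    Each chunk is scanned together with the tail of the previous one,
    so a match that straddles a chunk boundary is still seen, as long
    as it is no longer than `overlap` bytes.  Absolute spans are
    tracked to avoid reporting the same match twice.
    """
    pat = re.compile(pattern, re.DOTALL)
    seen = set()        # absolute (start, end) spans already reported
    prev_tail = b""     # trailing bytes of the previous chunk
    offset = 0          # file position of the next chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = prev_tail + chunk
            base = offset - len(prev_tail)   # absolute position of buf[0]
            for m in pat.finditer(buf):
                span = (base + m.start(), base + m.end())
                if span not in seen:
                    seen.add(span)
                    yield span, m.group()
            prev_tail = buf[-overlap:]
            offset += len(chunk)
```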
Re: python vs. grep
On May 8, 8:11 pm, Ricardo Aráoz <[EMAIL PROTECTED]> wrote:

> All these examples assume your regular expression will not span
> multiple lines, but this can easily be the case. How would you
> process the file with regular expressions that span multiple lines?

re.findall/finditer, as I said earlier.
Re: python vs. grep
Anton Slesarev wrote:

> I've read a great paper about generators:
>
> http://www.dabeaz.com/generators/index.html
>
> The author says it's easy to write analogs of common linux tools such
> as awk, grep etc., and that performance could be even better. But I
> have some problems writing a grep analog with good performance.

https://svn.enthought.com/svn/sandbox/grin/trunk/

hth,
Alan Isaac
Re: python vs. grep
Alan Isaac wrote:

> Anton Slesarev wrote:
>
>> The author says it's easy to write analogs of common linux tools such
>> as awk, grep etc., and that performance could be even better. But I
>> have some problems writing a grep analog with good performance.
>
> https://svn.enthought.com/svn/sandbox/grin/trunk/

As the author of grin I can definitively state that it is not at all
competitive with grep in terms of speed. grep reads files really fast.
awk is probably beatable, though.

--
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
Re: python vs. grep
Anton Slesarev wrote:

> I try to save my time, not cpu cycles)
>
> I've got a file which I really need to parse:
>
> -rw-rw-r-- 1 xxx xxx 3381564736 May 7 09:29 bigfile
>
> That's my results:
>
> $ time grep python bigfile | wc -l
> 2470
>
> real    0m4.744s
> user    0m2.441s
> sys     0m2.307s
>
> And python scripts:
>
> import sys
>
> if len(sys.argv) != 3:
>     print 'grep.py pattern file'
>     sys.exit(1)
>
> f = open(sys.argv[2], 'r')
> print ''.join((line for line in f if sys.argv[1] in line)),
>
> $ time python grep.py python bigfile | wc -l
> 2470
>
> real    0m37.225s
> user    0m34.215s
> sys     0m3.009s
>
> Second script:
>
> import sys
>
> if len(sys.argv) != 3:
>     print 'grepwc.py pattern file'
>     sys.exit(1)
>
> f = open(sys.argv[2], 'r', 1)
> print sum((1 for line in f if sys.argv[1] in line)),
>
> $ time python grepwc.py python bigfile
> 2470
>
> real    0m39.357s
> user    0m34.410s
> sys     0m4.491s
>
> 40 sec versus 5. This is really sad...
>
> That was on freeBSD. On windows (cygwin), the size of bigfile is ~50 mb:
>
> $ time grep python bigfile | wc -l
> 51
>
> real    0m0.196s
> user    0m0.169s
> sys     0m0.046s
>
> $ time python grepwc.py python bigfile
> 51
>
> real    0m25.485s
> user    0m2.733s
> sys     0m0.375s

All these examples assume your regular expression will not span multiple
lines, but this can easily be the case. How would you process the file
with regular expressions that span multiple lines?
Re: python vs. grep
I try to save my time, not cpu cycles)

I've got a file which I really need to parse:

-rw-rw-r-- 1 xxx xxx 3381564736 May 7 09:29 bigfile

That's my results:

$ time grep python bigfile | wc -l
2470

real    0m4.744s
user    0m2.441s
sys     0m2.307s

And python scripts:

import sys

if len(sys.argv) != 3:
    print 'grep.py pattern file'
    sys.exit(1)

f = open(sys.argv[2], 'r')
print ''.join((line for line in f if sys.argv[1] in line)),

$ time python grep.py python bigfile | wc -l
2470

real    0m37.225s
user    0m34.215s
sys     0m3.009s

Second script:

import sys

if len(sys.argv) != 3:
    print 'grepwc.py pattern file'
    sys.exit(1)

f = open(sys.argv[2], 'r', 1)
print sum((1 for line in f if sys.argv[1] in line)),

$ time python grepwc.py python bigfile
2470

real    0m39.357s
user    0m34.410s
sys     0m4.491s

40 sec versus 5. This is really sad...

That was on freeBSD. On windows (cygwin), the size of bigfile is ~50 mb:

$ time grep python bigfile | wc -l
51

real    0m0.196s
user    0m0.169s
sys     0m0.046s

$ time python grepwc.py python bigfile
51

real    0m25.485s
user    0m2.733s
sys     0m0.375s
Re: python vs. grep
On May 6, 10:42 pm, Anton Slesarev <[EMAIL PROTECTED]> wrote:

> flines = (line for line in f if pat.search(line))

What about re.findall() / re.finditer() for the whole file contents?
Re: python vs. grep
Anton Slesarev wrote:

> But I have some problems writing a grep analog with good performance.

I don't think you can ever catch grep. Searching is its only purpose in
life and it's very good at it. You may be able to come closer; this
thread relates:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/2f564523f476840a/d9476da5d7a9e466

This relates to the speed of re. If you don't need regexes, don't use
re. If you do need re, an alternate re library might be useful, but you
aren't going to catch grep.
Re: python vs. grep
On May 7, 7:22 pm, Pop User <[EMAIL PROTECTED]> wrote:

> I don't think you can ever catch grep. Searching is its only purpose
> in life and it's very good at it. You may be able to come closer;
> this thread relates:
>
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
>
> This relates to the speed of re. If you don't need regexes, don't use
> re. If you do need re, an alternate re library might be useful, but
> you aren't going to catch grep.

In my last test I don't use re. As I understand it, the main problem is
in reading the file.
python vs. grep
I've read a great paper about generators:

http://www.dabeaz.com/generators/index.html

The author says it's easy to write analogs of common linux tools such as
awk, grep etc., and that performance could be even better. But I have
some problems writing a grep analog with good performance.

It's my script:

import re

pat = re.compile("sometext")

f = open("bigfile", 'r')
flines = (line for line in f if pat.search(line))
c = 0
for x in flines:
    c += 1
print c

and bash:

grep sometext bigfile | wc -l

The Python code is 3-4 times slower on windows, and as I remember it's
the same situation on linux... Buffering in open() even increases the
time. Is it possible to increase file reading performance?
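[Editor's note: since the pattern here is a plain substring, much of the gap
is per-line interpreter overhead rather than I/O. One way to shrink it (a
sketch in today's Python, not from the thread; the helper name and block size
are invented) is to read the file in large blocks and split into lines only
once per block:]

```python
def count_lines_containing(path, needle, block_size=1 << 20):
    """Count lines containing `needle`, reading the file in big blocks.

    A line that straddles two blocks is stitched back together via
    `tail`, so the count matches a line-by-line scan.
    """
    count = 0
    tail = ""
    with open(path, "r") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            lines = (tail + block).split("\n")
            tail = lines.pop()   # last element may be an unfinished line
            count += sum(1 for line in lines if needle in line)
    if needle in tail:           # file may not end with a newline
        count += 1
    return count
```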
Re: python vs. grep
On Tue, May 6, 2008 at 1:42 PM, Anton Slesarev <[EMAIL PROTECTED]> wrote:

> Is it possible to increase file reading performance?

Dunno about that, but this part:

> flines = (line for line in f if pat.search(line))
>
> c = 0
> for x in flines:
>     c += 1
>
> print c

could be rewritten as just:

print sum(1 for line in f if pat.search(line))
Re: python vs. grep
Anton Slesarev <[EMAIL PROTECTED]> writes:

> f = open("bigfile", 'r')
> flines = (line for line in f if pat.search(line))
> c = 0
> for x in flines:
>     c += 1
> print c

It would be simpler (and probably faster) not to use a generator
expression:

search = re.compile('sometext').search
c = 0
for line in open('bigfile'):
    if search(line):
        c += 1

Perhaps faster (because the number of name lookups is reduced), using
itertools.ifilter:

from itertools import ifilter
c = 0
for line in ifilter(search, open('bigfile')):
    c += 1

If 'sometext' is just text (no regexp wildcards) then even simpler:

...
for line in ...:
    if 'sometext' in line:
        c += 1

I don't believe you'll easily beat grep + wc using Python though.

Perhaps faster?

sum(bool(search(line)) for line in open('bigfile'))
sum(1 for line in ifilter(search, open('bigfile')))
...etc...

All this is untested!

--
Arnaud
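[Editor's note: Arnaud's counting variants, modernized as a sketch (in
Python 3, `itertools.ifilter` became the builtin `filter`; the function names
here are invented for illustration):]

```python
import re

def count_loop(path, pattern):
    # Explicit loop; the bound `search` method is looked up once.
    search = re.compile(pattern).search
    c = 0
    with open(path) as f:
        for line in f:
            if search(line):
                c += 1
    return c

def count_filter(path, pattern):
    # filter() runs the predicate loop in C, as ifilter did in Python 2.
    search = re.compile(pattern).search
    with open(path) as f:
        return sum(1 for _ in filter(search, f))

def count_substring(path, needle):
    # For plain text (no regexp wildcards), `in` beats re entirely.
    with open(path) as f:
        return sum(1 for line in f if needle in line)
```

All three give the same count; which is fastest depends on the pattern
and the Python version, so measure before choosing.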
Re: python vs. grep
2008/5/6, Anton Slesarev <[EMAIL PROTECTED]>:

> But I have some problems writing a grep analog with good performance.
[...]
> Python code 3-4 times slower on windows. And as I remember on linux the
> same situation... Buffering in open even increases the time. Is it
> possible to increase file reading performance?

The best advice would be not to try to beat grep, but if you really want
to, this is the right place ;) Here is my code:

$ cat grep.py
import sys

if len(sys.argv) != 3:
    print 'grep.py pattern file'
    sys.exit(1)

f = open(sys.argv[2], 'r')
print ''.join((line for line in f if sys.argv[1] in line)),

$ ls -lh debug.0
-rw-r- 1 gminick root 4,1M 2008-05-07 00:49 debug.0

---
$ time grep nusia debug.0 | wc -l
26009

real    0m0.042s
user    0m0.020s
sys     0m0.004s
---

---
$ time python grep.py nusia debug.0 | wc -l
26009

real    0m0.077s
user    0m0.044s
sys     0m0.016s
---

---
$ time grep nusia debug.0

real    0m3.163s
user    0m0.016s
sys     0m0.064s
---

---
$ time python grep.py nusia debug.0
[26009 lines here...]

real    0m2.628s
user    0m0.032s
sys     0m0.064s
---

So, printing the results takes 2.6 secs for python and 3.1s for the
original grep. Surprised? The only reason for this is that we have
reduced the number of write calls in the python example:

$ strace -o origgrep.log grep nusia debug.0
$ grep write origgrep.log | wc -l
26009

$ strace -o pygrep.log python grep.py nusia debug.0
$ grep write pygrep.log | wc -l
12

Wish you luck saving your CPU cycles :)

--
Regards,
Wojtek Walczak
http://www.stud.umk.pl/~wojtekwa/
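[Editor's note: the syscall-batching trick from the strace comparison above,
in standalone form (a sketch; the function name is invented): collect all
matching lines and hand them to a single write, instead of one write per
matching line.]

```python
import sys

def grep_batched(path, needle, out=sys.stdout):
    """Print matching lines with one write() instead of one per line.

    Joining the matches first is what reduced the python script's
    write calls relative to grep in the strace comparison.
    """
    with open(path) as f:
        out.write("".join(line for line in f if needle in line))
```

Passing any file-like object as `out` (e.g. an `io.StringIO`) makes the
behaviour easy to test without touching stdout.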