Re: standard input, for s in f, and buffering

2008-04-01 Thread Jorgen Grahn
On Mon, 31 Mar 2008 22:27:39 -0700 (PDT), Paddy [EMAIL PROTECTED] wrote:
 On Mar 31, 11:47 pm, Jorgen Grahn [EMAIL PROTECTED] wrote:
 On 31 Mar 2008 06:54:29 GMT, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] 
 wrote:

  On Sun, 30 Mar 2008 21:02:44 +, Jorgen Grahn wrote:

  I realize this has to do with the extra read-ahead buffering documented
  for file.next() and that I can work around it by using file.readline()
  instead.

  You can use ``for line in lines:`` and pass ``iter(sys.stdin.readline,'')``
  as iterable for `lines`.

 Thanks.  I wasn't aware that building an iterator was that easy. The
 tiny example program then becomes

 By the way, I timed the three solutions given so far using 5 million
 lines of standard input.  It went like this:

   for s in file       :  1
   iter(readline, '')  :  1.30  (i.e. 30% worse than for s in file)
   while 1             :  1.45  (i.e. 45% worse than for s in file)
   Perl while(<>)      :  0.65

 I suspect most of the slowdown comes from the interpreter having to
 execute more user code, not from lack of extra heavy input buffering.

 Hi Juergen,
 From the python manpage:
 -u     Force stdin, stdout and stderr to be totally unbuffered.
        On systems where it matters, also put stdin, stdout and
        stderr in binary mode.  Note that there is internal
        buffering in xreadlines(), readlines() and file-object
        iterators (for line in sys.stdin) which is not influenced
        by this option.  To work around this, you will want to use
        sys.stdin.readline() inside a while 1: loop.

 Maybe try adding the python -u option?

Doesn't help when the code is in a module, unfortunately.

 Buffering is supposed to help when processing large amounts of I/O,
 but gives the 'many lines in before any output' that you saw
 originally.

Is supposed to help, yes.  I suspect (but cannot prove) that the
kind of buffering done here doesn't buy more than 10% or so even in
artificial tests, if you consider the fact that for s in f is in
itself a faster construct than my workarounds in user code.

Note that even with buffering, there seems to be one system call per
line when used interactively, and lines are of course passed to user
code one by one.

Lastly, there is still the question about having to press Ctrl-D twice
to end the loop, which I mentioned in the original posting.  That
still feels very wrong.

 If the program is to be mainly used to handle millions of
 lines from a pipe or file, then why not leave the buffering in?
 If you need both interactive and batch friendly I/O modes you might
 need to add the ability to switch between two modes for your program.

That is exactly the tradeoff I am dealing with right now, and I think
I have come to the conclusion that I want no buffering.

My source data set can be huge (gigabytes of text) but in reality it
is boiled down to at most 5 lines by a Perl script further to the
left in my pipeline:

  zcat foo.gz | perl | python  bar

The Perl script takes ~100 times longer to execute, and both are
designed as filters, which means a modest increase in CPU time for the
Python script isn't visible to the end user.

/Jorgen

-- 
  // Jorgen Grahn grahn@Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se  R'lyeh wgah'nagl fhtagn!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: standard input, for s in f, and buffering

2008-03-31 Thread Marc 'BlackJack' Rintsch
On Sun, 30 Mar 2008 21:02:44 +, Jorgen Grahn wrote:

 I realize this has to do with the extra read-ahead buffering documented for
 file.next() and that I can work around it by using file.readline()
 instead.
 
 The problem is, for s in f is the elegant way of reading files line
 by line. With readline(), I need a much uglier loop.  I cannot find a
 better one than this:
 
 while 1:
     s = sys.stdin.readline()
     if not s: break
     print '', s ,
 
 And also, for s in f works on any iterator f -- so I have to choose
 between two evils: an ugly, non-idiomatic and limiting loop, or one
 which works well until it is used interactively.
 
 Is there a way around this?  Or are the savings in execution time or
 I/O so large that everyone is willing to tolerate this bug?

You can use ``for line in lines:`` and pass ``iter(sys.stdin.readline,
'')`` as iterable for `lines`.

Ciao,
Marc 'BlackJack' Rintsch


Re: standard input, for s in f, and buffering

2008-03-31 Thread Jorgen Grahn
On 31 Mar 2008 06:54:29 GMT, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote:
 On Sun, 30 Mar 2008 21:02:44 +, Jorgen Grahn wrote:

 I realize this has to do with the extra read-ahead buffering documented for
 file.next() and that I can work around it by using file.readline()
 instead.
 
 The problem is, for s in f is the elegant way of reading files line
 by line. With readline(), I need a much uglier loop.  I cannot find a
 better one than this:
 
 while 1:
     s = sys.stdin.readline()
     if not s: break
     print '', s ,
 
 And also, for s in f works on any iterator f -- so I have to choose
 between two evils: an ugly, non-idiomatic and limiting loop, or one
 which works well until it is used interactively.
 
 Is there a way around this?  Or are the savings in execution time or
 I/O so large that everyone is willing to tolerate this bug?

 You can use ``for line in lines:`` and pass ``iter(sys.stdin.readline,'')``
 as iterable for `lines`.

Thanks.  I wasn't aware that building an iterator was that easy. The
tiny example program then becomes

#!/usr/bin/env python
import sys

f = iter(sys.stdin.readline, '')
for s in f:
    print '', s ,

It is still not the elegant interface I'd prefer, though. Maybe I do
prefer handling file-like objects to handling iterators, after all.
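
One way to get the file-like interface back might be a thin wrapper around
the file. This is only a sketch and not something posted in the thread: the
LineReader name is hypothetical, and the code is written for Python 3 rather
than the Python 2.4 used above. It iterates through readline() and passes
every other attribute on to the wrapped object.

```python
import io


class LineReader:
    """Hypothetical wrapper: iterate a file-like object via readline()."""

    def __init__(self, f):
        self._f = f

    def __iter__(self):
        # iter(callable, sentinel) keeps calling readline() until it
        # returns the empty string, which signals end of file.
        return iter(self._f.readline, '')

    def __getattr__(self, name):
        # Anything not defined here (read, close, fileno, ...) is
        # delegated to the wrapped file object.
        return getattr(self._f, name)


# Demonstration on an in-memory file; sys.stdin could be wrapped the same way.
demo = LineReader(io.StringIO("one\ntwo\n"))
collected = [s for s in demo]
```

With this, "for s in LineReader(sys.stdin):" reads like the original loop,
and the object can still be handed to code that expects readline() or read().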

By the way, I timed the three solutions given so far using 5 million
lines of standard input.  It went like this:

  for s in file       :  1
  iter(readline, '')  :  1.30  (i.e. 30% worse than for s in file)
  while 1             :  1.45  (i.e. 45% worse than for s in file)
  Perl while(<>)      :  0.65

I suspect most of the slowdown comes from the interpreter having to
execute more user code, not from lack of extra heavy input buffering.
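
A comparison of this kind could be repeated roughly as sketched here. This is
not the script behind the figures above: it is Python 3, it reads from an
in-memory io.StringIO instead of real standard input, and the input size and
function names are arbitrary choices. It only compares the per-line overhead
of the three loop styles relative to plain iteration.

```python
import io
import timeit

# Assumed test input: 100000 short lines held in memory.
DATA = "x\n" * 100000


def for_in_file():
    n = 0
    for s in io.StringIO(DATA):
        n += 1
    return n


def iter_readline():
    f = io.StringIO(DATA)
    n = 0
    for s in iter(f.readline, ''):
        n += 1
    return n


def while_loop():
    f = io.StringIO(DATA)
    n = 0
    while 1:
        s = f.readline()
        if not s:
            break
        n += 1
    return n


# Report each variant's time relative to "for s in file".
baseline = timeit.timeit(for_in_file, number=5)
for fn in (for_in_file, iter_readline, while_loop):
    t = timeit.timeit(fn, number=5)
    print("%-14s %.2f" % (fn.__name__, t / baseline))
```

The ratios will differ between interpreter versions and machines, so they
should not be expected to match the numbers in the table.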

/Jorgen

-- 
  // Jorgen Grahn grahn@Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se  R'lyeh wgah'nagl fhtagn!


Re: standard input, for s in f, and buffering

2008-03-31 Thread Paddy
On Mar 31, 11:47 pm, Jorgen Grahn [EMAIL PROTECTED] wrote:
 On 31 Mar 2008 06:54:29 GMT, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] 
 wrote:



  On Sun, 30 Mar 2008 21:02:44 +, Jorgen Grahn wrote:

  I realize this has to do with the extra read-ahead buffering documented for
  file.next() and that I can work around it by using file.readline()
  instead.

  The problem is, for s in f is the elegant way of reading files line
  by line. With readline(), I need a much uglier loop.  I cannot find a
  better one than this:

  while 1:
      s = sys.stdin.readline()
      if not s: break
      print '', s ,

  And also, for s in f works on any iterator f -- so I have to choose
  between two evils: an ugly, non-idiomatic and limiting loop, or one
  which works well until it is used interactively.

  Is there a way around this?  Or are the savings in execution time or
  I/O so large that everyone is willing to tolerate this bug?

  You can use ``for line in lines:`` and pass ``iter(sys.stdin.readline,'')``
  as iterable for `lines`.

 Thanks.  I wasn't aware that building an iterator was that easy. The
 tiny example program then becomes

 #!/usr/bin/env python
 import sys

 f = iter(sys.stdin.readline, '')
 for s in f:
     print '', s ,

 It is still not the elegant interface I'd prefer, though. Maybe I do
 prefer handling file-like objects to handling iterators, after all.

 By the way, I timed the three solutions given so far using 5 million
 lines of standard input.  It went like this:

   for s in file       :  1
   iter(readline, '')  :  1.30  (i.e. 30% worse than for s in file)
   while 1             :  1.45  (i.e. 45% worse than for s in file)
   Perl while(<>)      :  0.65

 I suspect most of the slowdown comes from the interpreter having to
 execute more user code, not from lack of extra heavy input buffering.

 /Jorgen

 --
   // Jorgen Grahn grahn@Ph'nglui mglw'nafh Cthulhu
 \X/ snipabacken.se  R'lyeh wgah'nagl fhtagn!

Hi Juergen,
From the python manpage:
   -u     Force stdin, stdout and stderr to be totally unbuffered.
          On systems where it matters, also put stdin, stdout and
          stderr in binary mode.  Note that there is internal
          buffering in xreadlines(), readlines() and file-object
          iterators (for line in sys.stdin) which is not influenced
          by this option.  To work around this, you will want to use
          sys.stdin.readline() inside a while 1: loop.
Maybe try adding the python -u option?

Buffering is supposed to help when processing large amounts of I/O,
but gives the 'many lines in before any output' that you saw
originally. If the program is to be mainly used to handle millions of
lines from a pipe or file, then why not leave the buffering in?
If you need both interactive and batch friendly I/O modes you might
need to add the ability to switch between two modes for your program.
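
A rough sketch of such a switch, assuming the choice is made by asking the
input whether it is a terminal. The lines() helper and its interactive
parameter are made-up names for illustration, and the code is Python 3:

```python
import io


def lines(f, interactive=None):
    """Return a line iterator for f, choosing the reading style by mode."""
    if interactive is None:
        # Guess the mode: a terminal means someone is typing at us.
        isatty = getattr(f, "isatty", None)
        interactive = bool(isatty and isatty())
    if interactive:
        # One readline() call per line, so each line is handed over
        # as soon as it has been entered.
        return iter(f.readline, '')
    # Batch mode: use the file object's own (buffered) iterator.
    return iter(f)


# Both modes yield the same lines; only the reading strategy differs.
batch = list(lines(io.StringIO("a\nb\n")))
typed = list(lines(io.StringIO("a\nb\n"), interactive=True))
```

A program would call lines(sys.stdin) and could expose the interactive
argument as a command-line flag to override the guess.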

- Paddy.


standard input, for s in f, and buffering

2008-03-30 Thread Jorgen Grahn
One thing that has annoyed me for quite some time.  I apologize if it
has been discussed recently. If I run this program on Unix (Python
2.4.4, on Debian Linux)

import sys
for s in sys.stdin:
    print '', s ,

and type the input on the keyboard rather than piping a file into it,
two annoying things happen:

- I don't see any output until I have entered a lot of input
  (approximately 8k). I expect pure Unix filters like this to process
  a line immediately -- that is what cat, grep and other utilities do,
  and also what Perl's while(<>) { ... } construct does.

- I have to type the EOF character *twice* to stop the program. This
  is also highly unusual.

If I saw this behavior in a program, as a long-time Unix user, I'd
call it a bug.

I realize this has to do with the extra read-ahead buffering documented for
file.next() and that I can work around it by using file.readline()
instead.

The problem is, for s in f is the elegant way of reading files line
by line. With readline(), I need a much uglier loop.  I cannot find a
better one than this:

while 1:
    s = sys.stdin.readline()
    if not s: break
    print '', s ,

And also, for s in f works on any iterator f -- so I have to choose
between two evils: an ugly, non-idiomatic and limiting loop, or one
which works well until it is used interactively.

Is there a way around this?  Or are the savings in execution time or
I/O so large that everyone is willing to tolerate this bug?

BR,
/Jorgen

-- 
  // Jorgen Grahn grahn@Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se  R'lyeh wgah'nagl fhtagn!