what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-19 Thread Ross Reyes
Hi -
Sorry if this is too simple a question, but I googled and also checked my
reference, O'Reilly's Learning Python, and did not find a satisfactory
answer.

When I use readlines, what happens if the number of lines is huge? I have
a very big file (4GB) I want to read in, but I'm sure there must be some
limitation to readlines and I'd like to know how it is handled by Python.
I am using it like this:

slines = infile.readlines()  # reads all lines into a list of strings called "slines"

Thanks to anyone who knows the answer to this one.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-19 Thread [EMAIL PROTECTED]
Newer Pythons should use "for x in fh:"; according to the docs:

fh = open("your file")
for x in fh:
    print x

this reads only one line at a time.
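
For example, a minimal sketch (the filename "huge.log" is made up) that
counts the lines of a multi-GB file while holding only one line in memory
at a time:

count = 0
fh = open("huge.log")
for x in fh:
    count += 1
fh.close()
print count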

Ross Reyes wrote:
> When I use readlines, what happens if the number of lines is huge? I have
> a very big file (4GB) I want to read in, but I'm sure there must be some
> limitation to readlines and I'd like to know how it is handled by Python.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-19 Thread Ben Finney
Ross Reyes <[EMAIL PROTECTED]> wrote:
> Sorry if this is too simple a question, but I googled and also checked
> my reference, O'Reilly's Learning Python, and did not find a
> satisfactory answer.

The Python documentation is online, and it's good to get familiar with
it:

<http://docs.python.org/>

It's even possible to tell Google to search only that site with
"site:docs.python.org" as a search term.

> When I use readlines, what happens if the number of lines is huge?
> I have a very big file (4GB) I want to read in, but I'm sure there
> must be some limitation to readlines and I'd like to know how it is
> handled by Python.

The documentation on methods of the 'file' type describes the
'readlines' method, and addresses this concern.

<http://docs.python.org/lib/bltin-file-objects.html#l2h-244>
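
In short, readlines() with no argument builds the entire list in memory,
but it also takes an optional sizehint argument that reads roughly that
many bytes' worth of complete lines per call. A sketch of chunked reading
along those lines (the filename is made up):

fp = file("big.dat", "r")
while True:
    lines = fp.readlines(100000)  # about 100 KB of whole lines per batch
    if not lines:
        break
    for line in lines:
        pass  # process each line here
fp.close()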

-- 
 \ "If you're not part of the solution, you're part of the |
  `\   precipitate."  -- Steven Wright |
_o__)  |
Ben Finney
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-19 Thread MrJean1
Just try it, it is not that hard ... ;-)

/Jean Brouwers

PS) Here is what happens on Linux:

  $ limit vmemory 1
  $ python
  ...
  >>> s = file("some_big_file").readlines()  # any big file will do
  Traceback (most recent call last):
    File "<stdin>", line 1, in ?
  MemoryError
  >>>
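
The same failure can be provoked from inside Python on most Unix systems
with the standard resource module -- a sketch, assuming a Unix box:

  >>> import resource
  >>> lim = 50 * 1024 * 1024  # cap the address space at ~50 MB
  >>> resource.setrlimit(resource.RLIMIT_AS, (lim, lim))
  >>> s = file("some_big_file").readlines()  # now dies with MemoryError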

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-19 Thread Xiao Jianfeng
[EMAIL PROTECTED] wrote:

> Newer Pythons should use "for x in fh:"; according to the docs:
>
> fh = open("your file")
> for x in fh:
>     print x
>
> this reads only one line at a time.

 I have some other questions:

 When will "fh" be closed?

 And what should I do if I want to explicitly close the file immediately
 after reading all the data I want?


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-19 Thread Steven D'Aprano
On Sun, 20 Nov 2005 11:05:53 +0800, Xiao Jianfeng wrote:

>  I have some other questions:
> 
>  When will "fh" be closed?

When all references to the file are no longer in scope:

def handle_file(name):
    fp = file(name, "r")
    # reference to file now in scope
    do_stuff(fp)
    return fp


f = handle_file("myfile.txt")
# reference to file is now in scope
f = None
# reference to file is no longer in scope

At this point, Python *may* close the file. CPython currently closes the
file as soon as all references are out of scope. Jython does not -- it
will close the file eventually, but you can't guarantee when.

>  And what should I do if I want to explicitly close the file immediately 
> after reading all the data I want?

That is the best practice.

f.close()
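
If you want the close to happen deterministically on any implementation,
not just CPython, the usual idiom is try/finally -- a sketch, with
do_stuff standing in for whatever processing you need:

fp = file("myfile.txt", "r")
try:
    do_stuff(fp)
finally:
    fp.close()  # runs whether or not do_stuff raised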


-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-19 Thread Xiao Jianfeng
Steven D'Aprano wrote:

> When all references to the file are no longer in scope. [...] CPython
> currently closes the file as soon as all references are out of scope.
> Jython does not -- it will close the file eventually, but you can't
> guarantee when.
>
> That is the best practice.
>
> f.close()

 Let me first introduce the problem I came across last night.

 I need to read a file (which may be small or very big) and check it line
 by line to find a specific token; the data on the next line is what I
 want.

 If I use readlines(), it will be a problem when the file is too big.

 If I use "for line in OPENED_FILE:" to read one line at a time, how can
 I get the next line when I find the specific token?
 And I think reading one line at a time is less efficient, am I right?

 
 Regards,

 xiaojf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-19 Thread Steve Holden
Xiao Jianfeng wrote:
>  Let me first introduce the problem I came across last night.
>
>  I need to read a file (which may be small or very big) and check it line
>  by line to find a specific token; the data on the next line is what I
>  want.
>
>  If I use readlines(), it will be a problem when the file is too big.
>
>  If I use "for line in OPENED_FILE:" to read one line at a time, how can
>  I get the next line when I find the specific token?
>  And I think reading one line at a time is less efficient, am I right?

Not necessarily. Try this:

import sys

f = file("filename.txt")
for line in f:
    if token in line: # or whatever you need to identify it
        break
else:
    sys.exit("File does not contain token")
line = f.next()
Then line will be the one you want. Since this will use code written in 
C to do the processing you will probably be pleasantly surprised by its 
speed. Only if this isn't fast enough should you consider anything more 
complicated.

Premature optimization can waste huge amounts of programming time. Don't 
do it. First try measuring a solution that works!

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006  www.python.org/pycon/

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-19 Thread Steven D'Aprano
On Sun, 20 Nov 2005 12:28:07 +0800, Xiao Jianfeng wrote:

>  Let me first introduce the problem I came across last night.
> 
>  I need to read a file (which may be small or very big) and check it line
>  by line to find a specific token; the data on the next line is what I
>  want.
>  
>  If I use readlines(), it will be a problem when the file is too big.
> 
>  If I use "for line in OPENED_FILE:" to read one line at a time, how can
>  I get the next line when I find the specific token?

Here is one solution using a flag:

done = False
for line in file("myfile", "r"):
    if done:
        break
    done = line == "token\n"  # note the newline
# we expect Python to close the file when we exit the loop
if done:
    DoSomethingWith(line)  # the line *after* the one with the token
else:
    print "Token not found!"


Here is another solution, without using a flag:

def get_line(filename, token):
    """Returns the next line following a token, or None if not found.
    Leading and trailing whitespace is ignored when looking for
    the token.
    """
    fp = file(filename, "r")
    result = None
    for line in fp:
        if line.strip() == token:
            result = fp.readline()  # read the next line only
            break
    else:
        # runs only if we didn't break
        print "Token not found"
    fp.close()
    return result


Here is a third solution that raises an exception instead of printing an
error message:

def get_line(filename, token):
    fp = file(filename, "r")
    for line in fp:
        if line.strip() == token:
            break
    else:
        raise ValueError("Token not found")
    # we rely on Python to close the file when we are done
    return fp.readline()



>  And I think reading one line at a time is less efficient, am I right?

Less efficient than what? Spending hours or days writing more complex code
that only saves you a few seconds, or even runs slower?

I believe Python will take advantage of your file system's buffering
capabilities. Try it and see; you'll be surprised how fast it runs. If
you try it and it is too slow, then come back and we'll see what can be
done to speed it up. But don't try to speed it up before you know whether
it is fast enough.


-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-19 Thread Steven D'Aprano
On Sun, 20 Nov 2005 16:10:58 +1100, Steven D'Aprano wrote:

> def get_line(filename, token):
>     """Returns the next line following a token, or None if not found.
>     Leading and trailing whitespace is ignored when looking for
>     the token.
>     """
>     fp = file(filename, "r")
>     result = None
>     for line in fp:
>         if line.strip() == token:
>             result = fp.readline()  # read the next line only
>             break
>     else:
>         # runs only if we didn't break
>         print "Token not found"
>     fp.close()
>     return result

Correction: checking the Library Reference, I find that this is
wrong. The reason is that file objects implement their own read-ahead
buffer, and mixing calls to next() and readline() may not work right.

See http://docs.python.org/lib/bltin-file-objects.html

Replace the fp.readline() with fp.next() and all should be good.
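
For clarity, here is the whole function with that correction applied (a
sketch only; the behaviour is otherwise the same):

def get_line(filename, token):
    """Returns the next line following a token, or None if not found.
    Leading and trailing whitespace is ignored when looking for
    the token.
    """
    fp = file(filename, "r")
    result = None
    for line in fp:
        if line.strip() == token:
            try:
                result = fp.next()  # stay with the iterator protocol
            except StopIteration:
                pass  # token was on the very last line; result stays None
            break
    else:
        # runs only if we didn't break
        print "Token not found"
    fp.close()
    return result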


-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-20 Thread Xiao Jianfeng
Steve Holden wrote:

> Not necessarily. Try this:
>
> import sys
>
> f = file("filename.txt")
> for line in f:
>     if token in line: # or whatever you need to identify it
>         break
> else:
>     sys.exit("File does not contain token")
> line = f.next()
>
> Then line will be the one you want. Since this will use code written in
> C to do the processing you will probably be pleasantly surprised by its
> speed. Only if this isn't fast enough should you consider anything more
> complicated.
>
> Premature optimization can waste huge amounts of programming time. Don't
> do it. First try measuring a solution that works!

  Oh yes, thanks.
  First, I must say thanks to all of you. And I'm really sorry that I
  didn't describe my problem clearly.

  There are many tokens in the file; every time I find a token, I have
  to get the data on the next line and do some operation with it. It
  should be easy to find just one token using the above method, but
  there is more than one.

  My method was:

  f_in = open('input_file', 'r')
  data_all = f_in.readlines()
  f_in.close()

  for i in range(len(data_all)):
      line = data_all[i]
      if token in line:
          pass  # do something with data_all[i + 1]

  Since my method needs to read the whole file into memory, I think it
  may not be efficient when processing very big files.

  I really appreciate all suggestions! Thanks again.

  Regrads,

  xiaojf

 

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-20 Thread [EMAIL PROTECTED]

Xiao Jianfeng wrote:
>   There are many tokens in the file; every time I find a token, I have
>   to get the data on the next line and do some operation with it.
>
>   Since my method needs to read the whole file into memory, I think it
>   may not be efficient when processing very big files.
>
Something like this:

for x in fh:
    if has_token(x):
        process(fh.next())

You can also create an iterator with iter(fh), but I don't think that is
necessary.

This uses the iterator's "side effect" to your advantage. I was bitten
by that side effect before, but for your particular app it becomes an
advantage.
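
A more complete sketch (has_token and the "TOKEN" string are made-up
placeholders; adapt them to your file format):

def has_token(line):
    return "TOKEN" in line  # hypothetical test

fh = open("input_file")
for x in fh:
    if has_token(x):
        try:
            data = fh.next()  # advances the same iterator, skipping a line
        except StopIteration:
            break  # token was on the last line; no data line follows
        print data.strip()  # stand-in for your real processing
fh.close()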

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-20 Thread Xiao Jianfeng
[EMAIL PROTECTED] wrote:

> Something like this:
>
> for x in fh:
>     if has_token(x):
>         process(fh.next())

  Thanks to all of you!

  I have compared the two methods:
  (1) "for x in fh:"
  (2) reading the whole file into memory first.

  I tested the two methods on two files, one of 80M and the other of
  815M. The first method gained a speedup of about 40% on the first
  file and a speedup of about 25% on the second file.
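
  A sketch of the kind of timing harness that reproduces such a
  comparison (the filename is made up):

  import time

  def scan_iter(name):
      f = open(name)
      for line in f:
          pass  # stand-in for the real per-line work
      f.close()

  def scan_readlines(name):
      f = open(name)
      for line in f.readlines():
          pass
      f.close()

  for fn in (scan_iter, scan_readlines):
      start = time.time()
      fn("testfile_80M")
      print fn.__name__, time.time() - start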

  Sorry for my bad English, and I hope I haven't made people confused.

  Regards,

  xiaojf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-20 Thread [EMAIL PROTECTED]

Xiao Jianfeng wrote:
>   I have compared the two methods:
>   (1) "for x in fh:"
>   (2) reading the whole file into memory first.
>
>   The first method gained a speedup of about 40% on the first file
>   and a speedup of about 25% on the second file.

So is the problem solved?

Putting the buffering implementation aside, (1) is the way to go, as it
runs through the content only once.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what happens when the file being read is too big for all lines to be read with "readlines()"

2005-11-20 Thread Xiao Jianfeng
[EMAIL PROTECTED] wrote:

> So is the problem solved?

  Yes, thank you.

> Putting the buffering implementation aside, (1) is the way to go, as it
> runs through the content only once.

  I think so :-)



  Regards,

  xiaojf

-- 
http://mail.python.org/mailman/listinfo/python-list