what happens when the file begin read is too big for all lines to be read with "readlines()"
Hi,

Sorry if this is too simple a question, but I googled and also checked my reference, the O'Reilly Learning Python book, and I did not find a satisfactory answer.

When I use readlines(), what happens if the number of lines is huge? I have a very big file (4GB) I want to read in, but I'm sure there must be some limitation to readlines() and I'd like to know how Python handles it.

I am using it like this:

    slines = infile.readlines()  # reads all lines into a list of strings called "slines"

Thanks to anyone who knows the answer to this one.

-- http://mail.python.org/mailman/listinfo/python-list
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
Newer versions of Python should use "for x in fh:", according to the docs:

    fh = open("your file")
    for x in fh:
        print x

which only reads one line at a time.

Ross Reyes wrote:
> When I use readlines, what happens if the number of lines is huge? I have
> a very big file (4GB) I want to read in, but I'm sure there must be some
> limitation to readlines and I'd like to know how it is handled by python.
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
Ross Reyes <[EMAIL PROTECTED]> wrote:
> Sorry for maybe a too simple a question but I googled and also
> checked my reference O'Reilly Learning Python book and I did not
> find a satisfactory answer.

The Python documentation is online, and it's good to get familiar with it:

    http://docs.python.org/

It's even possible to tell Google to search only that site with "site:docs.python.org" as a search term.

> When I use readlines, what happens if the number of lines is huge?
> I have a very big file (4GB) I want to read in, but I'm sure there
> must be some limitation to readlines and I'd like to know how it is
> handled by python.

The documentation on methods of the 'file' type describes the 'readlines' method, and addresses this concern.

    http://docs.python.org/lib/bltin-file-objects.html#l2h-244

--
"If you're not part of the solution, you're part of the precipitate." -- Steven Wright
Ben Finney
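For readers on current Python, readlines() still accepts an optional size hint that bounds how much is read per call. The following is only a sketch under stated assumptions: it targets Python 3 (which this thread predates), and both the helper name `iter_line_batches` and the 1 MiB default are made up for illustration.

```python
def iter_line_batches(path, hint=1 << 20):
    """Yield lists of lines, reading roughly `hint` characters per batch.

    The function name and the default hint are illustrative assumptions,
    not something taken from the thread.
    """
    with open(path, "r") as infile:
        while True:
            # readlines(hint) stops adding lines once the total size of the
            # lines read so far exceeds the hint, so memory use stays bounded.
            batch = infile.readlines(hint)
            if not batch:  # an empty list means end of file
                break
            yield batch
```

Each batch is an ordinary list of strings, so code written for the result of readlines() can be applied to one batch at a time instead of to the whole file.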
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
Just try it, it is not that hard ... ;-)

/Jean Brouwers

PS) Here is what happens on Linux:

    $ limit vmemory 1
    $ python
    ...
    >>> s = file().readlines()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    MemoryError
    >>>
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
[EMAIL PROTECTED] wrote:
> newer python should use "for x in fh:", according to the doc :
>
> fh = open("your file")
> for x in fh: print x
>
> which would only read one line at a time.

I have some other questions:

When will "fh" be closed?

And what should I do if I want to explicitly close the file immediately after reading all the data I want?
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
On Sun, 20 Nov 2005 11:05:53 +0800, Xiao Jianfeng wrote:

> I have some other questions:
>
> when "fh" will be closed?

When all references to the file are no longer in scope:

    def handle_file(name):
        fp = file(name, "r")
        # reference to file now in scope
        do_stuff(fp)
        return fp

    f = handle_file("myfile.txt")
    # reference to file is now in scope
    f = None
    # reference to file is no longer in scope

At this point, Python *may* close the file. CPython currently closes the file as soon as all references are out of scope. JPython does not -- it will close the file eventually, but you can't guarantee when.

> And what should I do if I want to explicitly close the file immediately
> after reading all data I want?

That is the best practice.

    f.close()

--
Steven.
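Python versions released after this thread added the "with" statement, which closes the file as soon as the block is left and so does not depend on when the last reference goes away. A minimal sketch, assuming Python 3; `count_lines` is a hypothetical helper used only to illustrate the pattern.

```python
def count_lines(path):
    """Count lines one at a time; the file is closed when the block exits."""
    with open(path, "r") as fh:
        total = 0
        for _ in fh:  # lazy iteration, one line at a time
            total += 1
    # The with block has already closed the file here, and it would have
    # done so even if the loop had raised an exception.
    assert fh.closed
    return total
```

This gives the same effect as calling f.close() by hand, without having to remember the call on every exit path.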
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
Steven D'Aprano wrote:
> At this point, Python *may* close the file. CPython currently closes the
> file as soon as all references are out of scope. JPython does not -- it
> will close the file eventually, but you can't guarantee when.
[...]
> That is the best practice.
>
> f.close()

Let me first introduce the problem I came across last night.

I need to read a file (which may be small or very big) and check it line by line to find a specific token; the data on the next line is then what I want.

If I use readlines(), it will be a problem when the file is too big.

If I use "for line in OPENED_FILE:" to read one line at a time, how can I get the next line when I find the specific token? And I think reading one line at a time is less efficient, am I right?

Regards,

xiaojf
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
Xiao Jianfeng wrote:
> If I use "for line in OPENED_FILE:" to read one line each time, how can
> I get the next line when I find the specific token?
> And I think reading one line each time is less efficient, am I right?

Not necessarily. Try this:

    import sys

    f = file("filename.txt")
    for line in f:
        if token in line:  # or whatever you need to identify it
            break
    else:
        sys.exit("File does not contain token")
    line = f.next()

Then line will be the one you want. Since this will use code written in C to do the processing you will probably be pleasantly surprised by its speed. Only if this isn't fast enough should you consider anything more complicated.

Premature optimization can waste huge amounts of programming time. Don't do it. First try measuring a solution that works!

regards
 Steve
--
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC                     www.holdenweb.com
PyCon TX 2006                  www.python.org/pycon/
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
On Sun, 20 Nov 2005 12:28:07 +0800, Xiao Jianfeng wrote:

> If I use "for line in OPENED_FILE:" to read one line each time, how can
> I get the next line when I find the specific token?

Here is one solution using a flag:

    done = False
    for line in file("myfile", "r"):
        if done:
            break
        done = line == "token\n"  # note the newline
    # we expect Python to close the file when we exit the loop
    # (caveat: if the token is on the very last line, "line" is still
    # the token line at this point, not a line after it)
    if done:
        DoSomethingWith(line)  # the line *after* the one with the token
    else:
        print "Token not found!"

Here is another solution, without using a flag:

    def get_line(filename, token):
        """Returns the next line following a token, or None if not found.
        Leading and trailing whitespace is ignored when looking for
        the token.
        """
        fp = file(filename, "r")
        for line in fp:
            if line.strip() == token:
                break
        else:
            # runs only if we didn't break
            print "Token not found"
            fp.close()
            return None
        result = fp.readline()  # read the next line only
        fp.close()
        return result

Here is a third solution that raises an exception instead of printing an error message:

    def get_line(filename, token):
        fp = file(filename, "r")
        for line in fp:
            if line.strip() == token:
                break
        else:
            raise ValueError("Token not found")
        return fp.readline()
        # we rely on Python to close the file when we are done

> And I think reading one line each time is less efficient, am I right?

Less efficient than what? Spending hours or days writing more complex code that only saves you a few seconds, or even runs slower?

I believe Python will take advantage of your file system's buffering capabilities. Try it and see, you'll be surprised how fast it runs. If you try it and it is too slow, then come back and we'll see what can be done to speed it up. But don't try to speed it up before you know if it is fast enough.

--
Steven.
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
On Sun, 20 Nov 2005 16:10:58 +1100, Steven D'Aprano wrote:

> def get_line(filename, token):
[...]
>     result = fp.readline()  # read the next line only
>     fp.close()
>     return result

Correction: checking the Library Reference, I find that this is wrong. The reason is that file objects implement their own read-ahead buffer, and mixing calls to next() and readline() may not work right.

See http://docs.python.org/lib/bltin-file-objects.html

Replace the fp.readline() with fp.next() and all should be good.

--
Steven.
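The corrected idea, using only the iterator protocol and never readline(), might look as follows on a current interpreter. This is a sketch assuming Python 3, where the built-in next() replaces the fp.next() method used in the thread; returning None when the token is on the last line is a choice made here, not something the thread specifies.

```python
def get_line(filename, token):
    """Return the line after the first line equal to `token`, else None.

    Whitespace around each line is ignored when comparing. Only the
    iterator protocol is used, so iteration and readline() are not mixed.
    """
    with open(filename, "r") as fp:
        for line in fp:
            if line.strip() == token:
                # next() with a default avoids StopIteration when the
                # token happens to be on the last line of the file.
                return next(fp, None)
    return None
```

Because the loop and the next() call share the same file iterator, the line returned is always the one directly after the matching line.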
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
Steve Holden wrote:
> Not necessarily. Try this:
[...]
> Premature optimizations can waste huge amounts of unnecessary
> programming time. Don't do it. First try measuring a solution that works!

Oh yes, thanks.

First, I must say thanks to all of you. And I'm really sorry that I didn't describe my problem clearly.

There are many tokens in the file. Every time I find a token, I have to get the data on the next line and do some operation with it. It would be easy for me to find just one token using the above method, but there is more than one.

My method was:

    f_in = open('input_file', 'r')
    data_all = f_in.readlines()
    f_in.close()

    for i in range(len(data_all)):
        line = data_all[i]
        if token in line:
            # do something with data_all[i + 1]
            pass

Since my method needs to read the whole file into memory, I think it may not be efficient when processing a very big file.

I really appreciate all suggestions! Thanks again.

Regards,

xiaojf
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
Xiao Jianfeng wrote:
> There are many tokens in the file, every time I find a token, I have
> to get the data on the next line and do some operation with it.
[...]
> Since my method needs to read all the file into memory, I think it
> may be not efficient when processing very big file.

Something like this:

    for x in fh:
        if not has_token(x): continue
        else: process(fh.next())

You can also create an iterator with iter(fh), but I don't think that is necessary.

This uses the iterator's "side effect" to your advantage: calling next() inside the loop consumes a line, so the loop itself never sees it. I have been bitten by that side effect before, but for your particular application it becomes an advantage.
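The same shared-iterator pattern can be wrapped in a generator so that it handles any number of tokens. A sketch assuming Python 3; `lines_after_tokens` is a hypothetical name, and `has_token` stands for whatever predicate identifies a token line, as in the message above.

```python
def lines_after_tokens(lines, has_token):
    """Yield each line that directly follows a line matching `has_token`."""
    it = iter(lines)  # one shared iterator for the loop and for next()
    for line in it:
        if has_token(line):
            following = next(it, None)  # consumes the line after the token
            if following is None:       # token was on the last line
                return
            yield following
```

It can be driven by an open file, for example `for data in lines_after_tokens(fh, has_token): process(data)`. Note that a line consumed as "the next line" is never itself tested for a token, which matches the behaviour of the loop in the message above.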
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
[EMAIL PROTECTED] wrote:
> something like this :
>
> for x in fh:
>     if not has_token(x): continue
>     else: process(fh.next())

Thanks, all of you!

I have compared the two methods:

(1) "for x in fh:"
(2) reading the whole file into memory first.

I tested both methods on two files, one of 80M and the other of 815M. The first method gained a speedup of about 40% for the first file, and a speedup of about 25% for the second file.

Sorry for my bad English, and I hope I haven't confused people.

Regards,

xiaojf
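A comparison like the one reported above could be reproduced with a small harness along these lines. This is a sketch assuming Python 3; the function names are hypothetical, the code actually used for the measurement is not shown in the thread, and any timings depend entirely on the machine and the file.

```python
import time


def scan_streaming(path, token):
    """Method (1): iterate over the file object, one line at a time."""
    hits = []
    with open(path, "r") as fh:
        for line in fh:
            if token in line:
                following = next(fh, None)  # the line after the token
                if following is not None:
                    hits.append(following)
    return hits


def scan_readlines(path, token):
    """Method (2): load every line into memory first, then index into it."""
    with open(path, "r") as fh:
        data_all = fh.readlines()
    return [data_all[i + 1]
            for i in range(len(data_all) - 1)
            if token in data_all[i]]


def timed(func, *args):
    """Return (elapsed_seconds, result) for a single call to func."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result
```

The two scans agree as long as token lines are never adjacent; if they can be, method (1) skips the second token line because it has already been consumed as data.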
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
Xiao Jianfeng wrote:
> I have compared the two methods,
> (1). "for x in fh:"
> (2). read all the file into memory firstly.
>
> I have tested the two methods on two files, one is 80M and the second
> one is 815M.
> The first method gained a speedup of about 40% for the first file, and
> a speedup of about 25% for the second file.

So is the problem solved?

Putting the buffering implementation aside, (1) is the way to go, as it runs through the content only once.
Re: what happens when the file begin read is too big for all lines to be read with "readlines()"
[EMAIL PROTECTED] wrote:
> So is the problem solved ?

Yes, thank you.

> Putting buffering implementation aside, (1) is the way to go as it runs
> through content only once.

I think so :-)

Regards,

xiaojf