Re: [Tutor] Load Entire File into memory
On Mon, Nov 4, 2013 at 11:26 AM, Amal Thomas wrote:

> @Dave: thanks.. By the way I am running my codes on a server with
> about 100GB ram but I can't afford my code to use 4-5 times the size
> of the text file. Now I am using read() / readlines(); these seem to
> be more efficient in memory usage than io.StringIO(f.read()).

f.read() creates a string to initialize a StringIO object. You could
instead initialize a BytesIO object with a mapped file; that should cut
the peak RSS down by half. If you need decoded text, add a TextIOWrapper.

    import io
    import mmap

    with open('output.txt') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mf:
            content = io.TextIOWrapper(io.BytesIO(mf))
            for line in content:
                'process line'

However, before you do something extreme (like say... loading a 50 GiB
file into RAM), try tweaking the TextIOWrapper object's readline() by
increasing _CHUNK_SIZE. This can be up to 2**63 - 1 in a 64-bit process.

    with open('output.txt') as content:
        content._CHUNK_SIZE = 65536
        for line in content:
            'process line'

Check content.buffer.tell() to confirm that the file pointer is
increasing in steps of the given chunk size.

Built-in open() also lets you set the "buffering" size for the
BufferedReader, content.buffer. However, in this case I don't think you
need to worry about it. content.readline() calls content.buffer.read1()
to read directly from the FileIO object, content.buffer.raw.
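A minimal sketch of that tell() check, assuming the same output.txt and
chunk size as above:

    with open('output.txt') as content:
        content._CHUNK_SIZE = 65536
        last = -1
        for line in content:
            pos = content.buffer.tell()  # position of the BufferedReader
            if pos != last:              # print only when a new chunk is read
                print("buffer position:", pos)
                last = pos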
Re: [Tutor] Load Entire File into memory
On Nov 5, 2013, at 11:12 AM, Alan Gauld wrote:

> On 05/11/13 02:02, Danny Yoo wrote:
>
>> To visualize the sheer scale of the problem, see:
>>
>> http://i.imgur.com/X1Hi1.gif
>>
>> which would normally be funny, except that it's not quite a joke. :P
>
> I think I'm missing something. All I see in Firefox is
> a vertical red bar. And in Chrome I don't even get that,
> just a blank screen...
>
> ???
>
> --
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> http://www.flickr.com/photos/alangauldphotos

It took me a while... If you put your cursor up in the extreme upper
left hand corner of that red bar, you get a + sign that allows you to
expand the image. In the expansion you will see text that explains the
graphical scales in question: a pixel (L1 cache), a short bar of pixels
(L2 cache), a longer bar (RAM), and finally that huge block of pixels
that represents disk latency.

-Bill
Re: [Tutor] Load Entire File into memory
On Tue, Nov 05, 2013 at 04:12:51PM, Alan Gauld wrote:

> On 05/11/13 02:02, Danny Yoo wrote:
>
>> To visualize the sheer scale of the problem, see:
>>
>> http://i.imgur.com/X1Hi1.gif
>>
>> which would normally be funny, except that it's not quite a joke. :P
>
> I think I'm missing something. All I see in Firefox is
> a vertical red bar. And in Chrome I don't even get that,
> just a blank screen...

Can't speak for Chrome; that sounds like a bug. But Firefox defaults to
scaling pictures to fit within the window. If you mouse over the image,
you'll get a magnifying glass pointer. Click on the image, and it will
redisplay at full size. Scroll to the very top, and you will find a
little bit of text, a single red pixel representing the latency of cache
memory, a dozen or so red pixels representing the latency of RAM, and a
monstrously huge block of thousands and thousands and thousands of red
pixels representing the latency of hard drives.

Feel free to scroll and count the pixels :-)

-- 
Steven
Re: [Tutor] Load Entire File into memory
On 05/11/13 02:02, Danny Yoo wrote:

> To visualize the sheer scale of the problem, see:
>
> http://i.imgur.com/X1Hi1.gif
>
> which would normally be funny, except that it's not quite a joke. :P

I think I'm missing something. All I see in Firefox is
a vertical red bar. And in Chrome I don't even get that,
just a blank screen...

???

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos
Re: [Tutor] Load Entire File into memory
On 5 November 2013 13:20, Amal Thomas wrote:
> On Mon, Nov 4, 2013 at 10:00 PM, Steven D'Aprano wrote:
>>
>> import os
>> filename = "YOUR FILE NAME HERE"
>> print("File size:", os.stat(filename).st_size)
>> f = open(filename)
>> content = f.read()
>> print("Length of content actually read:", len(content))
>> print("Current file position:", f.tell())
>> f.close()
>>
>> and send us the output.
>
> This is the output:
>     File size: 50297501884
>     Length of content actually read: 50297501884
>     Current file position: 50297501884
>
> This code used 61.4 GB RAM and 59.6 GB swap (I had ensured that no
> other important processes were running on my server before running
> this :D)

If you are using the swap then this will almost certainly be the biggest
slowdown in your program (and any other program on the same machine). If
there is some way to reorganise your code so that you don't need to load
everything into memory then (if you do it right) you should be able to
make it *much* faster.

Oscar
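For a line-oriented analysis, a single sequential pass keeps memory flat
no matter how big the file is. A minimal sketch of that shape (the
per-base counting here is a hypothetical stand-in for the real
analysis):

    from collections import Counter

    counts = Counter()
    with open('output.txt') as f:
        for line in f:  # streams the file; only one line in memory at a time
            counts.update(line.rstrip('\n'))
    print(counts)       # e.g. totals for each of A, U, G, C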
Re: [Tutor] Load Entire File into memory
On 4 November 2013 17:41, Amal Thomas wrote:

> @Steven: Thank you... My input data is basically AUGC and newlines...
> I would like to know about the bytearray technique. Please suggest me
> some links or references.. I will go through the profiler and check
> whether the code maintains linearity with the input files.

Amal, can you just give *some* explanation of what you're doing? I can
think of many possible ways to optimise it or to use less memory, but it
depends on what you're actually doing, and you really haven't given
enough information.

If you show a short data sample and a short piece of code that is
similar to what you are doing then people here would be in a better
position to help. Please read this: http://sscce.org/

You have repeatedly made claims such as "X is faster than Y", but you've
also said that you are a beginner to Python. Be aware that testing
speeds on a small file is completely different from testing on a large
file, and that testing speeds on a file that is already cached by the OS
is completely different from testing on one that is not. There are many
things that can make the difference.

Oscar
Re: [Tutor] Load Entire File into memory
On Mon, Nov 4, 2013 at 10:00 PM, Steven D'Aprano wrote:
>
> import os
> filename = "YOUR FILE NAME HERE"
> print("File size:", os.stat(filename).st_size)
> f = open(filename)
> content = f.read()
> print("Length of content actually read:", len(content))
> print("Current file position:", f.tell())
> f.close()
>
> and send us the output.

This is the output:

    File size: 50297501884
    Length of content actually read: 50297501884
    Current file position: 50297501884

This code used 61.4 GB RAM and 59.6 GB swap (I had ensured that no other
important processes were running on my server before running this :D)
Re: [Tutor] Load Entire File into memory
Amal Thomas, 04.11.2013 14:55:

> I have checked the execution time manually as well as I found it
> through my code. During execution of my code, at the start, I stored
> my initial time (start time) in a variable and at the end calculated
> the time taken to run the code = end time - start time. There was a
> significant difference in time.

You should make sure that there are no caching effects here. Your
operating system may have loaded the file into memory (assuming that you
have enough of that) after the first read and then served it from there
when you ran the second benchmark. So, make sure you measure the time
twice for both, preferably running both benchmarks in reverse order the
second time.

That being said, it's not impossible that f.readlines() is faster than
line-wise iteration, because it knows right from the start that it will
read the entire file, so it can optimise for it (I didn't check if it
actually does; this might have changed in Py3.3, for example).

Stefan
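A minimal sketch of that measurement discipline (the file name is an
assumption; run it on a smaller sample file, not the 50 GB one):

    import time

    def timed(label, fn):
        t0 = time.perf_counter()
        fn()
        print(label, time.perf_counter() - t0, "seconds")

    def read_all():
        with open('sample.txt') as f:
            f.readlines()

    def read_lines():
        with open('sample.txt') as f:
            for line in f:
                pass

    # Run each strategy twice, in both orders, so the OS page cache
    # warms equally; compare the second-pass numbers.
    for fn in (read_all, read_lines, read_lines, read_all):
        timed(fn.__name__, fn)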
Re: [Tutor] Load Entire File into memory
On Mon, Nov 04, 2013 at 06:02:47PM -0800, Danny Yoo wrote:

> To visualize the sheer scale of the problem, see:
>
> http://i.imgur.com/X1Hi1.gif
>
> which would normally be funny, except that it's not quite a joke. :P

Nice visualisation! Was that yours?

> So you want to minimize hard disk usage as much as possible.
> "Thrashing" is precisely the situation you do not want to have when
> running a large analysis.

Yes, thrashing is a disaster for performance. But I think that hard
drive latency fails to demonstrate just how big a disaster. True, hard
drive access time is about 100 times slower than RAM. But that's just a
constant scale factor. What really, really kills performance with
thrashing is that the memory manager is trying to move chunks of memory
around, and the overall "move smaller blocks of memory around to make
space for a really big block" algorithm ends up with quadratic (or
worse!) performance.

-- 
Steven
Re: [Tutor] Load Entire File into memory
I mostly agree with Alan, but a couple of little quibbles:

On Tue, Nov 05, 2013 at 01:10:39AM, ALAN GAULD wrote:

> > @Alan: Thanks.. I have checked both ways (reading line by line
> > without loading into RAM, and loading the entire file into RAM and
> > then reading line by line) for files of 2-3GB.
>
> OK, but 2-3G will nearly always live entirely in RAM on a modern
> computer.

Speak for yourself. Some of us are still using "modern computers" with
1-2 GB of RAM :-(

> > The only change which I have made is in the reading part; the rest
> > of the code was kept the same. There was a significant time
> > difference. Please note that I started this thread stating that when
> > I am using io.StringIO(f.read()) in the code it uses a memory of
> > almost 4-5 times the input file size. Now using read() or
> > readlines() it has reduced to 1.5 times...
>
> Yes, a raw string is always going to be more efficient in memory use
> than StringIO.

It depends what you're doing with it. The beauty of StringIO is that it
emulates an in-memory file, so you can modify it in place. String
objects are immutable and cannot be modified in place, so if you have to
make changes, you have to make a copy with the change. For large
strings, say, over 100MB, the overhead can get painful.

> > Also, as I have mentioned, I can't afford to run my code using 4-5
> > times the memory. The total resource available on my server is about
> > 180 GB of memory (approx 64 GB RAM + 128 GB swap).
>
> OK. There is a huge difference between having 100G of RAM and having
> 64G + 128G swap. Swap is basically disk, so if you are reading your
> data into memory and that memory is bouncing in and out of swap,
> things will slow down by an order of magnitude.

At least. Hard drive technology is more like two or even three orders of
magnitude slower than RAM access (100 or 1000 times slower), and
including the overhead of the memory manager moving things about, there
is no upper limit to how large the penalty can be. If you get away with
only 10 times slower, you're lucky. In my experience, 100-1000 times
slower is more common (although my experience is on machines with fairly
small amounts of RAM in the first place) and sometimes it is slow enough
that even the operating system stops responding. Plan to avoid using
swap space :-)

> You need to try to optimise to use real RAM and minimise use of swap.

Agreed.

-- 
Steven
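A small illustration of that in-place point (the data here is just a
made-up example):

    import io

    buf = io.StringIO("GAUC" * 10)
    buf.seek(4)
    buf.write("AAAA")        # overwrites 4 characters in place, no copy
    s = buf.getvalue()

    text = "GAUC" * 10
    # str is immutable: "editing" it builds an entirely new string
    text = text[:4] + "AAAA" + text[8:]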
Re: [Tutor] Load Entire File into memory
> You _must_ avoid swap at all costs here. You may not understand the
> point, so a little more explanation: touching swap is several orders
> of magnitude more expensive than anything else you are doing in your
> program.
>
> CPU operations are on the order of nanoseconds. (10^-9)
>
> Disk operations are on the order of milliseconds. (10^-3)
>
> References:
>
> http://en.wikipedia.org/wiki/Instructions_per_second
>
> http://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics

To visualize the sheer scale of the problem, see:

http://i.imgur.com/X1Hi1.gif

which would normally be funny, except that it's not quite a joke. :P

So you want to minimize hard disk usage as much as possible. "Thrashing"
is precisely the situation you do not want to have when running a large
analysis.
Re: [Tutor] Load Entire File into memory
> > Also, as I have mentioned, I can't afford to run my code using 4-5
> > times the memory. The total resource available on my server is about
> > 180 GB of memory (approx 64 GB RAM + 128 GB swap).
>
> OK. There is a huge difference between having 100G of RAM and having
> 64G + 128G swap. Swap is basically disk, so if you are reading your
> data into memory and that memory is bouncing in and out of swap,
> things will slow down by an order of magnitude. You need to try to
> optimise to use real RAM and minimise use of swap.

I concur with Alan, and want to state his point more forcefully. If you
are hitting swap, you are computationally DOOMED and must do something
different. You _must_ avoid swap at all costs here.

You may not understand the point, so a little more explanation: touching
swap is several orders of magnitude more expensive than anything else
you are doing in your program.

CPU operations are on the order of nanoseconds. (10^-9)

Disk operations are on the order of milliseconds. (10^-3)

References:

http://en.wikipedia.org/wiki/Instructions_per_second

http://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics

As soon as you start touching your swap space to simulate virtual
memory, you've lost the battle. We were trying not to leap to
conclusions till we knew more. Now we know more. If your system has much
less RAM than can fit your dataset at once, trying to read it all at
once on your single machine, into an in-memory buffer, is wrong.
Re: [Tutor] Load Entire File into memory
Forwarding to tutor list. Please use Reply All in responses.

> From: Amal Thomas
> To: Alan Gauld
> Sent: Monday, 4 November 2013, 17:26
> Subject: Re: [Tutor] Load Entire File into memory
>
> @Alan: Thanks.. I have checked both ways (reading line by line without
> loading into RAM, and loading the entire file into RAM and then
> reading line by line) for files of 2-3GB.

OK, but 2-3G will nearly always live entirely in RAM on a modern
computer.

> The only change which I have made is in the reading part; the rest of
> the code was kept the same. There was a significant time difference.
> Please note that I started this thread stating that when I am using
> io.StringIO(f.read()) in the code it uses a memory of almost 4-5 times
> the input file size. Now using read() or readlines() it has reduced to
> 1.5 times...

Yes, a raw string is always going to be more efficient in memory use
than StringIO.

> Also, as I have mentioned, I can't afford to run my code using 4-5
> times the memory. The total resource available on my server is about
> 180 GB of memory (approx 64 GB RAM + 128 GB swap).

OK. There is a huge difference between having 100G of RAM and having
64G + 128G swap. Swap is basically disk, so if you are reading your data
into memory and that memory is bouncing in and out of swap, things will
slow down by an order of magnitude. You need to try to optimise to use
real RAM and minimise use of swap.

> So before starting to process my 30-50GB input files I am keen to know
> the best way.

Performance tuning is always a tricky topic and needs to be done on a
case by case basis. There are simply too many factors to try to say that
method (A) will always be faster than method (B). It depends on the
nature of the data, its source, your target data structures, your
algorithms, your output format and target, etc., as well as the physical
machines being used... We need a lot more detail about the task before
we can give any solid advice. And even then you should verify it before
assuming you are done.

Alan G.
Re: [Tutor] Load Entire File into memory
On Mon, Nov 4, 2013 at 9:41 AM, Amal Thomas wrote:

> @Steven: Thank you... My input data is basically AUGC and newlines...
> I would like to know about the bytearray technique. Please suggest me
> some links or references.. I will go through the profiler and check
> whether the code maintains linearity with the input files.

Hi Amal,

I suspect that what's been missing here throughout this thread is more
concrete information about the problem's background. I would strongly
suggest we make sure that we understand the problem before making more
assumptions.

1. What is the nature of the operation that you are doing on your data?

Can you briefly discuss its details? Does it involve random access, or
is it a sequential operation? Are the operations independent regardless
of what line you are on, or is there some kind of dependency across
lines? Does it involve pattern matching, or...? Are you maintaining some
in-memory data structure as you're walking through the file?

The reason why we need to know this is because it can affect file access
patterns. It may provide a hint as to whether or not you can avoid
loading the whole file into memory. It may even affect whether or not
you can distribute your work among several computers.

Here's also why it's important to talk more about what the problem is
trying to solve. Your question has been assuming that the dominating
factor in your program's runtime is the access of your data, and that
loading the entire file into memory will improve performance. But I see
no evidence to support that assumption yet. Why should I not believe
that the time is being spent paging in virtual memory, for example, due
to something else in your program's operations? In that case, trying to
load the file entirely into memory will be counterproductive.

2. What is the format of your input data?

You mention it is AUGC and newlines, but more details would be really
helpful. Why is it line-oriented, for example? I mean that as a serious
question. Is it significant? Is it a FASTA file? Is it some kind of
homebrewed format? Please be as specific as you can here: you may be
duplicating effort that folks who have spent _years_ on sequence-reading
libraries have already done for you.

Specifically, you might be able to reuse Biopython's libraries for
sequence IO:

http://biopython.org/wiki/SeqIO

By trying to cook up file parsing by yourself, you may be making a
mistake. For example, there might be issues in Python 3 due to Unicode
encodings:

http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

which might contribute to an unexpected increase in the size of a
string's memory representation. Hard to say, since it depends on a host
of factors. But knowing that, other folks have probably encountered and
solved this problem already. Concretely, I'm pretty sure Biopython's
SeqIO does the Right Thing in terms of reading files in binary mode and
reading the line contents as bytes, as opposed to regular strings, and
representing the sequence in some memory-efficient way. At the very
least, I know that they think about these kinds of problems a lot:

http://web.archiveorange.com/archive/v/5dAwXDMfufikePQqtPgx

Probably a lot more than us. :P So if it's possible, try to leverage
what's already out there. You should almost certainly not be writing
your own sequence-reading code.
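If the data does turn out to be FASTA, the Biopython route is roughly
this (the file name is a made-up example):

    from Bio import SeqIO

    # SeqIO.parse streams records one at a time rather than loading
    # the whole file into memory.
    for record in SeqIO.parse("reads.fasta", "fasta"):
        print(record.id, len(record.seq))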
Re: [Tutor] Load Entire File into memory
On 4/11/2013 11:26, Amal Thomas wrote:

> @Dave: thanks.. By the way I am running my codes on a server with
> about 100GB ram but I can't afford my code to use 4-5 times the size
> of the text file. Now I am using read() / readlines(); these seem to
> be more efficient in memory usage than io.StringIO(f.read()).

Sorry, I misspoke about read() on a large file. I was confusing it with
something else. However, note that in any environment, if you have a
large buffer and you force the system to copy that large buffer, you'll
be using (temporarily at least) twice the space. And usually the
original can't be freed, for various technical reasons.

The real question is how you're going to be addressing the data, and
what constraints are on that data. Since you think you need it all in
memory, you clearly are planning to access it randomly. Since the data
is apparently ASCII characters, and you're running at least 3.3, you
won't be paying the penalty if it turns out to be strings. But there may
be alternate ways of encoding each line which save space and/or make it
faster to use. One big buffer holding an image of the file is likely to
be one of the worst.

Are the lines variable length? Do you ever deal randomly with a portion
of a line, or only the whole thing? If the line is multiple ASCII
characters, is their order significant? How many different symbols can
appear in a single line? How many different ones total (probably
excluding the newline)? What's the average line length?

Each of these questions may lead to exploring different optimization
strategies. But I've done enough speculating.

-- 
DaveA
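As one example of an alternate per-line encoding: with only the four
symbols A, U, G and C, each base fits in 2 bits, so a line can be packed
into a quarter of the space of even a bytes object. A hypothetical
sketch (you would also need to keep each line's length around in order
to unpack it again):

    CODE = {'A': 0, 'U': 1, 'G': 2, 'C': 3}

    def pack(line):
        """Pack an AUGC-only string, 4 bases per byte."""
        out = bytearray()
        acc = nbits = 0
        for base in line:
            acc = (acc << 2) | CODE[base]
            nbits += 2
            if nbits == 8:
                out.append(acc)
                acc = nbits = 0
        if nbits:
            out.append(acc << (8 - nbits))  # left-pad the final byte
        return bytes(out)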
Re: [Tutor] Load Entire File into memory
On Tue, 5 Nov 2013 02:53:41 +1100, Steven D'Aprano wrote:

> Dave, do you have a reference for that? As far as I can tell, read()
> will read to EOF unless you open the file in non-blocking mode.

No. I must just be remembering something from another language. Sorry.

-- 
DaveA
Re: [Tutor] Load Entire File into memory
@Steven: Thank you... My input data is basically AUGC and newlines... I
would like to know about the bytearray technique. Please suggest me some
links or references.. I will go through the profiler and check whether
the code maintains linearity with the input files.

> It's probably worth putting some more timing statements into your code
> to see where the time is going because it's not the reading from the
> disk that's the problem.
>
> The first thing I would do is run the code on three smaller sample
> files:
>
> 50MB
> 100MB
> 200MB
>
> The time taken should approximately double as you double the size of
> the file: say it takes 2 hours to process the 50MB file, 4 hours for
> the 100MB file and 8 hours for the 200MB file; that's linear
> performance and isn't too bad.
>
> But if performance isn't linear, say 2 hours, 4 hours, 16 hours, then
> you're in trouble and you *desperately* need to reconsider the
> algorithm being used. Either that, or just accept that this is an
> inherently slow calculation and it will take a week or two.
>
> Amal, another thing you should try is to use the Python profiler on
> your code (again, on a smaller sample file). The profiler will show
> you where the time is being spent.
>
> Unfortunately the profiler may slow your code down, so it is important
> to use it on manageable sized data. The profiler is explained here:
>
> http://docs.python.org/3/library/profile.html
>
> If you need any help, don't hesitate to ask.
>
> > > trying to optimize my code to get the outputs in less time and
> > > memory efficiently.
> >
> > Memory efficiency is easy, do it line by line off the disk.
>
> This assumes that you can process one line at a time, sequentially. I
> expect that is not the case.
>
> --
> Steven

-- 
AMAL THOMAS
Re: [Tutor] Load Entire File into memory
On Mon, Nov 04, 2013 at 04:54:16PM, Alan Gauld wrote:
> On 04/11/13 16:34, Amal Thomas wrote:
> > @Joel: The code runs for weeks.. The input file which I have to
> > process is very huge (in 50 GBs). So it's not a matter of hours,
> > it's a matter of days and weeks..
>
> OK, but that's not down to reading the file from disk. Reading a 50G
> file will only take a few minutes if you have enough RAM, which seems
> to be the case.

Not really. There is still some uncertainty (at least in my mind!). For
instance, I assume that Amal doesn't have sole access to the server. So
there could be another dozen users all trying to read 50GB files at
once, on a machine with only 100GB of memory... Once the server starts
paging, performance will plummet.

> If it's taking days/weeks you must be doing some incredibly time
> consuming processing.

Well, yes, it's biology :-)

> It's probably worth putting some more timing statements into your code
> to see where the time is going because it's not the reading from the
> disk that's the problem.

The first thing I would do is run the code on three smaller sample
files:

50MB
100MB
200MB

The time taken should approximately double as you double the size of the
file: say it takes 2 hours to process the 50MB file, 4 hours for the
100MB file and 8 hours for the 200MB file; that's linear performance and
isn't too bad.

But if performance isn't linear, say 2 hours, 4 hours, 16 hours, then
you're in trouble and you *desperately* need to reconsider the algorithm
being used. Either that, or just accept that this is an inherently slow
calculation and it will take a week or two.

Amal, another thing you should try is to use the Python profiler on your
code (again, on a smaller sample file). The profiler will show you where
the time is being spent.

Unfortunately the profiler may slow your code down, so it is important
to use it on manageable sized data. The profiler is explained here:

http://docs.python.org/3/library/profile.html

If you need any help, don't hesitate to ask.

> > trying to optimize my code to get the outputs in less time and
> > memory efficiently.
>
> Memory efficiency is easy, do it line by line off the disk.

This assumes that you can process one line at a time, sequentially. I
expect that is not the case.

-- 
Steven
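For reference, the two simplest ways to invoke that profiler on a sample
run (the script and function names here are placeholders):

    python3 -m cProfile -s cumulative myscript.py

or, from inside the program:

    import cProfile
    cProfile.run('main()', sort='cumulative')  # assumes a main() entry point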
Re: [Tutor] Load Entire File into memory
On 04/11/13 16:34, Amal Thomas wrote:

> @Joel: The code runs for weeks.. The input file which I have to
> process is very huge (in 50 GBs). So it's not a matter of hours, it's
> a matter of days and weeks..

OK, but that's not down to reading the file from disk. Reading a 50G
file will only take a few minutes if you have enough RAM, which seems to
be the case. If it's taking days/weeks you must be doing some incredibly
time consuming processing.

It's probably worth putting some more timing statements into your code
to see where the time is going, because it's not the reading from the
disk that's the problem.

> trying to optimize my code to get the outputs in less time and memory
> efficiently.

Memory efficiency is easy, do it line by line off the disk. Time
efficiency is most likely down to your processing algorithm if we are
talking about days.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos
Re: [Tutor] Load Entire File into memory
On Mon, Nov 04, 2013 at 11:27:52AM -0500, Joel Goldstick wrote:

> If you are new to python why are you so concerned about the speed of
> your code.

Amal is new to Python but he's not new to biology; he's a 4th year
student. With a 50GB file, I expect he is analysing something to do with
DNA sequencing, which, depending on exactly what he is trying to do,
could involve O(N) or even O(N**2) algorithms.

An O(N) algorithm on a 50GB file, assuming 100,000 steps per second,
will take over 5 days to complete. An O(N**2) algorithm, well, it's
nearly unthinkable: nearly 800 million years. You *really* don't want
O(N**2) algorithms with big data.

I would expect that with a big DNA sequencing problem, running time
would be measured in days rather than minutes or hours. So yes, this is
probably a case where optimizing for speed is not premature.

We really don't know enough about his problem to advise him on how to
speed it up. If the data file is guaranteed to be nothing but GCTA bases
and newlines, it may be better to read the data file into memory as a
bytearray rather than a string, especially if he needs to modify it in
place. But this is getting into some fairly advanced territory; I
wouldn't like to predict what will be faster without testing on real
data.

-- 
Steven
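A sketch of that bytearray idea (assuming the file really is nothing but
bases and newlines):

    with open('output.txt', 'rb') as f:  # binary mode: bytes, not str
        data = bytearray(f.read())       # mutable, one byte per character

    print(data.count(b'A'))              # e.g. how many A bases
    data[0:1] = b'G'                     # unlike str, editable in place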
Re: [Tutor] Load Entire File into memory
@Steven: Thanks... Right now I can't access the files. I will send you
the output when I can.

> Please try this little bit of code, replacing the file name with the
> actual name of your 50GB data file:
>
> import os
> filename = "YOUR FILE NAME HERE"
> print("File size:", os.stat(filename).st_size)
> f = open(filename)
> content = f.read()
> print("Length of content actually read:", len(content))
> print("Current file position:", f.tell())
> f.close()
>
> and send us the output.

-- 
AMAL THOMAS
Re: [Tutor] Load Entire File into memory
@Joel: The code runs for weeks.. The input file which I have to process
is very huge (in 50 GBs). So it's not a matter of hours, it's a matter
of days and weeks.. I was using C++. Recently I switched over to Python.
I am trying to optimize my code to get the outputs in less time and
memory efficiently.

On Mon, Nov 4, 2013 at 9:57 PM, Joel Goldstick wrote:

> If you are new to python why are you so concerned about the speed of
> your code. You never say how long it takes. Do these files take hours
> to process? Or minutes or seconds? I suggest you write your code in a
> way that is clear and understandable, then try to optimize it if
> necessary.

-- 
AMAL THOMAS
Re: [Tutor] Load Entire File into memory
On Mon, Nov 04, 2013 at 07:00:29PM +0530, Amal Thomas wrote:

> Yes, I have found that loading to RAM and then reading line by line
> saves a huge amount of time, since my text files are very huge.

This is remarkable, and quite frankly incredible. I wonder whether you
are misinterpreting what you are seeing?

Under normal circumstances, with all but quite high-end machines, trying
to read a 50GB file into memory all at once will be effectively
impossible. Suppose your computer has 24GB of RAM. The OS and other
running applications can be expected to use some of that, but even
ignoring this, it is impossible to read a 50GB file into memory all at
once with only 24GB.

What I would expect is that unless you have *at least* double the amount
of memory as the size of the file (in this case, at least 100GB), either
Python will give you a MemoryError, or the operating system will try
paging memory into swap space, which is *painfully* slow. I've been in
the situation where I accidentally tried reading a file bigger than the
installed RAM, and it ran overnight (14+ hours), locked up and stopped
responding, and I finally had to unplug the power and restart the
machine.

So unless you have 100+ GB in your computer, which would put it in
seriously high-end server class, I find it difficult to believe that you
are actually reading the entire file into memory.

Please try this little bit of code, replacing the file name with the
actual name of your 50GB data file:

    import os
    filename = "YOUR FILE NAME HERE"
    print("File size:", os.stat(filename).st_size)
    f = open(filename)
    content = f.read()
    print("Length of content actually read:", len(content))
    print("Current file position:", f.tell())
    f.close()

and send us the output.

Thanks,

-- 
Steven
Re: [Tutor] Load Entire File into memory
"I am new to python. I am working in computational biology and I have to deal with text files of huge size. I know how to read line by line from a text file. I want to know the best method in python3 to load the enire file into ram and do the operations.(since this saves time)" If you are new to python why are you so concerned about the speed of your code. You never say how long it takes. Do these files take hours to process? or minutes or seconds?I suggest you write your code in a way that is clear and understandable, then try to optimize it if necessary. > > ___ > Tutor maillist - Tutor@python.org > To unsubscribe or change subscription options: > https://mail.python.org/mailman/listinfo/tutor -- Joel Goldstick http://joelgoldstick.com ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Load Entire File into memory
@Dave: thanks.. By the way I am running my codes on a server with about
100GB ram but I can't afford my code to use 4-5 times the size of the
text file. Now I am using read() / readlines(); these seem to be more
efficient in memory usage than io.StringIO(f.read()).

On Mon, Nov 4, 2013 at 9:23 PM, Steven D'Aprano wrote:

> On Mon, Nov 04, 2013 at 02:48:11PM, Dave Angel wrote:
>
> > Now I understand. Processing line by line is slower because it
> > actually reads the whole file. The code you showed earlier:
> >
> > > I am currently using this method to load my text file:
> > > f = open("output.txt")
> > > content = io.StringIO(f.read())
> > > f.close()
> > > But I have found that this method uses 4 times the size of the
> > > text file.
> >
> > will only read a tiny portion of the file. You don't have any loop
> > on the read() statement, you just read the first buffer full. So
> > naturally it'll be much faster.
>
> Dave, do you have a reference for that? As far as I can tell, read()
> will read to EOF unless you open the file in non-blocking mode.
>
> http://docs.python.org/3/library/io.html#io.BufferedIOBase.read
>
> > I am of course assuming you don't have a machine with 100+ gig of
> > RAM.
>
> There is that, of course. High-end servers can have multiple hundreds
> of GB of RAM, but desktop and laptop machines rarely have anywhere
> near that.
>
> --
> Steven

-- 
AMAL THOMAS
Fourth Year Undergraduate Student
Department of Biotechnology
IIT KHARAGPUR-721302
Re: [Tutor] Load Entire File into memory
On Mon, Nov 04, 2013 at 02:48:11PM, Dave Angel wrote:

> Now I understand. Processing line by line is slower because it
> actually reads the whole file. The code you showed earlier:
>
> > I am currently using this method to load my text file:
> > f = open("output.txt")
> > content = io.StringIO(f.read())
> > f.close()
> > But I have found that this method uses 4 times the size of the text
> > file.
>
> will only read a tiny portion of the file. You don't have any loop on
> the read() statement, you just read the first buffer full. So
> naturally it'll be much faster.

Dave, do you have a reference for that? As far as I can tell, read()
will read to EOF unless you open the file in non-blocking mode.

http://docs.python.org/3/library/io.html#io.BufferedIOBase.read

> I am of course assuming you don't have a machine with 100+ gig of RAM.

There is that, of course. High-end servers can have multiple hundreds of
GB of RAM, but desktop and laptop machines rarely have anywhere near
that.

-- 
Steven
Re: [Tutor] Load Entire File into memory
On 4/11/2013 09:04, Amal Thomas wrote:

> @William: Thanks,
>
> My line size varies from 40 to 550 characters. Please note that the
> text file which I have to process is in gigabytes (approx 50 GB).
> This was the code which I used to process line by line without
> loading into memory.

Now I understand. Processing line by line is slower because it actually
reads the whole file. The code you showed earlier:

> I am currently using this method to load my text file:
> f = open("output.txt")
> content = io.StringIO(f.read())
> f.close()
> But I have found that this method uses 4 times the size of the text
> file.

will only read a tiny portion of the file. You don't have any loop on
the read() statement, you just read the first buffer full. So naturally
it'll be much faster.

I am of course assuming you don't have a machine with 100+ gig of RAM.

By the way, would you please stop posting html messages to a text
newsgroup? They come out completely blank on my other newsreader (except
for the Tutor maillist trailer, which IS text), and I end up having to
fire up this one just to read your messages. But even if you don't care
about me, you should realize that messages sent as html are very
frequently garbled when they're interpreted as text.

-- 
DaveA
Re: [Tutor] Load Entire File into memory
@William: Thanks,

My line size varies from 40 to 550 characters. Please note that the text
file which I have to process is in gigabytes (approx 50 GB). This was
the code which I used to process line by line without loading into
memory:

    for lines in open('uniqname.txt'):
        ...

On Mon, Nov 4, 2013 at 7:16 PM, William Ray Wing wrote:

> On Nov 4, 2013, at 8:30 AM, Amal Thomas wrote:
>
> How long are the lines in your file? In particular, are they many
> hundreds or thousands of characters long, or are they only a few
> hundred characters, say 200 or less?
>
> Unless they are so long as to exceed the normal buffer size of your
> OS's read-ahead buffer, I strongly suspect that the big time sink in
> your attempt to read line-by-line was some inadvertent inefficiency
> that you introduced. Normally, when reading from a text file, python
> buffers the reads (or uses the host OS buffering). Those reads pull in
> huge chunks of text WAY ahead of where the actual python processing is
> going on, and are VERY efficient.
>
> -Bill

-- 
AMAL THOMAS
Re: [Tutor] Load Entire File into memory
Hi,

@Peter: I have checked the execution time manually as well as found it
through my code. During execution of my code, at the start, I stored my
initial time (start time) in a variable and at the end calculated the
time taken to run the code = end time - start time. There was a
significant difference in time.

Thanks,

On Mon, Nov 4, 2013 at 7:11 PM, Peter Otten <__pete...@web.de> wrote:

> Amal Thomas wrote:
>
> > Yes, I have found that loading to RAM and then reading line by line
> > saves a huge amount of time, since my text files are very huge.
>
> How exactly did you find out? You should only see a speed-up if you
> iterate over the data at least twice.

-- 
AMAL THOMAS
Fourth Year Undergraduate Student
Department of Biotechnology
IIT KHARAGPUR-721302
Re: [Tutor] Load Entire File into memory
On Nov 4, 2013, at 8:30 AM, Amal Thomas wrote:

> Yes, I have found that loading to RAM and then reading line by line
> saves a huge amount of time, since my text files are very huge.
>
> [huge snip]
>
> --
> AMAL THOMAS
> Fourth Year Undergraduate Student
> Department of Biotechnology
> IIT KHARAGPUR-721302

How long are the lines in your file? In particular, are they many
hundreds or thousands of characters long, or are they only a few hundred
characters, say 200 or less?

Unless they are so long as to exceed the normal buffer size of your OS's
read-ahead buffer, I strongly suspect that the big time sink in your
attempt to read line-by-line was some inadvertent inefficiency that you
introduced. Normally, when reading from a text file, python buffers the
reads (or uses the host OS buffering). Those reads pull in huge chunks
of text WAY ahead of where the actual python processing is going on, and
are VERY efficient.

-Bill
Re: [Tutor] Load Entire File into memory
Amal Thomas wrote:

> Yes, I have found that loading to RAM and then reading line by line
> saves a huge amount of time, since my text files are very huge.

How exactly did you find out? You should only see a speed-up if you
iterate over the data at least twice.
Re: [Tutor] Load Entire File into memory
Yes, I have found that loading to RAM and then reading line by line
saves a huge amount of time, since my text files are very huge.

On Mon, Nov 4, 2013 at 6:46 PM, Alan Gauld wrote:

> On 04/11/13 13:06, Amal Thomas wrote:
>
> > Present code:
> >
> > f = open("output.txt")
> > content = f.read().split('\n')
> > f.close()
>
> If your objective is to save time, then you should replace this with
> f.readlines(), which will save you reprocessing the entire file to
> remove the newlines.
>
> > for lines in content:
> >     ...
> > content.clear()
>
> But if you are processing line by line, what makes you think that
> reading the entire file into RAM and then reprocessing it is faster
> than reading it line by line?
>
> Have you tried that on another file and measured any significant
> improvement? There are times when reading into RAM is faster but I'm
> not sure this will be one of them.
>
>     for line in f:
>         process line
>
> may be your best bet.
>
> > f = open("output.txt")
> > content = io.StringIO(f.read())
> > f.close()
> > for lines in content:
> >     ...
> > content.close()

-- 
AMAL THOMAS
Fourth Year Undergraduate Student
Department of Biotechnology
IIT KHARAGPUR-721302
Re: [Tutor] Load Entire File into memory
On 04/11/13 13:06, Amal Thomas wrote:

> Present code:
>
> f = open("output.txt")
> content = f.read().split('\n')
> f.close()

If your objective is to save time, then you should replace this with
f.readlines(), which will save you reprocessing the entire file to
remove the newlines.

> for lines in content:
>     ...
> content.clear()

But if you are processing line by line, what makes you think that
reading the entire file into RAM and then reprocessing it is faster than
reading it line by line?

Have you tried that on another file and measured any significant
improvement? There are times when reading into RAM is faster but I'm not
sure this will be one of them.

    for line in f:
        process line

may be your best bet.

> f = open("output.txt")
> content = io.StringIO(f.read())
> f.close()
> for lines in content:
>     ...
> content.close()

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos
Re: [Tutor] Load Entire File into memory
Hi,

Thanks Alan. Now I have made changes in the code.

Present code:

    f = open("output.txt")
    content = f.read().split('\n')
    f.close()
    for lines in content:
        ...
    content.clear()

Previous code:

    f = open("output.txt")
    content = io.StringIO(f.read())
    f.close()
    for lines in content:
        ...
    content.close()

Now I have found that memory use is roughly 1.5 times the size of the
text file. Previously it was around 4-5 times. That's a remarkable
change. Waiting for more suggestions.

Thanks,

On Mon, Nov 4, 2013 at 5:05 PM, Alan Gauld wrote:

> On 04/11/13 11:07, Amal Thomas wrote:
>
> > I am currently using this method to load my text file:
> > f = open("output.txt")
> > content = io.StringIO(f.read())
> > f.close()
> >
> > But I have found that this method uses 4 times the size of the text
> > file.
>
> So why not use
>
>     f = open("output.txt")
>     content = f.read()
>     f.close()
>
> and process the file as a raw string?
>
> Is there a reason for using the StringIO?
>
> --
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> http://www.flickr.com/photos/alangauldphotos

-- 
AMAL THOMAS
Fourth Year Undergraduate Student
Department of Biotechnology
IIT KHARAGPUR-721302
Re: [Tutor] Load Entire File into memory
On 04/11/13 11:07, Amal Thomas wrote:

> I am currently using this method to load my text file:
>
> f = open("output.txt")
> content = io.StringIO(f.read())
> f.close()
>
> But I have found that this method uses 4 times the size of the text
> file.

So why not use

    f = open("output.txt")
    content = f.read()
    f.close()

and process the file as a raw string?

Is there a reason for using the StringIO?

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos
[Tutor] Load Entire File into memory
Hi,

I am new to python. I am working in computational biology and I have to
deal with text files of huge size. I know how to read line by line from
a text file. I want to know the best method in *python3* to load the
entire file into RAM and do the operations (since this saves time).

I am currently using this method to load my text file:

    f = open("output.txt")
    content = io.StringIO(f.read())
    f.close()

But I have found that this method uses 4 times the size of the text file
(if output.txt is 1 GB, the total RAM usage of the code is approx
3.5 GB :( ). Kindly suggest me a better way to do this.

Working on Python 3.3.1, Ubuntu 13.04 (Linux 3.8.0-29-generic x64).

Thanks

-- 
AMAL THOMAS