Re: [Tutor] Load Entire File into memory

2013-11-05 Thread eryksun
On Mon, Nov 4, 2013 at 11:26 AM, Amal Thomas  wrote:
> @Dave: thanks.. By the way I am running my codes on a server with about
> 100GB ram but I cant afford my code to use 4-5 times the size of the text
> file. Now I am using  read() / readlines() , these seems to be more
> efficient in memory usage than io.StringIO(f.read()).

f.read() creates a string to initialize a StringIO object. You could
instead initialize a BytesIO object with a mapped file; that should
cut the peak RSS down by half. If you need decoded text, add a
TextIOWrapper.

import io
import mmap

with open('output.txt') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mf:
        content = io.TextIOWrapper(io.BytesIO(mf))

for line in content:
    'process line'

However, before you do something extreme (like say... loading a 50 GiB
file into RAM), try tweaking the TextIOWrapper object's readline() by
increasing _CHUNK_SIZE. This can be up to 2**63-1 in a 64-bit process.

with open('output.txt') as content:
    content._CHUNK_SIZE = 65536
    for line in content:
        'process line'

Check content.buffer.tell() to confirm that the file pointer is
increasing in steps of the given chunk size.

Built-in open() also lets you set the "buffering" size for the
BufferedReader, content.buffer. However, in this case I don't think
you need to worry about it. content.readline() calls
content.buffer.read1() to read directly from the FileIO object,
content.buffer.raw.
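If you do want to experiment with it anyway, a minimal sketch (the 1 MiB
figure is only an example, not a recommendation):

with open('output.txt', buffering=1024 * 1024) as content:   # 1 MiB buffer (arbitrary)
    for line in content:
        pass  # process line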


Re: [Tutor] Load Entire File into memory

2013-11-05 Thread William Ray Wing
On Nov 5, 2013, at 11:12 AM, Alan Gauld  wrote:

> On 05/11/13 02:02, Danny Yoo wrote:
> 
>> To visualize the sheer scale of the problem, see:
>> 
>> http://i.imgur.com/X1Hi1.gif
>> 
>> which would normally be funny, except that it's not quite a joke.  :P
> 
> I think I'm missing something. All I see in Firefox is
> a vertical red bar. And in Chrome I don't even get that,
> just a blank screen...
> 
> ???
> 
> -- 
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> http://www.flickr.com/photos/alangauldphotos

It took me a while…  If you put your cursor in the extreme upper left-hand
corner of that red bar, you get a + sign that lets you expand the image.
In the expansion you will see text that explains the graphical scales in
question: a pixel (L1 cache), a short bar of pixels (L2 cache), a longer bar
(RAM), and finally that huge block of pixels that represents disk latency.

-Bill


Re: [Tutor] Load Entire File into memory

2013-11-05 Thread Steven D'Aprano
On Tue, Nov 05, 2013 at 04:12:51PM +, Alan Gauld wrote:
> On 05/11/13 02:02, Danny Yoo wrote:
> 
> >To visualize the sheer scale of the problem, see:
> >
> >http://i.imgur.com/X1Hi1.gif
> >
> >which would normally be funny, except that it's not quite a joke.  :P
> 
> I think I'm missing something. All I see in Firefox is
> a vertical red bar. And in Chrome I don't even get that,
> just a blank screen...

Can't speak for Chrome, sounds like a bug. But Firefox defaults to 
scaling pictures to fit within the window. If you mouse over the image, 
you'll get a magnifying glass pointer. Click on the image, and it will 
redisplay at full size. Scroll to the very top, and you will find a 
little bit of text, a single red pixel representing the latency of cache 
memory, a dozen or so red pixels representing the latency of RAM, and a 
monstrously huge block of thousands and thousands and thousands of red 
pixels representing the latency of hard drives.

Feel free to scroll and count the pixels :-)
 

-- 
Steven


Re: [Tutor] Load Entire File into memory

2013-11-05 Thread Alan Gauld

On 05/11/13 02:02, Danny Yoo wrote:


To visualize the sheer scale of the problem, see:

http://i.imgur.com/X1Hi1.gif

which would normally be funny, except that it's not quite a joke.  :P


I think I'm missing something. All I see in Firefox is
a vertical red bar. And in Chrome I don't even get that,
just a blank screen...

???

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos



Re: [Tutor] Load Entire File into memory

2013-11-05 Thread Oscar Benjamin
On 5 November 2013 13:20, Amal Thomas  wrote:
> On Mon, Nov 4, 2013 at 10:00 PM, Steven D'Aprano 
> wrote:
>>
>
>>
>> import os
>> filename = "YOUR FILE NAME HERE"
>> print("File size:", os.stat(filename).st_size)
>> f = open(filename)
>> content = f.read()
>> print("Length of content actually read:", len(content))
>> print("Current file position:", f.tell())
>> f.close()
>>
>>
>> and send us the output.
>
>
>  This is the output:
>File size: 50297501884
>Length of content actually read: 50297501884
>Current file position: 50297501884
> This Code used 61.4 GB RAM and 59.6 GB swap (I had ensured that no other
> important process were running in my server before running this :D)

If you are using the swap then this will almost certainly be the
biggest slowdown in your program (and any other program on the same
machine). If there is some way to reorganise your code so that you
don't need to load everything into memory then (if you do it right)
you should be able to make it *much* faster.


Oscar


Re: [Tutor] Load Entire File into memory

2013-11-05 Thread Oscar Benjamin
On 4 November 2013 17:41, Amal Thomas  wrote:
> @Steven: Thank you...My input data is basically AUGC and newlines... I would
> like to know about bytearray technique. Please suggest me some links or
> reference.. I will go through the profiler and check whether the code
> maintains linearity with the input files.

Amal can you just give *some* explanation of what you're doing?

I can think of many possible ways to optimise it or to use less memory
but it depends on what you're actually doing and you really haven't
given enough information.

If you show a short data sample and a short piece of code that is
similar to what you are doing then people here would be in a better
position to help. Please read this: http://sscce.org/

You have repeatedly made claims such as "X is faster than Y", but
you've also said that you are a beginner to Python. Be aware that
testing speeds on a small file is completely different from testing on
a large file, and that testing speeds on a file that is already cached
by the OS is completely different from testing on one that is not.
There are many things that can make a difference.


Oscar


Re: [Tutor] Load Entire File into memory

2013-11-05 Thread Amal Thomas
On Mon, Nov 4, 2013 at 10:00 PM, Steven D'Aprano 
wrote:
>

>
> import os
> filename = "YOUR FILE NAME HERE"
> print("File size:", os.stat(filename).st_size)
> f = open(filename)
> content = f.read()
> print("Length of content actually read:", len(content))
> print("Current file position:", f.tell())
> f.close()
>
>
> and send us the output.


 This is the output:
   File size: 50297501884
   Length of content actually read: 50297501884
   Current file position: 50297501884
This code used 61.4 GB RAM and 59.6 GB swap (I had ensured that no other
important processes were running on my server before running this :D)


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Stefan Behnel
Amal Thomas, 04.11.2013 14:55:
> I have checked the execution time manually as well as I found it through my
> code. During execution of my code, at start, I stored my initial time(start
> time) to a variable  and at the end calculated time taken to run the code =
> end time - start time. There was a significance difference in time.

You should make sure that there are no caching effects here. Your operating
system may have loaded the file into memory (assuming that you have enough
of that) after the first read and then served it from there when you ran
the second benchmark.

So, make sure you measure the time twice for both, preferably running both
benchmarks in reverse order the second time.
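A rough sketch of that kind of comparison (read_all and iterate_lines are
just stand-ins for the two variants being compared, and only the relative
numbers across repeated runs mean anything):

import time

def read_all(path):
    with open(path) as f:
        for line in f.readlines():   # loads the whole file first
            pass

def iterate_lines(path):
    with open(path) as f:
        for line in f:               # streams line by line
            pass

def timed(func, path):
    start = time.perf_counter()
    func(path)
    return time.perf_counter() - start

order = [read_all, iterate_lines]
for _ in range(2):
    for func in order:
        print(func.__name__, timed(func, 'output.txt'))
    order.reverse()   # run them in the opposite order the second time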

That being said, it's not impossible that f.readlines() is faster than
line-wise iteration, because it knows right from the start that it will
read the entire file, so it can optimise for it (didn't check if it
actually does, this might have changed in Py3.3, for example).

Stefan




Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Steven D'Aprano
On Mon, Nov 04, 2013 at 06:02:47PM -0800, Danny Yoo wrote:

> To visualize the sheer scale of the problem, see:
> 
> http://i.imgur.com/X1Hi1.gif
> 
> which would normally be funny, except that it's not quite a joke.  :P

Nice visualisation! Was that yours?

> So you want to minimize hard disk usage as much as possible.  "Thrashing"
> is precisely the situation you do not want to have when running a large
> analysis.

Yes, thrashing is a disaster for performance. But I think that hard 
drive latency fails to demonstrate just how big a disaster. True, hard 
drive access time is about 100 times slower than RAM. But that's just a 
constant scale factor. What really, really kills performance with 
thrashing is that the memory manager is trying to move chunks of memory 
around, and the overall "move smaller blocks of memory around to make 
space for a really big block" algorithm ends up with quadratic (or 
worse!) performance.


-- 
Steven



Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Steven D'Aprano
I mostly agree with Alan, but a couple of little quibbles:

On Tue, Nov 05, 2013 at 01:10:39AM +, ALAN GAULD wrote:

> >@Alan: Thanks.. I have checked the both ways( reading line by line by not 
> >loading into ram , 
> > other loading entire file to ram and then reading line by line)  for files 
> > with 2-3GB. 
> 
> OK, But 2-3G will nearly always live entirely in RAM on a modern computer.

Speak for yourself. Some of us are still using "modern computers" with 
1-2 GB of RAM :-(


> > Only change which i have done is in the reading part , rest of the code was 
> > kept same. 
> > There was significant time difference. Please note that I started this 
> > thread stating that 
> > when I am using io.StringIO(f.read()) in code it uses a memory of almost 
> > 4-5 times the 
> > input file size. Now using read() or readlines() it has reduced to 1.5 
> > times... 
> 
> Yes a raw string is always going to be more efficient in memory use than 
> StringIO.

It depends what you're doing with it. The beauty of StringIO is that it 
emulates an in-memory file, so you can modify it in place. String 
objects are immutable and cannot be modified in place, so if you have to 
make changes to it, you have to make a copy with the change. For large 
strings, say, over 100MB, the overhead can get painful.
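A tiny illustration of the difference (the data here is made up):

import io

buf = io.StringIO("AUGC\nAUGC\n")
buf.seek(0)
buf.write("CCCC")        # overwrites the first four characters in place
print(buf.getvalue())    # -> 'CCCC\nAUGC\n'

s = "AUGC\nAUGC\n"
s = "CCCC" + s[4:]       # a str has to be rebuilt, copying the rest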


> > Also as I have mentioned I cant afford to run my code using 4-5 times 
> > memory. 
> > Total resource available in my server is about 180 GB memory (approx 64 GB 
> > RAM + 128GB swap). 
> 
> OK, There is a huge difference between having 100G of RAM and having 64G+128G 
> swap.
> swap is basically disk so if you are reading your data into memory and that 
> memory is 
> bouncing in and out of swap things will slow down by an order of magnitude. 

At least. Hard drive technology is more like two or even three orders of 
magnitude slower than RAM access (100 or 1000 times slower), and 
including the overhead of the memory manager moving things about, there 
is no upper limit to how large the penalty can be. If you get away with 
only 10 times slower, you're lucky. In my experience, 100-1000 times 
slower is more common (although my experience is on machines with fairly 
small amounts of RAM in the first place) and sometimes slow enough that 
even the operating system stops responding.

Plan to avoid using swap space :-)

> You need to try to optimise to use real RAM and minimise use of swap. 

Agreed.


-- 
Steven


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Danny Yoo
>
> You _must_ avoid swap at all costs here.  You may not understand the
> point, so a little more explanation: touching swap is several orders of
> magnitude more expensive than anything else you are doing in your program.
>
> CPU operations are on the order of nanoseconds. (10^-9)
>
> Disk operations are on the order of milliseconds.  (10^-3)
>
> References:
>
> http://en.wikipedia.org/wiki/Instructions_per_second
>
> http://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics
>
>

To visualize the sheer scale of the problem, see:

http://i.imgur.com/X1Hi1.gif

which would normally be funny, except that it's not quite a joke.  :P


So you want to minimize hard disk usage as much as possible.  "Thrashing"
is precisely the situation you do not want to have when running a large
analysis.


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Danny Yoo
>
>
> > Also as I have mentioned I cant afford to run my code using 4-5 times
> memory.
> > Total resource available in my server is about 180 GB memory (approx 64
> GB RAM + 128GB swap).
>
> OK, There is a huge difference between having 100G of RAM and having
> 64G+128G swap.
> swap is basically disk so if you are reading your data into memory and
> that memory is
> bouncing in and out of swap things will slow down by an order of
> magnitude.
> You need to try to optimise to use real RAM and minimise use of swap.
>


I concur with Alan, and want to state his point more forcefully.  If you
are hitting swap, you are computationally DOOMED and must do something
different.


You _must_ avoid swap at all costs here.  You may not understand the point,
so a little more explanation: touching swap is several orders of magnitude
more expensive than anything else you are doing in your program.

CPU operations are on the order of nanoseconds. (10^-9)

Disk operations are on the order of milliseconds.  (10^-3)

References:

http://en.wikipedia.org/wiki/Instructions_per_second
http://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics

As soon as you start touching your swap space to simulate virtual memory,
you've lost the battle.


We were trying not to leap to conclusions till we knew more.  Now we know
more.  If your system has much less RAM than can fit your dataset at once,
trying to read it all at once on your single machine, into an in-memory
buffer, is wrong.


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread ALAN GAULD
Forwarding to tutor list. Please use Reply All in responses.


From: Amal Thomas 
>To: Alan Gauld  
>Sent: Monday, 4 November 2013, 17:26
>Subject: Re: [Tutor] Load Entire File into memory
> 
>
>
>@Alan: Thanks.. I have checked both ways (reading line by line without
>loading into RAM, and loading the entire file into RAM and then reading
> line by line) for files of 2-3 GB.

OK, But 2-3G will nearly always live entirely in RAM on a modern computer.

> Only change which i have done is in the reading part , rest of the code was 
> kept same. 
> There was significant time difference. Please note that I started this thread 
> stating that 
> when I am using io.StringIO(f.read()) in code it uses a memory of almost 4-5 
> times the 
> input file size. Now using read() or readlines() it has reduced to 1.5 
> times... 

Yes a raw string is always going to be more efficient in memory use than 
StringIO.

> Also as I have mentioned I cant afford to run my code using 4-5 times memory. 
> Total resource available in my server is about 180 GB memory (approx 64 GB 
> RAM + 128GB swap). 

OK, There is a huge difference between having 100G of RAM and having 64G+128G 
swap.
swap is basically disk so if you are reading your data into memory and that 
memory is 
bouncing in and out of swap things will slow down by an order of magnitude. 
You need to try to optimise to use real RAM and minimise use of swap. 
> So before starting to process my 30-50GB input files I am keen to know the 
> best way.


Performance tuning is always a tricky topic and needs to be done on a
case-by-case basis. There are simply too many factors to try to say that
method (A) will always be faster than method (B). It depends on the nature
of the data, its source, your target data structures, your algorithms, your
output format and target etc., as well as the physical machines being
used... We need a lot more detail about the task before we can give any
solid advice. And even then you should verify it before assuming you are
done.

Alan G.


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Danny Yoo
On Mon, Nov 4, 2013 at 9:41 AM, Amal Thomas  wrote:

> @Steven: Thank you...My input data is basically AUGC and newlines... I
> would like to know about bytearray technique. Please suggest me some links
> or reference.. I will go through the profiler and check whether the code
> maintains linearity with the input files.
>
>
Hi Amal,

I suspect that what's been missing here throughout this thread is more
concrete information about the problem's background.  I would strongly
suggest we make sure that we understand the problem before making more
assumptions.


1.  What is the nature of the operation that you are doing on your data?
 Can you briefly discuss its details?  Does it involve random-access, or is
it a sequential operation?  Are the operations independent regardless of
what line you are on, or is there some kind of dependency across lines?
 Does it involve pattern matching, or...?  Are you maintaining some
in-memory data structure as you're walking through the file?

The reason we need to know this is because it can affect file access
patterns.  It may provide a hint as to whether or not you can avoid loading
the whole file into memory.  It may even affect whether or not you can
distribute your work among several computers.

Here's also why it's important to talk more about what the problem is
trying to solve.  Your question has been assuming that the dominating
factor in your program's runtime is the access of your data, and that
loading the entire file into memory will improve performance.   But I see
no evidence to support that assumption yet.  How do I know the time isn't
actually being spent paging in virtual memory, for example, due to something
else in your program's operations?  In that case, trying to load the file
entirely into memory would be counterproductive.


2.  What is the format of your input data?  You mention it is AUGC and
newlines, but more details would be really helpful.

Why is it line-oriented, for example?  I mean that as a serious question.
 Is it significant?  Is it a FASTA file?  Is it some kind of homebrewed
format?

Please be as specific as you can be here: you may be duplicating effort
that folks who have spent _years_ on sequence-reading libraries have
already done for you.  Specifically, you might be able to reuse Biopython's
libraries for sequence IO.

http://biopython.org/wiki/SeqIO
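For instance, a minimal sketch (assuming Biopython is installed and the
input really is FASTA; the file name here is made up):

from Bio import SeqIO

for record in SeqIO.parse("reads.fasta", "fasta"):   # "reads.fasta" is a made-up name
    print(record.id, len(record.seq))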

By trying to cook up file parsing by yourself, you may be making a mistake.
 For example, there might be issues in Python 3 due to Unicode encodings:


http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

which might contribute to an unexpected increase in the size of a string's
memory representation.  Hard to say, since it depends on a host of factors.
 But knowing that, other folks have probably encountered and solved this
problem already.  Concretely, I'm pretty sure Biopython's SeqIO does the
Right Thing in terms of reading files in binary mode and reading the line
contents as bytes, as opposed to regular strings, and representing the
sequence in some memory-efficient way.
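As a sketch of just the binary-mode idea (standard library only, nothing
Biopython-specific):

with open("output.txt", "rb") as f:
    for line in f:
        line = line.rstrip(b"\n")   # each line is a bytes object, not str
        # ... process the bytes here ...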

At the very least, I know that they think about these kind of problems a
lot:

http://web.archiveorange.com/archive/v/5dAwXDMfufikePQqtPgx

Probably a lot more than us.  :P

So if it's possible, try to leverage what's already out there.  You should
almost certainly not be writing your own sequence-reading code.


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Dave Angel
On 4/11/2013 11:26, Amal Thomas wrote:

> @Dave: thanks.. By the way I am running my codes on a server with about
> 100GB ram but I cant afford my code to use 4-5 times the size of the text
> file. Now I am using  read() / readlines() , these seems to be more
> efficient in memory usage than io.StringIO(f.read()).
>

Sorry I misspoke about read() on a large file.  I was confusing it with
something else.

However, note that in any environment if you have a large buffer, and
you force the system to copy that large buffer, you'll be using
(temporarily at least) twice the space.  And usually the original can't
be freed, for various technical reasons.

The real question is how you're going to be addressing the data, and
what constraints are on that data.

Since you think you need it all in memory, you clearly are planning to
access it randomly. Since the data is apparently ASCII characters, and
you're running at least 3.3, you won't be paying the penalty if it turns
out to be strings.  But there may be alternate ways of encoding each
line which save space and/or make it faster to use.  One big buffer
holding an image of the whole file is likely to be one of the worst.

Are the lines variable length?  Do you ever deal randomly with a portion
of a line, or only the whole thing?  If a line is multiple ASCII
characters, is their order significant?  How many different symbols can
appear in a single line?  How many different ones in total (probably
excluding the newline)?  What's the average line length?

Each of these questions may lead to exploring different optimization
strategies.  But I've done enough speculating.


-- 
DaveA




Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Dave Angel
On Tue, 5 Nov 2013 02:53:41 +1100, Steven D'Aprano 
 wrote:

Dave, do you have a reference for that? As far as I can tell, read()
will read to EOF unless you open the file in non-blocking mode.


No. I must be just remembering something from another language. 
Sorry.


--
DaveA



Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Amal Thomas
@Steven: Thank you...My input data is basically AUGC and newlines... I
would like to know about bytearray technique. Please suggest me some links
or reference.. I will go through the profiler and check whether the code
maintains linearity with the input files.




> > It's probably worth putting some more timing statements into your code
> > to see where the time is going because it's not the reading from the
> > disk that's the problem.
>
> The first thing I would do is run the code on three smaller sample
> files:
>
> 50MB
> 100MB
> 200MB
>
> The time taken should approximately double as you double the size of the
> file: say it takes 2 hours to process the 50MB file, 4 hours for the
> 100MB file and 8 hours for the 200 MB file, that's linear performance
> and isn't too bad.
>
> But if performance isn't linear, say 2 hours, 4 hours, 16 hours, then
> you're in trouble and you *desperately* need to reconsider the algorithm
> being used. Either that, or just accept that this is an inherently slow
> calculation and it will take a week or two.
>
> Amal, another thing you should try is use the Python profiler on your
> code (again, on a smaller sample file). The profiler will show you where
> the time is being spent.
>
> Unfortunately the profiler may slow your code down, so it is important
> to use it on manageable sized data. The profiler is explained here:
>
> http://docs.python.org/3/library/profile.html
>
> If you need any help, don't hesitate to ask.
>
>
> > >trying to optimize my code to get the outputs in less time and memory
> > >efficiently.
> >
> > Memory efficiency is easy, do it line by line off the disk.
>
> This assumes that you can process one line at a time, sequentially. I
> expect that is not the case.
>
>
> --
> Steven



-- 


AMAL THOMAS


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Steven D'Aprano
On Mon, Nov 04, 2013 at 04:54:16PM +, Alan Gauld wrote:
> On 04/11/13 16:34, Amal Thomas wrote:
> >@Joel: The code runs for weeks..input file which I have to process in
> >very huge(in 50 gbs). So its not a matter of hours.its matter of days
> >and weeks..
> 
> OK, but that's not down to reading the file from disk.
> Reading a 50G file will only take a few minutes if you have enough RAM, 
> which seems to be the case.

Not really. There is still some uncertainty (at least in my mind!). For 
instance, I assume that Amal doesn't have sole access to the server. So 
there could be another dozen users all trying to read 50GB files at 
once, in a machine with only 100GB of memory... 

Once the server starts paging, performance will plummett.


> If it's taking days/weeks you must be doing 
> some incredibly time consuming processing.

Well, yes, it's biology :-)



> It's probably worth putting some more timing statements into your code 
> to see where the time is going because it's not the reading from the 
> disk that's the problem.

The first thing I would do is run the code on three smaller sample 
files:

50MB
100MB
200MB

The time taken should approximately double as you double the size of the 
file: say it takes 2 hours to process the 50MB file, 4 hours for the 
100MB file and 8 hours for the 200 MB file, that's linear performance 
and isn't too bad.

But if performance isn't linear, say 2 hours, 4 hours, 16 hours, then 
you're in trouble and you *desperately* need to reconsider the algorithm 
being used. Either that, or just accept that this is an inherently slow 
calculation and it will take a week or two.
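A rough sketch of such a scaling test (process_file is a stand-in for your
real analysis, and the sample file names are made up):

import time

def process_file(path):       # replace with your actual processing
    with open(path) as f:
        for line in f:
            pass

for path in ("sample_50MB.txt", "sample_100MB.txt", "sample_200MB.txt"):
    start = time.perf_counter()
    process_file(path)
    print(path, time.perf_counter() - start, "seconds")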

Amal, another thing you should try is use the Python profiler on your 
code (again, on a smaller sample file). The profiler will show you where 
the time is being spent.

Unfortunately the profiler may slow your code down, so it is important 
to use it on manageable sized data. The profiler is explained here:

http://docs.python.org/3/library/profile.html
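As a minimal example of invoking it (again with a stand-in process_file and
a made-up sample file name):

import cProfile

def process_file(path):       # stand-in for the real analysis code
    with open(path) as f:
        for line in f:
            pass

cProfile.run("process_file('sample_50MB.txt')", sort="cumulative")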

If you need any help, don't hesitate to ask.


> >trying to optimize my code to get the outputs in less time and memory
> >efficiently.
> 
> Memory efficiency is easy, do it line by line off the disk.

This assumes that you can process one line at a time, sequentially. I 
expect that is not the case.


-- 
Steven


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Alan Gauld

On 04/11/13 16:34, Amal Thomas wrote:

@Joel: The code runs for weeks..input file which I have to process in
very huge(in 50 gbs). So its not a matter of hours.its matter of days
and weeks..


OK, but that's not down to reading the file from disk.
Reading a 50G file will only take a few minutes if you have enough RAM, 
which seems to be the case. If it's taking days/weeks you must be doing 
some incredibly time consuming processing.


It's probably worth putting some more timing statements into your code 
to see where the time is going because it's not the reading from the 
disk that's the problem.



trying to optimize my code to get the outputs in less time and memory
efficiently.


Memory efficiency is easy: do it line by line off the disk.
Time efficiency is most likely down to your processing algorithm
if we are talking about days.


--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos



Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Steven D'Aprano
On Mon, Nov 04, 2013 at 11:27:52AM -0500, Joel Goldstick wrote:

> If you are new to python why are you so concerned about the speed of
> your code.

Amal is new to Python but he's not new to biology, he's a 4th year 
student. With a 50GB file, I expect he is analysing something to do with 
DNA sequencing, which depending on exactly what he is trying to do could 
involve O(N) or even O(N**2) algorithms. An O(N) algorithm on a 50GB 
file, assuming 100,000 steps per second, will take over 5 days to 
complete. An O(N**2) algorithm, well, it's nearly unthinkable: nearly 
800 million years. You *really* don't want O(N**2) algorithms with big 
data.
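The back-of-envelope arithmetic, for anyone who wants to check it (the
100,000 steps per second is of course just an assumed rate):

N = 50 * 10**9               # characters in a 50 GB file
rate = 100000                # assumed processing steps per second

print(N / rate / 86400)              # O(N): roughly 5.8 days
print(N**2 / rate / 86400 / 365)     # O(N**2): roughly 8e8 years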

I would expect that with a big DNA sequencing problem, running time 
would be measured in days rather than minutes or hours. So yes, this is 
probably a case where optimizing for speed is not premature.

We really don't know enough about his problem to advise him on how to 
speed it up. If the data file is guaranteed to be nothing but GCTA 
bases, and newlines, it may be better to read the data file into memory 
as a bytearray rather than a string. Especially if he needs to modify it 
in place. But this is getting into some fairly advanced territory, I 
wouldn't like to predict what will be faster without testing on real 
data.
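For reference, a sketch of the bytearray approach (just to show the shape of
it, not a recommendation):

with open("output.txt", "rb") as f:
    data = bytearray(f.read())      # one mutable byte per character

data[0:4] = b"GGGG"                 # in-place edit, no full copy made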


-- 
Steven


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Amal Thomas
@Steven: Thanks... Right now I can't access the files. I will send you the
output when I can.

--
Please try this little bit of code, replacing the file name with the
actual name of your 50GB data file:

import os
filename = "YOUR FILE NAME HERE"
print("File size:", os.stat(filename).st_size)
f = open(filename)
content = f.read()
print("Length of content actually read:", len(content))
print("Current file position:", f.tell())
f.close()


and send us the output.

-- 

AMAL THOMAS


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Amal Thomas
@Joel: The code runs for weeks.. the input file which I have to process is
very huge (about 50 GB). So it's not a matter of hours, it's a matter of days
and weeks.. I was using C++. Recently I switched over to Python. I am trying
to optimize my code to get the outputs in less time and to use memory
efficiently.


On Mon, Nov 4, 2013 at 9:57 PM, Joel Goldstick wrote:

>
>
>
> If you are new to python why are you so concerned about the speed of
> your code.  You never say how long it takes.  Do these files take
> hours to process? or minutes or seconds?I suggest you write your
> code in a way that is clear and understandable, then try to optimize
> it if necessary.
>
>

-- 


AMAL THOMAS


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Steven D'Aprano
On Mon, Nov 04, 2013 at 07:00:29PM +0530, Amal Thomas wrote:
> Yes I have found that after loading to RAM and then reading lines by lines
> saves a huge amount of time since my text files are very huge.

This is remarkable, and quite frankly incredible. I wonder whether you 
are misinterpreting what you are seeing? Under normal circumstances, 
with all but quite high-end machines, trying to read a 50GB file into 
memory all at once will be effectively impossible. Suppose your computer 
has 24GB of RAM. The OS and other running applications can be expected 
to use some of that, but even ignoring this, it is impossible to read a 
50GB file into memory all at once with only 24GB.

What I would expect is that unless you have *at least* double the amount 
of memory as the size of the file (in this case, at least 100GB), either 
Python will give you a MemoryError, or the operating system will try 
paging memory into swap-space, which is *painfully* slow. I've been 
in the situation where I accidently tried reading a file bigger than the 
installed RAM, and it ran overnight (14+ hours), locked up and stopped 
responding, and I finally had to unplug the power and restart the 
machine.

So unless you have 100+ GB in your computer, which would put it in 
seriously high-end server class, I find it difficult to believe that 
you are actually reading the entire file into memory.

Please try this little bit of code, replacing the file name with the 
actual name of your 50GB data file:

import os
filename = "YOUR FILE NAME HERE"
print("File size:", os.stat(filename).st_size)
f = open(filename)
content = f.read()
print("Length of content actually read:", len(content))
print("Current file position:", f.tell())
f.close()


and send us the output.

Thanks,



-- 
Steven


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Joel Goldstick
 "I am new to python. I am working in computational biology and I have
to deal with text files of huge size. I know how to read line by line
from a text file. I want to know the best method in  python3 to load
the enire file into ram and do the operations.(since this saves time)"


If you are new to Python, why are you so concerned about the speed of
your code?  You never say how long it takes.  Do these files take
hours to process? Or minutes, or seconds?  I suggest you write your
code in a way that is clear and understandable, then try to optimize
it if necessary.


>



-- 
Joel Goldstick
http://joelgoldstick.com


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Amal Thomas
@Dave: thanks.. By the way, I am running my code on a server with about
100 GB RAM, but I can't afford my code to use 4-5 times the size of the text
file. Now I am using read() / readlines(); these seem to be more
efficient in memory usage than io.StringIO(f.read()).
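One way to put a number on that kind of claim is to check the peak resident
set size after each variant, e.g. (a Linux-specific sketch; ru_maxrss is
reported in KiB there):

import resource

with open("output.txt") as f:
    content = f.read()

peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss   # KiB on Linux
print(peak_kib / 1024, "MiB peak RSS")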


On Mon, Nov 4, 2013 at 9:23 PM, Steven D'Aprano  wrote:

> On Mon, Nov 04, 2013 at 02:48:11PM +, Dave Angel wrote:
>
> > Now I understand.  Processing line by line is slower because it actually
> > reads the whole file.  The code you showed earlier:
> >
> > >I am currently using this method to load my text file:
> > > f = open("output.txt")
> > > content=io.StringIO(f.read())
> > > f.close()
> > >   But I have found that this method uses 4 times the size of text file.
> >
> > will only read a tiny portion of the file.  You don't have any loop on
> > the read() statement, you just read the first buffer full. So naturally
> > it'll be much faster.
>
>
> Dave, do you have a reference for that? As far as I can tell, read()
> will read to EOF unless you open the file in non-blocking mode.
>
> http://docs.python.org/3/library/io.html#io.BufferedIOBase.read
>
>
> > I am of course assuming you don't have a machine with 100+ gig of RAM.
>
> There is that, of course. High-end servers can have multiple hundreds of
> GB of RAM, but desktop and laptop machines rarely have anywhere near
> that.
>
>
> --
> Steven
>



-- 



AMAL THOMAS
Fourth Year Undergraduate Student
Department of Biotechnology
IIT KHARAGPUR-721302


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Steven D'Aprano
On Mon, Nov 04, 2013 at 02:48:11PM +, Dave Angel wrote:

> Now I understand.  Processing line by line is slower because it actually
> reads the whole file.  The code you showed earlier:
> 
> >I am currently using this method to load my text file:
> > f = open("output.txt")
> > content=io.StringIO(f.read())
> > f.close()
> >   But I have found that this method uses 4 times the size of text file.
> 
> will only read a tiny portion of the file.  You don't have any loop on
> the read() statement, you just read the first buffer full. So naturally
> it'll be much faster.


Dave, do you have a reference for that? As far as I can tell, read()
will read to EOF unless you open the file in non-blocking mode.

http://docs.python.org/3/library/io.html#io.BufferedIOBase.read


> I am of course assuming you don't have a machine with 100+ gig of RAM.

There is that, of course. High-end servers can have multiple hundreds of
GB of RAM, but desktop and laptop machines rarely have anywhere near 
that.


-- 
Steven



Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Dave Angel
On 4/11/2013 09:04, Amal Thomas wrote:

> @William:
> Thanks,
>
> My Line size varies from 40 to 550 characters. Please note that text file
> which I have to process is in gigabytes ( approx 50 GB ) . This was the
> code which i used to process line by line without loading into memory.

Now I understand.  Processing line by line is slower because it actually
reads the whole file.  The code you showed earlier:

>I am currently using this method to load my text file:
> f = open("output.txt")
> content=io.StringIO(f.read())
> f.close()
>   But I have found that this method uses 4 times the size of text file.

will only read a tiny portion of the file.  You don't have any loop on
the read() statement, you just read the first buffer full. So naturally
it'll be much faster.

I am of course assuming you don't have a machine with 100+ gig of RAM.


By the way, would you please stop posting html messages to a text
newsgroup?  They come out completely blank on my other newsreader
(except for the Tutor maillist trailer, which IS text), and I end up
having to fire up this one just to read your messages. But even if you
don't care about me, you should realize that messages sent as html as
very frequently garbled when they're interpreted as text.

-- 
DaveA




Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Amal Thomas
@William:
Thanks,

My line size varies from 40 to 550 characters. Please note that the text file
which I have to process is gigabytes in size (approx 50 GB). This was the
code which I used to process line by line without loading into memory.

for lines in open('uniqname.txt'):
    ...

On Mon, Nov 4, 2013 at 7:16 PM, William Ray Wing  wrote:

> On Nov 4, 2013, at 8:30 AM, Amal Thomas  wrote:
> How long are the lines in your file?  In particular, are they many
> hundreds or thousands of characters long, or are they only few hundred
> characters, say 200 or less?
>
> Unless they are so long as to exceed the normal buffer size of your OS's
> read-ahead buffer, I strongly suspect that the big time sink in your
> attempt to read line-by-line was some inadvertent inefficiency that you
> introduced.  Normally, when reading from a text file, python buffers the
> reads (or uses the host OS buffering).  Those reads pull in huge chunks of
> text WAY ahead of where the actual python processing is going on, and are
> VERY efficient.
>
> -Bill




-- 

AMAL THOMAS


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Amal Thomas
Hi,
@Peter:
I have checked the execution time manually as well as through my code.
During execution of my code, at the start I stored my initial time (start
time) in a variable, and at the end calculated the time taken to run the
code = end time - start time. There was a significant difference in time.

Thanks,

On Mon, Nov 4, 2013 at 7:11 PM, Peter Otten <__pete...@web.de> wrote:

> Amal Thomas wrote:
>
> > Yes I have found that after loading to RAM and then reading lines by
> lines
> > saves a huge amount of time since my text files are very huge.
>
> How exactly did you find out? You should only see a speed-up if you iterate
> over the data at least twice.
>



-- 



AMAL THOMAS
Fourth Year Undergraduate Student
Department of Biotechnology
IIT KHARAGPUR-721302


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread William Ray Wing
On Nov 4, 2013, at 8:30 AM, Amal Thomas  wrote:

> Yes I have found that after loading to RAM and then reading lines by lines 
> saves a huge amount of time since my text files are very huge.
> 

[huge snip]

> -- 
> AMAL THOMAS
> Fourth Year Undergraduate Student
> Department of Biotechnology
> IIT KHARAGPUR-721302

How long are the lines in your file?  In particular, are they many hundreds or 
thousands of characters long, or are they only a few hundred characters, say 200 
or less?

Unless they are so long as to exceed the normal buffer size of your OS's 
read-ahead buffer, I strongly suspect that the big time sink in your attempt to 
read line-by-line was some inadvertent inefficiency that you introduced.  
Normally, when reading from a text file, python buffers the reads (or uses the 
host OS buffering).  Those reads pull in huge chunks of text WAY ahead of where 
the actual python processing is going on, and are VERY efficient.

-Bill


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Peter Otten
Amal Thomas wrote:

> Yes I have found that after loading to RAM and then reading lines by lines
> saves a huge amount of time since my text files are very huge.

How exactly did you find out? You should only see a speed-up if you iterate 
over the data at least twice.



Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Amal Thomas
Yes, I have found that loading to RAM and then reading line by line
saves a huge amount of time, since my text files are very huge.


On Mon, Nov 4, 2013 at 6:46 PM, Alan Gauld wrote:

> On 04/11/13 13:06, Amal Thomas wrote:
>
>  Present code:
>>
>>
>> f = open("output.txt")
>> content=f.read().split('\n')
>> f.close()
>>
>
> If your objective is to save time, then you should replace this with
> f.readlines() which will save you reprocesasing the entire file to remove
> the newlines.
>
>  for lines in content:
>>     ...
>> content.clear()
>>
>
> But if you are processing line by line what makes you think that reading
> the entire file into RAM and then reprocessing it is faster than reading it
> line by line?
>
> Have you tried that on aqnother file and measutred any significant
> improvement? There are times when reading into RAM is faster but I'm not
> sure this will be one of them.
>
> for line in f:
>process line
>
> may be your best bet.
>
>> f = open("output.txt")
>> content=io.StringIO(f.read())
>> f.close()
>> for lines in content:
>>     ...
>> content.close()
>>
>
> --
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> http://www.flickr.com/photos/alangauldphotos
>



-- 



AMAL THOMAS
Fourth Year Undergraduate Student
Department of Biotechnology
IIT KHARAGPUR-721302


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Alan Gauld

On 04/11/13 13:06, Amal Thomas wrote:


Present code:

f = open("output.txt")
content=f.read().split('\n')
f.close()


If your objective is to save time, then you should replace this with 
f.readlines() which will save you reprocesasing the entire file to 
remove the newlines.



for lines in content:
    ...
content.clear()


But if you are processing line by line what makes you think that reading 
the entire file into RAM and then reprocessing it is faster than reading 
it line by line?


Have you tried that on aqnother file and measutred any significant 
improvement? There are times when reading into RAM is faster but I'm not 
sure this will be one of them.


for line in f:
   process line

may be your best bet.


f = open("output.txt")
content=io.StringIO(f.read())
f.close()
for lines in content:
    ...
content.close()


--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos



Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Amal Thomas
Hi,

Thanks Alan.

Now I have made changes in code :

Present code:





*f = open("output.txt")content=f.read().split('\n') f.close()for lines in
content:*
*  *
*content.clear()*

Previous code:






*f = open("output.txt") content=io.StringIO(f.read()) f.close()for lines in
content:  *
*content.close()*


   Now I have found that memory use is roughly 1.5 times the size of the text
file. Previously it was around 4-5 times. It's a remarkable change. Waiting
for more suggestions.

Thanks,



On Mon, Nov 4, 2013 at 5:05 PM, Alan Gauld wrote:

> On 04/11/13 11:07, Amal Thomas wrote:
>
> I am currently using this method to load my text file:
>> f = open("output.txt")
>> content=io.StringIO(f.read())
>> f.close()
>>
>>   But I have found that this method uses 4 times the size of text file.
>>
>
> So why not use
>
>
> f = open("output.txt")
> content=f.read()
> f.close()
>
> And process the file as a raw string?
>
> Is there a reason for using the StringIO?
>
> --
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> http://www.flickr.com/photos/alangauldphotos
>



-- 



AMAL THOMAS
Fourth Year Undergraduate Student
Department of Biotechnology
IIT KHARAGPUR-721302


Re: [Tutor] Load Entire File into memory

2013-11-04 Thread Alan Gauld

On 04/11/13 11:07, Amal Thomas wrote:


   I am currently using this method to load my text file:
f = open("output.txt")
content=io.StringIO(f.read())
f.close()
  But I have found that this method uses 4 times the size of text file.


So why not use

f = open("output.txt")
content=f.read()
f.close()

And process the file as a raw string?

Is there a reason for using the StringIO?

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos



[Tutor] Load Entire File into memory

2013-11-04 Thread Amal Thomas
Hi,

 I am new to Python. I am working in computational biology and I have to
deal with text files of huge size. I know how to read line by line from a
text file. I want to know the best method in Python 3 to load the entire
file into RAM and do the operations (since this saves time).
  I am currently using this method to load my text file:


*f = open("output.txt")content=io.StringIO(f.read())f.close()*
 But I have found that this method uses 4 times the size of the text file (if
output.txt is 1 GB, total RAM usage of the code is approx 3.5 GB :( ).
Kindly suggest a better way to do this.

Working on
Python 3.3.1, Ubuntu 13.04 (Linux 3.8.0-29-generic x64)

Thanks

-- 


AMAL THOMAS