Re: [Tutor] concurrent file reading using python

2012-03-26 Thread Abhishek Pratap
Thanks Walter and Steven for the insight. I guess I will post my
question to the main Python mailing list and see if people have anything
to say.

-Abhi

On Mon, Mar 26, 2012 at 3:28 PM, Walter Prins  wrote:
> Abhi,
>
> On 26 March 2012 19:05, Abhishek Pratap  wrote:
>> I want to utilize the power of cores on my server and read big files
>> (> 50 GB) simultaneously by seeking to N locations. Process each
>> separate chunk and merge the output. Very similar to MapReduce
>> concept.
>>
>> What I want to know is the best way to read a file concurrently. I
>> have read about file-handle.seek() and os.lseek(), but I am not sure if that's
>> the way to go. Any use cases would be of help.
>
> Your idea won't work.  Reading from disk is not a CPU-bound process,
> it's an I/O bound process.  Meaning, the speed by which you can read
> from a conventional mechanical hard disk drive is not constrained by
> how fast your CPU is, but generally by how fast your disk(s) can read
> data from the disk surface, which is limited by the rotation speed and
> areal density of the data on the disk (and the seek time), and by how
> fast it can shovel the data down its I/O bus.  And *that* speed is
> still orders of magnitude slower than your RAM and your CPU.  So, in
> reality even just one of your cores will spend the vast majority of
> its time waiting for the disk when reading your 50GB file.  There's
> therefore __no__ way to make your file reading faster by increasing
> your __CPU cores__ -- the only way is by improving your disk I/O
> throughput.  You can for example stripe several hard disks together in
> RAID0 (but that increases the risk of data loss due to data being
> spread over multiple drives) and/or ensure you use a faster I/O
> subsystem (move to SATA3 if you're currently using SATA2 for example),
> and/or use faster hard disks (use 10,000 or 15,000 RPM instead of
> 7,200, or switch to SSD [solid state] disks.)  Most of these options
> will cost you a fair bit of money though, so consider these thoughts
> in that light.
>
> Walter
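
For concreteness, here is a minimal sketch of the seek-based chunking the
question describes, assuming a hypothetical line-oriented file at
/data/big.txt and a stand-in per-line computation; as explained above, the
workers still share one disk, so this only pays off if the per-chunk
processing itself is CPU-heavy.

    import os
    from multiprocessing import Pool

    PATH = "/data/big.txt"   # hypothetical input file
    N_CHUNKS = 8             # e.g. one chunk per core

    def process_chunk(byte_range):
        start, end = byte_range
        total = 0
        with open(PATH, "rb") as fh:
            fh.seek(start)
            if start != 0:
                fh.readline()              # skip the line straddling the boundary
            while fh.tell() <= end:
                line = fh.readline()
                if not line:               # end of file
                    break
                total += len(line)         # stand-in for real per-line work
        return total

    if __name__ == "__main__":
        size = os.path.getsize(PATH)
        step = size // N_CHUNKS
        ranges = [(i * step, (i + 1) * step if i < N_CHUNKS - 1 else size)
                  for i in range(N_CHUNKS)]
        with Pool(N_CHUNKS) as pool:
            print(sum(pool.map(process_chunk, ranges)))   # "merge" step

Each worker skips the partial line at the start of its byte range and reads a
little past the end of it, so every line is handled exactly once.
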
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] concurrent file reading using python

2012-03-26 Thread Steven D'Aprano

Abhishek Pratap wrote:

Hi Guys


I want to utilize the power of cores on my server and read big files
(> 50 GB) simultaneously by seeking to N locations.


Yes, you have many cores on the server. But how many hard drives is each file 
on? If all the files are on one disk, then you will *kill* performance dead by 
forcing the drive to seek backwards and forwards:


seek to 12345678
read a block
seek to 9947500
read a block
seek to 5891124
read a block
seek back to 12345678 + 1 block
read another block
seek back to 9947500 + 1 block
read another block
...

The drive will spend most of its time seeking instead of reading.

Even if you have multiple hard drives in a RAID array, performance will depend 
strongly on the details of how it is configured (RAID1, RAID0, software RAID, 
hardware RAID, etc.) and how smart the controller is.


Chances are, though, that the controller won't be smart enough. Particularly 
if you have hardware RAID, which in my experience tends to be more expensive 
and less useful than software RAID (at least for Linux).


And what are you planning on doing with the files once you have read them? I 
don't know how much memory your server has got, but I'd be very surprised if 
you can fit the entire > 50 GB file in RAM at once. So you're going to read 
the files and merge the output... by writing them to the disk. Now you have 
the drive trying to read *and* write simultaneously.


TL;DR:

Tasks which are limited by disk IO are not made faster by using a faster CPU, 
since the bottleneck is disk access, not CPU speed.


Back in the Ancient Days when tape was the only storage medium, there were a 
lot of programs optimised for slow IO. Unfortunately this is pretty much a 
lost art -- although disk access is thousands or tens of thousands of times 
slower than memory access, it is so much faster than tape that people don't 
seem to care much about optimising disk access.
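
One arrangement that keeps the disk access sequential, sketched here under
the assumption of a placeholder path and a stand-in parse() function, is to
have a single reader feed fixed-size chunks to a pool of worker processes;
the reads stay orderly while any CPU-heavy processing is spread over the
cores.

    from multiprocessing import Pool

    CHUNK_BYTES = 64 * 1024 * 1024      # 64 MB per read; tune to taste

    def parse(chunk):
        # Stand-in for the real per-chunk processing; note that a chunk
        # boundary may split a line, which real code would have to handle.
        return chunk.count(b"\n")

    def read_chunks(path):
        # Single sequential reader: the disk sees one long, orderly read.
        with open(path, "rb") as fh:
            while True:
                chunk = fh.read(CHUNK_BYTES)
                if not chunk:
                    break
                yield chunk

    if __name__ == "__main__":
        with Pool() as pool:
            # imap() consumes the generator lazily, so only a few chunks
            # are in memory (and in transit to the workers) at any time.
            print(sum(pool.imap(parse, read_chunks("/data/big.txt"))))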




What I want to know is the best way to read a file concurrently. I
have read about file-handle.seek() and os.lseek(), but I am not sure if that's
the way to go. Any use cases would be of help.


Optimising concurrent disk access is a specialist field. You may be better off 
asking for help on the main Python list, comp.lang.python or 
python-l...@python.org, and hoping somebody has some experience with this. But 
chances are very high that you will need to search the web for forums 
dedicated to concurrent disk access, and translate from whatever language(s) 
they are using to Python.



--
Steven

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] concurrent file reading using python

2012-03-26 Thread Prasad, Ramit
> I want to utilize the power of cores on my server and read big files
> (> 50 GB) simultaneously by seeking to N locations. Process each
> separate chunk and merge the output. Very similar to MapReduce
> concept.
> 
> What I want to know is the best way to read a file concurrently. I
> have read about file-handle.seek() and os.lseek(), but I am not sure if that's
> the way to go. Any use cases would be of help.
> 
> PS: I did find some links on Stack Overflow, but it was not clear to me whether
> I found the right solution.
>

Have you done any testing in this space? I would assume 
you would be memory/IO bound and not CPU bound. Using 
multiple cores would not help non-CPU bound tasks.

I would try to write an initial program that does what
you want without attempting to optimize, and then do some
profiling to see if you are waiting on the CPU
or if you are (as I suspect) waiting on the hard disk / memory.
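
A crude version of that check, as a sketch with a hypothetical do_work()
and a placeholder path: compare wall-clock time with CPU time, and if the
CPU time is only a small fraction of the wall time, the program is waiting
on I/O rather than on the processor.

    import time

    def do_work():
        with open("/data/big.txt", "rb") as fh:   # placeholder path
            for line in fh:
                pass                              # stand-in for real processing

    wall = time.perf_counter()
    cpu = time.process_time()
    do_work()
    print("wall: %.1f s   cpu: %.1f s"
          % (time.perf_counter() - wall, time.process_time() - cpu))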

Actually, if you only need small chunks of the file at
a time and you iterate over the file (for line in file-handle:)
instead of using file-handle.readlines(), you will
probably only be IO bound due to the way Python file
handling works.
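
To illustrate the difference, with "big.txt" as a placeholder name:
readlines() builds a list of every line in memory, while iterating over the
file object reads through a small buffer, one line at a time.

    with open("big.txt") as fh:
        all_lines = fh.readlines()   # the whole file is held in memory at once

    with open("big.txt") as fh:
        for line in fh:              # buffered, one line at a time
            pass                     # stand-in for per-line work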

But either way, test first then optimize. :)

Ramit


Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


[Tutor] concurrent file reading using python

2012-03-26 Thread Abhishek Pratap
Hi Guys


I want to utilize the power of cores on my server and read big files
(> 50 GB) simultaneously by seeking to N locations. Process each
separate chunk and merge the output. Very similar to MapReduce
concept.

What I want to know is the best way to read a file concurrently. I
have read about file-handle.seek() and os.lseek(), but I am not sure if that's
the way to go. Any use cases would be of help.

PS: I did find some links on Stack Overflow, but it was not clear to me whether
I found the right solution.


Thanks!
-Abhi
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor