Re: [Tutor] concurrent file reading using python

Steven D'Aprano Mon, 26 Mar 2012 15:23:33 -0700

Abhishek Pratap wrote:

Hi Guys



I want to utilize the power of cores on my server and read big files
(> 50Gb) simultaneously by seeking to N locations.

Yes, you have many cores on the server. But how many hard drives is each file on? If all the files are on one disk, then you will *kill* performance dead by forcing the drive to seek backwards and forwards:


seek to 12345678
read a block
seek to 9947500
read a block
seek to 5891124
read a block
seek back to 12345678 + 1 block
read another block
seek back to 9947500 + 1 block
read another block
...

The drive will spend most of its time seeking instead of reading.
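To put rough numbers on that pattern, here is a minimal sketch that only computes offsets and never touches a disk. The three starting offsets are the ones from the trace above; the 4096-byte block size and the names interleaved_offsets and head_travel are assumptions made up for illustration. It compares how far the read position jumps between consecutive reads in the round-robin order against a plain front-to-back read of the same number of blocks.

```python
from itertools import islice

BLOCK_SIZE = 4096  # assumed block size, purely illustrative


def interleaved_offsets(starts, block_size):
    """Yield byte offsets in the round-robin order shown in the trace."""
    step = 0
    while True:
        for start in starts:
            yield start + step * block_size
        step += 1


def head_travel(offsets):
    """Sum of the distances between consecutive read positions."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))


starts = [12345678, 9947500, 5891124]

# First six reads when three readers take turns on one disk.
interleaved = list(islice(interleaved_offsets(starts, BLOCK_SIZE), 6))

# The same number of blocks read front to back from a single position.
sequential = [starts[0] + i * BLOCK_SIZE for i in range(6)]

print(interleaved)
print(head_travel(interleaved), "bytes of travel when interleaved")
print(head_travel(sequential), "bytes of travel when sequential")
```

Byte distance is only a crude stand-in for real seek time, which depends on the drive and the filesystem layout, but it shows why the interleaved order is so much harder on a single spinning disk.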

Even if you have multiple hard drives in a RAID array, performance will depend strongly on the details of how it is configured (RAID1, RAID0, software RAID, hardware RAID, etc.) and how smart the controller is.

Chances are, though, that the controller won't be smart enough. Particularly if you have hardware RAID, which in my experience tends to be more expensive and less useful than software RAID (at least for Linux).

And what are you planning on doing with the files once you have read them? I don't know how much memory your server has got, but I'd be very surprised if you can fit the entire > 50 GB file in RAM at once. So you're going to read the files and merge the output... by writing them to the disk. Now you have the drive trying to read *and* write simultaneously.
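Since the whole file won't fit in memory anyway, the usual approach is to read it front to back in fixed-size blocks and process each block as it arrives. A minimal sketch, assuming a 1 MiB block size; read_blocks and count_newlines are hypothetical names, and counting newlines is just a stand-in for whatever processing you actually need:

```python
def read_blocks(path, block_size=1024 * 1024):
    """Yield successive blocks of a file, front to back."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:  # empty bytes object means end of file
                break
            yield block


def count_newlines(path, block_size=1024 * 1024):
    """Example consumer: count lines without holding the file in RAM."""
    return sum(block.count(b"\n")
               for block in read_blocks(path, block_size))
```

Only one block is held in memory at a time, and the disk is read in order, which is the access pattern a single drive handles best.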


TL;DR:

Tasks which are limited by disk IO are not made faster by using a faster CPU, since the bottleneck is disk access, not CPU speed.

Back in the Ancient Days when tape was the only storage medium, there were a lot of programs optimised for slow IO. Unfortunately this is pretty much a lost art -- although disk access is thousands or tens of thousands of times slower than memory access, it is so much faster than tape that people don't seem to care much about optimising disk access.

What I want to know is the best way to read a file concurrently. I
have read about file-handle.seek(),  os.lseek() but not sure if that's
the way to go. Any use cases would be of help.

Optimising concurrent disk access is a specialist field. You may be better off asking for help on the main Python list, comp.lang.python or [email protected], and hope somebody has some experience with this. But chances are very high that you will need to search the web for forums dedicated to concurrent disk access, and translate from whatever language(s) they are using to Python.
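For what it's worth, the mechanics of seek() themselves are simple; it's the performance that is hard. A minimal sketch of the mechanics only, with no claim that it will be faster than a plain sequential read: split_ranges and read_range are hypothetical helper names, and the ranges are raw byte ranges, so a line of text may straddle two of them.

```python
import os


def split_ranges(size, n):
    """Split [0, size) into n contiguous (offset, length) byte ranges."""
    base, extra = divmod(size, n)
    ranges = []
    offset = 0
    for i in range(n):
        # Spread the remainder over the first `extra` ranges.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def read_range(path, offset, length):
    """Open the file, seek to offset, and read up to length bytes."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_in_parts(path, n):
    """Read a file as n byte ranges, one after another (not in parallel)."""
    size = os.path.getsize(path)
    return [read_range(path, off, length)
            for off, length in split_ranges(size, n)]
```

Each worker would call read_range() with its own (offset, length) pair and its own file handle. Whether doing that concurrently helps or hurts depends on the storage underneath, for the reasons given above, so measure before committing to it.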



--
Steven

_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor