How is hadoop going to handle the next generation disks?

2011-04-07 Thread Edward Capriolo
I have a 0.20.2 cluster. I notice that our nodes with 2 TB disks waste tons of disk I/O doing a 'du -sk' of each data directory. Instead of 'du -sk', why not just do this with java.io.File? How is this going to work with 4 TB, 8 TB disks and up? It seems like calculating used and free disk space could …
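For context, the java.io.File approach the question hints at would still have to walk the whole tree to get per-directory usage; only partition-level totals come back from the API for free. A minimal sketch, assuming all the DataNode needs is the byte count under each configured data directory (the path below is a hypothetical example, not a real config value):

    import java.io.File;

    // Minimal sketch of the java.io.File alternative suggested above:
    // recursively sum file lengths under a data directory. Note that this
    // still stats every inode in the tree, so it is not obviously cheaper
    // than shelling out to 'du -sk'.
    public class DataDirUsage {

        // Recursively add up the length of every regular file under dir.
        static long usedBytes(File dir) {
            long total = 0;
            File[] entries = dir.listFiles();
            if (entries == null) {
                return 0;                 // not a directory, or I/O error
            }
            for (File f : entries) {
                if (f.isDirectory()) {
                    total += usedBytes(f);
                } else {
                    total += f.length();
                }
            }
            return total;
        }

        public static void main(String[] args) {
            File dataDir = new File("/data/1/dfs/dn");   // hypothetical data directory
            System.out.println("used bytes:  " + usedBytes(dataDir));
            // Partition-level numbers are cheap to get from the File API,
            // but they cover the whole filesystem, not just this directory.
            System.out.println("total space: " + dataDir.getTotalSpace());
            System.out.println("free space:  " + dataDir.getUsableSpace());
        }
    }

The recursive walk touches every inode just as 'du' does, which is why the rest of the thread turns to whether those inodes and dentries stay cached.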

Re: How is hadoop going to handle the next generation disks?

2011-04-08 Thread sridhar basam
How many files do you have per node? What I find is that most of my inodes/dentries are almost always cached, so even on a host with hundreds of thousands of files the 'du -sk' generally only uses high I/O for a couple of seconds. I am using 2 TB disks too. Sridhar On Fri, Apr …
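A rough way to see the cache effect Sridhar describes is to time the same 'du -sk' twice: on a box with enough free RAM, the second run should come back almost immediately because the dentries and inodes are already in memory. A small timing sketch, with a hypothetical data directory path:

    import java.io.IOException;

    // Run 'du -sk' twice against the same directory and compare wall-clock
    // times. With a warm dentry/inode cache the second run should be much
    // faster than the first.
    public class DuTiming {

        static long timeDu(String dir) throws IOException, InterruptedException {
            long start = System.currentTimeMillis();
            // du -sk prints a single line, so its output is tiny; send it
            // to the console rather than reading the pipe ourselves.
            Process p = new ProcessBuilder("du", "-sk", dir)
                    .inheritIO()
                    .start();
            p.waitFor();
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) throws Exception {
            String dataDir = "/data/1/dfs/dn";   // hypothetical data directory
            System.out.println("first run:  " + timeDu(dataDir) + " ms");
            System.out.println("second run: " + timeDu(dataDir) + " ms");
        }
    }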

Re: How is hadoop going to handle the next generation disks?

2011-04-08 Thread sridhar basam
BTW this is on systems which have a lot of RAM and aren't under high load. If you find that your system is evicting dentries/inodes from its cache, you might want to experiment with dropping vm.vfs_cache_pressure from its default so that they are preferred over the pagecache. At the extreme, setti…
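For reference, vm.vfs_cache_pressure defaults to 100; values below that bias the kernel toward keeping dentry and inode caches when it reclaims memory (changing it requires root, e.g. 'sysctl -w vm.vfs_cache_pressure=50'). A tiny sketch that only reads the current value, to show where the knob lives:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Read the current vfs_cache_pressure setting from /proc. Reading needs
    // no privileges; writing the value is done outside the JVM via sysctl.
    public class VfsCachePressure {
        public static void main(String[] args) throws IOException {
            try (BufferedReader r = new BufferedReader(
                    new FileReader("/proc/sys/vm/vfs_cache_pressure"))) {
                System.out.println("vm.vfs_cache_pressure = " + r.readLine());
            }
        }
    }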

Re: How is hadoop going to handle the next generation disks?

2011-04-08 Thread Edward Capriolo
On Fri, Apr 8, 2011 at 12:24 PM, sridhar basam wrote: > BTW this is on systems which have a lot of RAM and aren't under high load. > If you find that your system is evicting dentries/inodes from its cache, you > might want to experiment with dropping vm.vfs_cache_pressure from its default so > that …

Re: How is hadoop going to handle the next generation disks?

2011-04-08 Thread sridhar basam
On Fri, Apr 8, 2011 at 1:59 PM, Edward Capriolo wrote: > Right. Most inodes are always cached when: > 1) small disks > 2) light load. > But that is not the case with hadoop. > Making the problem worse: it seems like hadoop issues 'du -sk' for all disks at the same time. This pu…
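To make the scheduling point concrete, here is a sketch (my own illustration, not Hadoop's code) contrasting firing 'du -sk' on every data directory at once with walking the disks one at a time; the directory names are hypothetical:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Launching 'du -sk' on every data directory concurrently puts metadata
    // I/O on all spindles at the same moment; the loop at the bottom walks
    // one disk at a time instead.
    public class StaggeredDu {

        static void du(String dir) throws IOException, InterruptedException {
            // One line of output per run; let it go straight to the console.
            new ProcessBuilder("du", "-sk", dir).inheritIO().start().waitFor();
        }

        public static void main(String[] args) throws Exception {
            List<String> dataDirs = Arrays.asList(
                    "/data/1/dfs/dn", "/data/2/dfs/dn", "/data/3/dfs/dn");

            // "All at once": one du per disk, running concurrently.
            ExecutorService pool = Executors.newFixedThreadPool(dataDirs.size());
            for (final String dir : dataDirs) {
                pool.submit(new Runnable() {
                    public void run() {
                        try { du(dir); } catch (Exception ignored) { }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);

            // Staggered alternative: one disk at a time.
            for (String dir : dataDirs) {
                du(dir);
            }
        }
    }

Serializing (or otherwise staggering) the scans keeps only one spindle busy with metadata I/O at any moment, which is the behaviour the thread is asking for.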