Re: Theory question: good values for FileStatus.getBlockSize()

2015-02-20 Thread Steve Loughran
Thanks for the update. I actually knew projects that went 1GB+ partly for NN space, but also because, as disk & CPU performance went up, you got more throughput by reducing the amount of time spent on task setup per MB of data. Even though your per-file nominal peak bandwidth was halved for every d

Re: Theory question: good values for FileStatus.getBlockSize()

2015-02-17 Thread Colin P. McCabe
In the past, "block size" and "size of block N" were completely separate concepts in HDFS. The former was often referred to as "default block size" or "preferred block size" or some such thing. Basically it was the point at which we'd call it a day and move on to the next block, whenever any bloc

Theory question: good values for FileStatus.getBlockSize()

2015-02-16 Thread Steve Loughran
HADOOP-11601 tightens up the filesystem spec by saying "if len(file) > 0, getFileStatus().getBlockSize() > 0". This is to stop filesystems (most recently s3a) returning 0 as a block size, which then kills any analytics work that tries to partition the workload by block size. I'm currently chang
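
To illustrate why a zero block size is a problem for code that partitions by block size, here is a minimal sketch in plain Java. It does not use the Hadoop API: the class `BlockSizeSplits` and the method `splits` are hypothetical names made up for this example, and the partitioning loop is only an assumption about how a typical client might carve a file into block-sized ranges. The guard at the top mirrors the invariant quoted above (a non-empty file must report a block size greater than 0).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: splits a file into block-sized ranges.
public class BlockSizeSplits {

    /** Returns {offset, length} pairs that together cover a file of fileLen bytes. */
    static List<long[]> splits(long fileLen, long blockSize) {
        // Mirrors the spec invariant: len(file) > 0 implies blockSize > 0.
        // Without this check, blockSize == 0 would make the loop below never advance.
        if (fileLen > 0 && blockSize <= 0) {
            throw new IllegalArgumentException(
                "block size must be > 0 for a non-empty file, got " + blockSize);
        }
        List<long[]> result = new ArrayList<>();
        for (long offset = 0; offset < fileLen; offset += blockSize) {
            // The last range may be shorter than a full block.
            long length = Math.min(blockSize, fileLen - offset);
            result.add(new long[] {offset, length});
        }
        return result;
    }
}
```

With a 10-byte file and a block size of 4, this yields three ranges, the last one 2 bytes long. With a block size of 0 the unguarded loop would spin forever, which is the kind of failure the tightened spec is meant to rule out.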