Thanks for the update.
I actually knew projects that went 1GB+, partly for NN space, but also because,
as disk & CPU performance went up, you got more throughput by reducing the
amount of time spent on task setup per MB of data, even though your per-file
nominal peak bandwidth was halved for every doubling of block size.
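To make that tradeoff concrete, here's a rough back-of-the-envelope model (all
the numbers are illustrative assumptions, not measurements from any real
cluster): with a fixed per-task setup cost, bigger blocks spend a smaller
fraction of each task on setup, so effective per-task throughput climbs even as
the number of parallel readers per file drops.

  // Toy model: effective throughput per task for different block sizes.
  // setupSeconds and diskMBps are assumed values, purely for illustration.
  public class BlockSizeThroughput {
      public static void main(String[] args) {
          double setupSeconds = 5.0;   // assumed per-task scheduling/startup cost
          double diskMBps = 100.0;     // assumed sequential read bandwidth per task
          long[] blockSizesMB = {64, 128, 256, 512, 1024};
          for (long mb : blockSizesMB) {
              double readSeconds = mb / diskMBps;
              double effectiveMBps = mb / (setupSeconds + readSeconds);
              System.out.printf("block=%4d MB  effective per-task throughput=%.1f MB/s%n",
                      mb, effectiveMBps);
          }
      }
  }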
In the past, "block size" and "size of block N" were completely
separate concepts in HDFS.
The former was often referred to as "default block size" or "preferred
block size" or some such thing. Basically it was the point at which
we'd call it a day and move on to the next block, whenever any block
being written reached that size.
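A minimal sketch of the distinction, assuming a reachable filesystem and an
existing file at a made-up path /tmp/example: the preferred block size comes
from getFileStatus(), while the actual length of each block comes from
getFileBlockLocations().

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockSizes {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          FileStatus status = fs.getFileStatus(new Path("/tmp/example"));
          // "block size": the preferred/default size used when the file was written
          System.out.println("preferred block size = " + status.getBlockSize());
          // "size of block N": what each block actually holds
          for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
              System.out.println("block @" + loc.getOffset() + " length=" + loc.getLength());
          }
      }
  }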
HADOOP-11601 tightens up the filesystem spec by saying "if len(file) > 0,
getFileStatus().getBlockSize() > 0".
This is to stop filesystems (most recently s3a) returning 0 as a block size,
which then kills any analytics work that tries to partition the workload by
block size.
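A hedged illustration of the failure mode (this is a toy planner, not the
actual FileInputFormat code): anything that divides a file's length by its
block size either divides by zero or never advances when getBlockSize()
returns 0.

  public class SplitPlanner {
      // One split per block, last one possibly short.
      static long countSplits(long fileLen, long blockSize) {
          if (blockSize <= 0) {
              throw new IllegalArgumentException("block size must be > 0 for a non-empty file");
          }
          return (fileLen + blockSize - 1) / blockSize;
      }

      public static void main(String[] args) {
          System.out.println(countSplits(1_000_000_000L, 128L * 1024 * 1024)); // 8 splits
          try {
              countSplits(1_000_000_000L, 0); // what a block size of 0 leads to
          } catch (IllegalArgumentException e) {
              System.out.println("planning failed: " + e.getMessage());
          }
      }
  }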
I'm currently chang