Thanks for the update.

I actually knew projects that went 1GB+, partly for NN space, but also because as
disk & CPU performance went up, you got more throughput by reducing the amount of
time spent on task setup per MB of data. Even though your per-file nominal peak
bandwidth was halved for every doubling of block size, it seemed to work well for
2.5" disks, and presumably for SSDs.

Everyone was still scared of crossing the 2^31 byte barrier, out of fear of
being the first one to find the integer overflow.



On 17 February 2015 at 22:40:41, Colin P. McCabe 
(cmcc...@apache.org) wrote:

In the past, "block size" and "size of block N" were completely
separate concepts in HDFS.

The former was often referred to as "default block size" or "preferred
block size" or some such thing. Basically it was the point at which
we'd call it a day and move on to the next block, whenever any block
got to that point. "default block size" was pretty much always 128MB
or 256MB in Real Clusters (although sometimes Apache Parquet would set
it as high as 1GB). We got tired of people configuring ridiculously
small block sizes by accident so HDFS-4305 added
dfs.namenode.fs-limits.min-block-size.
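
For illustration, here's a minimal sketch of setting a per-file block size at
create time; the 256MB value and the /tmp path are just examples, though the
long blockSize overload of FileSystem.create is the API involved:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithBlockSize {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.dat");          // example path
        long blockSize = 256L * 1024 * 1024;               // 256MB, example value
        short replication = fs.getDefaultReplication(path);
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        // The blockSize passed here becomes the file's "preferred block size";
        // on HDFS it must be >= dfs.namenode.fs-limits.min-block-size.
        try (FSDataOutputStream out =
            fs.create(path, true, bufferSize, replication, blockSize)) {
          out.writeBytes("hello");
        }
      }
    }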

In the old world, the only block which could be smaller than the
"default block size" was the final block of a file. MR used default
block size as a guide to doing partitioning and we sort of ignored the
fact that the last block could be less than that.
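
As a rough sketch of that partitioning (this mirrors the min/max clamp that
FileInputFormat.computeSplitSize applies; the class name and values here are
illustrative, not the actual MR code):

    // Sketch: the file's block size is the split size unless overridden by the
    // configured minimum/maximum split sizes.
    public class SplitSizeSketch {
      static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
      }

      public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // e.g. a 128MB preferred block size
        // With default min=1 and max=Long.MAX_VALUE, splits come out at 128MB.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
      }
    }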

Now that HDFS-3689 has been added to branch-2, it is no longer true
that all the blocks are the same size except the last one. The
ramifications of this are still to be determined. dfs.blocksize will
still be an upper bound on block size, but it will no longer be a
lower bound.


That's going to complicate the semantics of append() then, isn't it? Not in a
bad way; it simply means the docs need updating.



To answer your specific question: in HDFS, FileStatus#getBlockSize
will return the "preferred block size," not the size of any specific
block. So it's totally possible that none of the blocks in the file
actually have the size returned in FileStatus#getBlockSize.
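
Here's a small sketch contrasting the two, using the public getFileStatus and
getFileBlockLocations calls; the path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeVsBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/example.dat");             // hypothetical path
        FileStatus st = fs.getFileStatus(p);
        // The "preferred" block size recorded for the file, not any block's length.
        System.out.println("preferred block size: " + st.getBlockSize());
        // The actual blocks, each of which may be shorter than the preferred size.
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
          System.out.println("block at offset " + b.getOffset()
              + " has length " + b.getLength());
        }
      }
    }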

The relevant code is here in FSDirectory.java:
> if (node.isFile()) {
>   final INodeFile fileNode = node.asFile();
>   size = fileNode.computeFileSize(snapshot);
>   replication = fileNode.getFileReplication(snapshot);
>   blocksize = fileNode.getPreferredBlockSize();
>   isEncrypted = (feInfo != null) ||
>       (isRawPath && isInAnEZ(INodesInPath.fromINode(node)));
> } else {
>   isEncrypted = isInAnEZ(INodesInPath.fromINode(node));
> }
...
> return new HdfsFileStatus(
>     ...
>     blocksize,
>     ...
> );

Probably s3 and the rest of the alternative FS gang should just return
the value of some configuration variable (possibly fs.local.block.size
or dfs.blocksize?). Even though "preferred block size" is a
completely bogus concept in s3, MapReduce and other frameworks still
use it to calculate splits. Since s3 never does local reads anyway,
there is no reason to prefer any block size over any other, except in
terms of dividing up the work.
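
Something like this sketch, say; the base class, config key and 32MB default
are placeholders rather than what s3a actually does:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RawLocalFileSystem;

    // Sketch only: a FileSystem subclass that reports a configurable
    // "preferred" block size purely so that split calculation has a
    // sensible number to work with.
    public class ConfiguredBlockSizeFs extends RawLocalFileSystem {
      @Override
      public long getDefaultBlockSize(Path f) {
        return getConf().getLong("fs.local.block.size", 32L * 1024 * 1024);
      }
    }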


It'll be a local value; I'm just making sure that there is a good one. And for all
filesystems we can mandate: blocksize > 0 for len > 0, but not that it is a fixed value.
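
A minimal sketch of such a contract check, assuming a JUnit 4 test against the
default filesystem; the test class, path and wording are hypothetical and the
actual HADOOP-11601 tests may differ:

    import static org.junit.Assert.assertTrue;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    public class TestBlockSizeContract {
      @Test
      public void testNonEmptyFileHasPositiveBlockSize() throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/contract-test.dat");       // hypothetical path
        try (FSDataOutputStream out = fs.create(p, true)) {
          out.writeBytes("some data");
        }
        FileStatus st = fs.getFileStatus(p);
        // The proposed contract: len > 0 implies getBlockSize() > 0.
        assertTrue("block size must be > 0 for a non-empty file",
            st.getBlockSize() > 0);
      }
    }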

regards,
Colin

On Mon, Feb 16, 2015 at 9:44 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> HADOOP-11601 tightens up the filesystem spec by saying "if len(file) > 0, 
> getFileStatus().getBlockSize() > 0"
>
> this is to stop filesystems (most recently s3a) returning 0 as a block size, 
> which then kills any analytics work that tries to partition the workload by 
> blocksize.
>
> I'm currently changing the markdown text to say
>
> MUST be >0 for a file size >0
> MAY be 0 for a file of size==0.
>
> + the relevant tests to check this.
>
> There's one thing I do want to understand from HDFS first: what about small
> files? That is: what does HDFS return as the blocksize if a file is smaller
> than its block size?
>
> -Steve
