Thanks for the update. I actually knew projects that went to 1GB+ blocks, partly for NN space, but also because as disk & CPU performance went up, you got more throughput by reducing the amount of time spent on task setup per MB of data. Even though your per-file nominal peak bandwidth was halved for every doubling of block size, it seemed to work well for 2.5" disks, and presumably for SSD.
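A back-of-envelope sketch of that tradeoff, with assumed numbers (30s of per-task setup, 100 MB/s sequential read, both illustrative rather than measurements), just to show how bigger blocks amortise the fixed per-task cost:

  // Illustrative only: effective per-task throughput as block size grows,
  // assuming a fixed task-setup cost and a fixed sequential-read bandwidth.
  public class BlockSizeTradeoff {
    public static void main(String[] args) {
      double setupSeconds = 30.0;    // assumed fixed cost to launch a task
      double diskMBperSec = 100.0;   // assumed sequential read bandwidth
      for (long blockMB : new long[]{64, 128, 256, 512, 1024}) {
        double taskSeconds = setupSeconds + blockMB / diskMBperSec;
        System.out.printf("block=%4d MB -> %.1f MB/s effective per task%n",
            blockMB, blockMB / taskSeconds);
      }
    }
  }

The per-file peak bandwidth point cuts the other way: with twice the block size a file has half as many blocks, so half as many tasks can read it in parallel.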
Everyone was still scared of crossing the 2^31 byte barrier out of fear of being the first one to find the integer overflow.

On 17 February 2015 at 22:40:41, Colin P. McCabe (cmcc...@apache.org<mailto:cmcc...@apache.org>) wrote:

In the past, "block size" and "size of block N" were completely separate concepts in HDFS. The former was often referred to as "default block size" or "preferred block size" or some such thing. Basically it was the point at which we'd call it a day and move on to the next block, whenever any block got to that point. "Default block size" was pretty much always 128MB or 256MB in Real Clusters (although sometimes Apache Parquet would set it as high as 1GB). We got tired of people configuring ridiculously small block sizes by accident, so HDFS-4305 added dfs.namenode.fs-limits.min-block-size.

In the old world, the only block which could be smaller than the "default block size" was the final block of a file. MR used default block size as a guide to doing partitioning, and we sort of ignored the fact that the last block could be less than that.

Now that HDFS-3689 has been added to branch-2, it is no longer true that all the blocks are the same size except the last one. The ramifications of this are still to be determined. dfs.blocksize will still be an upper bound on block size, but it will no longer be a lower bound.

That's going to complicate the semantics of append() then, isn't it? Not in a bad way; it simply means the docs need updating.

To answer your specific question: in HDFS, FileStatus#getBlockSize will return the "preferred block size," not the size of any specific block. So it's totally possible that none of the blocks in the file actually have the size returned in FileStatus#getBlockSize. The relevant code is here in FSDirectory.java:

> if (node.isFile()) {
>   final INodeFile fileNode = node.asFile();
>   size = fileNode.computeFileSize(snapshot);
>   replication = fileNode.getFileReplication(snapshot);
>   blocksize = fileNode.getPreferredBlockSize();
>   isEncrypted = (feInfo != null) ||
>     (isRawPath && isInAnEZ(INodesInPath.fromINode(node)));
> } else {
>   isEncrypted = isInAnEZ(INodesInPath.fromINode(node));
> }
...
> return new HdfsFileStatus(
>   ...
>   blocksize,
>   ...
> );

Probably s3 and the rest of the alternative FS gang should just return the value of some configuration variable (possibly fs.local.block.size or dfs.blocksize?). Even though "preferred block size" is a completely bogus concept in s3, MapReduce and other frameworks still use it to calculate splits. Since s3 never does local reads anyway, there is no reason to prefer any block size over any other, except in terms of dividing up the work.

It'll be a local value, just making sure that there is a good one. And for all filesystems we can mandate: >0 for (len > 0), but not that it is a fixed value.

regards,
Colin

On Mon, Feb 16, 2015 at 9:44 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> HADOOP-11601 tightens up the filesystem spec by saying "if len(file) > 0,
> getFileStatus().getBlockSize() > 0"
>
> this is to stop filesystems (most recently s3a) returning 0 as a block size,
> which then kills any analytics work that tries to partition the workload by
> blocksize.
>
> I'm currently changing the markdown text to say
>
> MUST be >0 for a file size >0
> MAY be 0 for a file of size==0.
>
> + the relevant tests to check this.
>
> There's one thing I do want to understand from HDFS first: what about small
> files? That is: what does HDFS return as a blocksize if a file is smaller
> than its block size?
> -Steve
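For anyone following along: the reason a zero block size "kills any analytics work" is the split-size arithmetic. Paraphrasing the calculation FileInputFormat does (the shape of it, not the exact Hadoop source):

  // Split size is the reported block size clamped to [minSize, maxSize].
  long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

With blockSize == 0 and the usual minSize of 1 byte, the split size collapses to 1, so a 1GB file would be carved into around a billion splits; hence the "MUST be >0 for a file size >0" rule.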