What happens when MapReduce (MR) produces input splits that don't align on HDFS block boundaries? I've read that MR tries to place splits near block boundaries to improve data locality, but isn't there always some slop where a record straddles a block boundary, so that the task needs an extra HDFS connection just to fetch the rest of the record from the next block? Does this have a measurable impact on performance? Are there file formats that try to enforce alignment of records with blocks?
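To make the straddling case concrete, here is a minimal Python sketch of how a line-oriented reader can assign each record to exactly one split. It follows the convention I understand Hadoop's LineRecordReader to use: a split that does not start at byte 0 skips up to the first newline, and every split reads past its own end to finish its last record. The function name `read_split` and the in-memory `bytes` input are assumptions made for illustration. This is not Hadoop code, and it says nothing about the performance cost asked about above.

```python
def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the newline-delimited records owned by the split [start, start + length).

    Hypothetical helper for illustration only; `data` stands in for the whole file.
    """
    end = start + length
    pos = start
    if start != 0:
        # The previous split owns the line that reaches into this one
        # (including a line that begins exactly at `start`), so skip past
        # the first newline found at or after `start`.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1

    records = []
    # Own every line that begins at or before `end`. The last such line may
    # extend beyond `end`; on HDFS that tail is the read into the next block.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop])
        pos = stop + 1
    return records
```

With these two rules every record is read exactly once regardless of where the split boundaries fall, and the only extra work is reading the tail of one record beyond the end of each split.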
- HDFS data and non-aligned splits John Lilley
- RE: HDFS data and non-aligned splits John Lilley
- Re: HDFS data and non-aligned splits Harsh J