Expanding on Alan’s post:

Files are intended to span many blocks, and a single file may be read by many 
mappers. For that to happen, the input is divided into input splits, which are 
cut along HDFS block boundaries.

If a unit of data within a file crosses an HDFS block boundary, part of that 
unit must be sent from the node holding one block (and its mapper) to the node 
holding the other. Take a CSV file, for example: the unit of data is a line, 
and transferring part of a line between boxes is no big deal.
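To put rough numbers on that, here is a minimal sketch of the arithmetic. It is 
only a simplified model, not how Hadoop's input formats are implemented, and 
the class and method names (RecordBoundary, bytesInLaterBlocks) are made up for 
illustration. It computes how many bytes of a record fall outside the block 
that holds the record's first byte.

```java
/** Simplified model of a record that may straddle an HDFS block boundary. */
final class RecordBoundary {
    static final long MB = 1024L * 1024L;

    /** Index of the block that contains the given byte offset. */
    static long blockIndex(long offset, long blockSize) {
        return offset / blockSize;
    }

    /**
     * Bytes of the record [start, start + length) that lie past the end of
     * the block containing start. Zero means the record fits in one block.
     */
    static long bytesInLaterBlocks(long start, long length, long blockSize) {
        long blockEnd = (blockIndex(start, blockSize) + 1) * blockSize; // exclusive
        long recordEnd = start + length;                                // exclusive
        return Math.max(0L, recordEnd - blockEnd);
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;

        // A 200-byte CSV line starting 50 bytes before the end of block 0.
        System.out.println(bytesInLaterBlocks(blockSize - 50, 200, blockSize)); // 150

        // A hypothetical 122 MB unit of data starting at offset 122 MB.
        System.out.println(bytesInLaterBlocks(122 * MB, 122 * MB, blockSize) / MB); // 116
    }
}
```

A few hundred bytes crossing the boundary is cheap; the same situation with a 
unit of data that is over a hundred MB is not.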

This changes a bit for ORC files, as the unit of data is a stripe. An ORC stripe 
is typically a few hundred MB. Without some additional logic, a substantial 
part of data locality would be lost; however, ORC has such additional logic. 
The stripe size of the ORC file should be set a few MB below the HDFS block 
size, with block padding enabled, to produce a 1:1 relationship between an ORC 
stripe and an HDFS block. How many stripes or blocks are "in" a single file is 
of no consequence so long as this 1:1 relationship is maintained.

Below is an example config for 128 MB HDFS blocks.

    Configuration writerConf = new Configuration();
    // other config
    OrcFile.WriterOptions writerOptions = OrcFile.writerOptions(writerConf);
    // pad so that a stripe does not straddle an HDFS block boundary
    writerOptions.blockPadding(true);
    // 122 MB: a few MB below the 128 MB block size
    writerOptions.stripeSize(122 * 1024 * 1024);
    // other options
    Writer writer = OrcFile.createWriter(path, writerOptions);
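
To illustrate what padding buys with these numbers, here is a rough sketch. It 
assumes every stripe comes out at exactly the configured size and that padding 
simply skips to the next block boundary when a stripe would not fit in the 
rest of the current block. The real ORC writer is more nuanced than this, and 
StripeAlignment and countCrossing are hypothetical names, not part of any 
library.

```java
/** Simplified model: how many stripes straddle an HDFS block boundary. */
final class StripeAlignment {
    static final long MB = 1024L * 1024L;

    /**
     * Lays out the given number of equally sized stripes back to back and
     * counts those whose first and last byte land in different blocks.
     * With padding on, a stripe that would not fit in the remainder of the
     * current block is moved to the start of the next block (assumption).
     */
    static int countCrossing(int stripes, long stripeSize, long blockSize, boolean padding) {
        long offset = 0;
        int crossing = 0;
        for (int i = 0; i < stripes; i++) {
            long remaining = blockSize - (offset % blockSize);
            if (padding && stripeSize <= blockSize && remaining < stripeSize) {
                offset += remaining; // skip ahead to the next block boundary
            }
            long firstBlock = offset / blockSize;
            long lastBlock = (offset + stripeSize - 1) / blockSize;
            if (firstBlock != lastBlock) {
                crossing++;
            }
            offset += stripeSize;
        }
        return crossing;
    }

    public static void main(String[] args) {
        long stripeSize = 122 * MB;
        long blockSize = 128 * MB;
        System.out.println(countCrossing(8, stripeSize, blockSize, false)); // 7
        System.out.println(countCrossing(8, stripeSize, blockSize, true));  // 0
    }
}
```

In this model, without padding all but the first of eight stripes cross a 
block boundary; with padding none do, at the cost of about 6 MB of unused 
space per block.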



Grant Overby

From: Alan Gates <alanfga...@gmail.com>
Reply-To: user@hive.apache.org
Date: Monday, April 27, 2015 at 2:05 PM
To: user@hive.apache.org
Subject: Re: ORC file across multiple HDFS blocks

to cross blocks and hence n
