You can also use the C++ reader to read a set of stripes. Look at ReaderOptions.range(offset, length), which selects the range of stripes to process, specified in bytes.
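Owen's byte-range approach suggests a simple scheduling scheme for the MPP case discussed in this thread: carve the file into block-aligned (offset, length) pairs and hand one to each worker's reader. Below is a minimal sketch in Java (the language of the example later in the thread). `RangePlanner` and `planRanges` are hypothetical names, and the comment about stripe selection assumes the reader picks up every stripe whose start offset falls inside the given range, which is how I read Owen's description; check the reader's documentation before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

public class RangePlanner {
    // Split a file of fileLen bytes into (offset, length) ranges aligned to
    // blockSize, one range per HDFS block. Assuming the reader processes every
    // stripe whose start offset falls inside its range, each stripe is read by
    // exactly one worker even though ranges may cut through a stripe's tail.
    static List<long[]> planRanges(long fileLen, long blockSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long off = 0; off < fileLen; off += blockSize) {
            ranges.add(new long[] { off, Math.min(blockSize, fileLen - off) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // e.g. a 300 MB file with 128 MB blocks -> three ranges
        for (long[] r : planRanges(300L << 20, 128L << 20)) {
            System.out.println("offset=" + r[0] + " length=" + r[1]);
        }
    }
}
```

Each (offset, length) pair would then be passed to the per-node reader process, e.g. via ReaderOptions.range(offset, length) in the C++ API Owen mentions.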
.. Owen

On Tue, Apr 28, 2015 at 11:02 AM, Demai Ni <nid...@gmail.com> wrote:

> Alan and Grant,
>
> Many thanks. Grant's comment is exactly on the point I am exploring.
>
> A bit of background here. I am working on an MPP way to read ORC files
> through this C++ API (https://github.com/hortonworks/orc) by Owen and
> team. The MPP mechanism uses one (or several) independent processes per
> HDFS node, working like client code to read ORC file(s). Currently, the
> assignment of each process is scheduled at the ORC-file level, which
> runs into the "loss of data locality" issue described by Grant. I
> didn't realize that we could do the scheduling at stripe level. Good to
> know; that surely makes sense.
>
> Demai
>
> On Tue, Apr 28, 2015 at 8:34 AM, Grant Overby (groverby)
> <grove...@cisco.com> wrote:
>
>> Expanding on Alan's post:
>>
>> Files are intended to span many blocks, and a single file may be read
>> by many mappers. For a file to be read by many mappers, it goes
>> through a process called input splitting, which splits the input
>> around HDFS block boundaries.
>>
>> If a unit of data within a file crosses an HDFS block boundary, a
>> portion of that unit must be sent from the node holding the block and
>> mapper of one portion to the node holding the block and mapper of the
>> other. Take a CSV file, for example: there a unit of data is a line,
>> and transferring a portion of a line between boxes is no big deal.
>>
>> This changes a bit for ORC files, where the unit of data is a stripe.
>> An ORC stripe is typically a few hundred MB. Without some additional
>> logic, a substantial part of data locality would be lost; however, ORC
>> has such additional logic. The stripe size of the ORC file should be
>> set a few MB below the HDFS block size, with padding enabled, to
>> produce a 1:1 relationship between an ORC stripe and an HDFS block.
>> How many stripes or blocks are "in" a single file is of no consequence
>> so long as this 1:1 relationship is maintained.
>>
>> Below is an example config for 128 MB HDFS blocks:
>>
>>     Configuration writerConf = new Configuration();
>>     // other config
>>     OrcFile.WriterOptions writerOptions = OrcFile.writerOptions(writerConf);
>>     writerOptions.blockPadding(true);
>>     writerOptions.stripeSize(122 * 1024 * 1024);
>>     // other options
>>     Writer writer = OrcFile.createWriter(path, writerOptions);
>>
>> *Grant Overby*
>> Software Engineer, Cisco
>>
>> From: Alan Gates <alanfga...@gmail.com>
>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> Date: Monday, April 27, 2015 at 2:05 PM
>> To: "user@hive.apache.org" <user@hive.apache.org>
>> Subject: Re: ORC file across multiple HDFS blocks
>>
>> to cross blocks and hence n