Owen, cool. That is great. Thanks
Demai

On Tue, Apr 28, 2015 at 11:10 AM, Owen O'Malley <omal...@apache.org> wrote:

> You can also use the C++ reader to read a set of stripes. Look at
> ReaderOptions.range(offset, length), which selects the range of stripes to
> process in terms of bytes.
>
> .. Owen
>
> On Tue, Apr 28, 2015 at 11:02 AM, Demai Ni <nid...@gmail.com> wrote:
>
>> Alan and Grant,
>>
>> many thanks. Grant's comment is exactly the point I am exploring.
>>
>> A bit of background: I am working on an MPP way to read ORC files
>> through this C++ API (https://github.com/hortonworks/orc) by Owen and
>> team. The MPP mechanism uses one (or several) independent processes per
>> HDFS node, which work like client code reading ORC file(s). Currently,
>> the assignment of each process is scheduled at the ORC-file level, which
>> runs into the "loss of data locality" issue described by Grant. I didn't
>> realize that we could do the scheduling at stripe level. Good to know;
>> that surely makes sense.
>>
>> Demai
>>
>> On Tue, Apr 28, 2015 at 8:34 AM, Grant Overby (groverby) <grove...@cisco.com> wrote:
>>
>>> Expanding on Alan's post:
>>>
>>> Files are intended to span many blocks, and a single file may be read
>>> by many mappers. For a file to be read by many mappers, it goes through
>>> a process called input splitting, which divides the input around HDFS
>>> block boundaries.
>>>
>>> If a unit of data within a file crosses an HDFS block boundary, a
>>> portion of that unit must be sent from the node holding the block/mapper
>>> of one portion to the node holding the block/mapper of the other
>>> portion. Take a CSV file, for example: there the unit of data is a line,
>>> and transferring part of a line between boxes is no big deal.
>>>
>>> This changes a bit for ORC files, as the unit of data is a stripe. An
>>> ORC stripe is typically a few hundred MB.
>>> Without some additional logic, a substantial part of data locality
>>> would be lost; however, ORC has such additional logic. The stripe size
>>> of the ORC file should be set a few MB below the HDFS block size, and
>>> padding should be enabled, to produce a 1:1 relationship between an ORC
>>> stripe and an HDFS block. How many stripes or blocks are "in" a single
>>> file is of no consequence so long as this 1:1 relationship is
>>> maintained.
>>>
>>> Below is an example config for 128 MB HDFS blocks:
>>>
>>> Configuration writerConf = new Configuration();
>>> // other config
>>> OrcFile.WriterOptions writerOptions = OrcFile.writerOptions(writerConf);
>>> writerOptions.blockPadding(true);
>>> writerOptions.stripeSize(122 * 1024 * 1024);
>>> // other options
>>> Writer writer = OrcFile.createWriter(path, writerOptions);
>>>
>>> *Grant Overby*
>>> Software Engineer
>>> Cisco.com <http://www.cisco.com/>
>>> grove...@cisco.com
>>> Mobile: 865 724 4910
>>>
>>> From: Alan Gates <alanfga...@gmail.com>
>>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>>> Date: Monday, April 27, 2015 at 2:05 PM
>>> To: "user@hive.apache.org" <user@hive.apache.org>
>>> Subject: Re: ORC file across multiple HDFS blocks
>>>
>>> to cross blocks and hence n
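Grant's sizing advice above boils down to simple arithmetic. Below is a self-contained sketch of why a 122 MB stripe with padding enabled never crosses a 128 MB block boundary; the two sizes come from his example, but the `paddedStart` helper is hypothetical, not ORC's actual writer code:

```java
public class StripePadding {
    static final long BLOCK = 128L * 1024 * 1024;   // HDFS block size
    static final long STRIPE = 122L * 1024 * 1024;  // stripe size, a few MB below the block

    // Hypothetical model of block padding: if the room left in the current
    // block is smaller than a stripe, pad forward to the next block boundary
    // so the stripe starts there and fits entirely inside one block.
    static long paddedStart(long offset) {
        long remaining = BLOCK - (offset % BLOCK);
        return remaining < STRIPE ? offset + remaining : offset;
    }

    public static void main(String[] args) {
        long offset = 0;
        for (int stripe = 0; stripe < 8; stripe++) {
            long start = paddedStart(offset);
            long end = start + STRIPE;
            // The 1:1 relationship: start and last byte land in the same block.
            if (start / BLOCK != (end - 1) / BLOCK) {
                throw new AssertionError("stripe " + stripe + " crosses a block");
            }
            offset = end;
        }
        System.out.println("all stripes fit in single blocks");
    }
}
```

The padding cost is at most 6 MB per 128 MB block (under 5%), which is the price of keeping every stripe local to one node.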
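Owen's ReaderOptions.range(offset, length) pairs naturally with per-block scheduling for Demai's MPP case. Here is a hypothetical sketch of carving a file into one byte range per HDFS block, so each worker co-located with a block can hand its pair to the reader; only the `ReaderOptions.range` name comes from the thread, and the `ranges` helper is made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeScheduler {
    static final long BLOCK = 128L * 1024 * 1024; // HDFS block size

    // One {offset, length} pair per HDFS block. With the 1:1 stripe/block
    // layout, the range reader then processes exactly the stripes that
    // start inside the worker's local block.
    static List<long[]> ranges(long fileLength) {
        List<long[]> out = new ArrayList<>();
        for (long off = 0; off < fileLength; off += BLOCK) {
            out.add(new long[]{off, Math.min(BLOCK, fileLength - off)});
        }
        return out;
    }

    public static void main(String[] args) {
        // A 300 MB file splits into two full blocks and a 44 MB tail.
        for (long[] r : ranges(300L * 1024 * 1024)) {
            System.out.println("offset=" + r[0] + " length=" + r[1]);
        }
    }
}
```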