You can also use the C++ reader to read a set of stripes. Look at ReaderOptions.range(offset, length), which selects the range of stripes to process, specified in bytes.
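Owen's byte-range approach suggests a simple scheduling scheme for the MPP case discussed in this thread: carve the file into block-aligned (offset, length) pairs and hand one to each worker's reader. Below is a minimal sketch in Java (the language of the example later in the thread). `RangePlanner` and `planRanges` are hypothetical names, and the comment about stripe selection assumes the reader picks up every stripe whose start offset falls inside the given range, which is how I read Owen's description; check the reader's documentation before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

public class RangePlanner {
    // Split a file of fileLen bytes into (offset, length) ranges aligned to
    // blockSize, one range per HDFS block. Assuming the reader processes every
    // stripe whose start offset falls inside its range, each stripe is read by
    // exactly one worker even though ranges may cut through a stripe's tail.
    static List<long[]> planRanges(long fileLen, long blockSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long off = 0; off < fileLen; off += blockSize) {
            ranges.add(new long[] { off, Math.min(blockSize, fileLen - off) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // e.g. a 300 MB file with 128 MB blocks -> three ranges
        for (long[] r : planRanges(300L << 20, 128L << 20)) {
            System.out.println("offset=" + r[0] + " length=" + r[1]);
        }
    }
}
```

Each (offset, length) pair would then be passed to the per-node reader process, e.g. via ReaderOptions.range(offset, length) in the C++ API Owen mentions.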
.. Owen

On Tue, Apr 28, 2015 at 11:02 AM, Demai Ni <nid...@gmail.com> wrote:

> Alan and Grant,
>
> Many thanks. Grant's comment is exactly on the point I am exploring.
>
> A bit of background here. I am working on an MPP way to read ORC files
> through this C++ API (https://github.com/hortonworks/orc) by Owen and
> team. The MPP mechanism uses one (or several) independent processes per
> HDFS node, working like client code to read ORC file(s). Currently, the
> assignment of each process is scheduled at the ORC-file level, which
> runs into the "loss of data locality" issue described by Grant. I
> didn't realize that we could do the scheduling at stripe level. Good to
> know; that surely makes sense.
>
> Demai
>
> On Tue, Apr 28, 2015 at 8:34 AM, Grant Overby (groverby)
> <grove...@cisco.com> wrote:
>
>> Expanding on Alan's post:
>>
>> Files are intended to span many blocks, and a single file may be read
>> by many mappers. For a file to be read by many mappers, it goes
>> through a process called input splitting, which splits the input
>> around HDFS block boundaries.
>>
>> If a unit of data within a file crosses an HDFS block boundary, a
>> portion of that unit must be sent from the node holding the block and
>> mapper of one portion to the node holding the block and mapper of the
>> other. Take a CSV file, for example: there a unit of data is a line,
>> and transferring a portion of a line between boxes is no big deal.
>>
>> This changes a bit for ORC files, where the unit of data is a stripe.
>> An ORC stripe is typically a few hundred MB. Without some additional
>> logic, a substantial part of data locality would be lost; however, ORC
>> has such additional logic. The stripe size of the ORC file should be
>> set a few MB below the HDFS block size, with padding enabled, to
>> produce a 1:1 relationship between an ORC stripe and an HDFS block.
>> How many stripes or blocks are "in" a single file is of no consequence
>> so long as this 1:1 relationship is maintained.
>>
>> Below is an example config for 128 MB HDFS blocks:
>>
>>     Configuration writerConf = new Configuration();
>>     // other config
>>     OrcFile.WriterOptions writerOptions = OrcFile.writerOptions(writerConf);
>>     writerOptions.blockPadding(true);
>>     writerOptions.stripeSize(122 * 1024 * 1024);
>>     // other options
>>     Writer writer = OrcFile.createWriter(path, writerOptions);
>>
>> *Grant Overby*
>> Software Engineer, Cisco
>>
>> From: Alan Gates <alanfga...@gmail.com>
>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> Date: Monday, April 27, 2015 at 2:05 PM
>> To: "user@hive.apache.org" <user@hive.apache.org>
>> Subject: Re: ORC file across multiple HDFS blocks
>>
>> to cross blocks and hence n