Owen, cool. That is great. Thanks
Demai

On Tue, Apr 28, 2015 at 11:10 AM, Owen O'Malley <omal...@apache.org> wrote:

> You can also use the C++ reader to read a set of stripes. Look at
> ReaderOptions.range(offset, length), which selects the range of stripes to
> process in terms of bytes.
>
> .. Owen
>
> On Tue, Apr 28, 2015 at 11:02 AM, Demai Ni <nid...@gmail.com> wrote:
>
>> Alan and Grant,
>>
>> many thanks. Grant's comment is exactly the point I am exploring.
>>
>> A bit of background: I am working on an MPP way to read ORC files
>> through this C++ API (https://github.com/hortonworks/orc) by Owen and
>> team. The MPP mechanism uses one (or several) independent processes per
>> HDFS node, which work like client code reading ORC file(s). Currently,
>> the assignment of each process is scheduled at the ORC-file level, which
>> runs into the "loss of data locality" issue described by Grant. I didn't
>> realize that we could do the scheduling at stripe level. Good to know;
>> that surely makes sense.
>>
>> Demai
>>
>> On Tue, Apr 28, 2015 at 8:34 AM, Grant Overby (groverby) <grove...@cisco.com> wrote:
>>
>>> Expanding on Alan's post:
>>>
>>> Files are intended to span many blocks, and a single file may be read
>>> by many mappers. For a file to be read by many mappers, it goes through
>>> a process called input splitting, which divides the input around HDFS
>>> block boundaries.
>>>
>>> If a unit of data within a file crosses an HDFS block boundary, a
>>> portion of that unit must be sent from the node holding the block/mapper
>>> of one portion to the node holding the block/mapper of the other
>>> portion. Take a CSV file, for example: there the unit of data is a line,
>>> and transferring part of a line between boxes is no big deal.
>>>
>>> This changes a bit for ORC files, as the unit of data is a stripe. An
>>> ORC stripe is typically a few hundred MB.
>>> Without some additional logic, a substantial part of data locality
>>> would be lost; however, ORC has such additional logic. The stripe size
>>> of the ORC file should be set a few MB below the HDFS block size, and
>>> padding should be enabled, to produce a 1:1 relationship between an ORC
>>> stripe and an HDFS block. How many stripes or blocks are "in" a single
>>> file is of no consequence so long as this 1:1 relationship is
>>> maintained.
>>>
>>> Below is an example config for 128 MB HDFS blocks:
>>>
>>> Configuration writerConf = new Configuration();
>>> // other config
>>> OrcFile.WriterOptions writerOptions = OrcFile.writerOptions(writerConf);
>>> writerOptions.blockPadding(true);
>>> writerOptions.stripeSize(122 * 1024 * 1024);
>>> // other options
>>> Writer writer = OrcFile.createWriter(path, writerOptions);
>>>
>>> *Grant Overby*
>>> Software Engineer
>>> Cisco.com <http://www.cisco.com/>
>>> grove...@cisco.com
>>> Mobile: 865 724 4910
>>>
>>> From: Alan Gates <alanfga...@gmail.com>
>>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>>> Date: Monday, April 27, 2015 at 2:05 PM
>>> To: "user@hive.apache.org" <user@hive.apache.org>
>>> Subject: Re: ORC file across multiple HDFS blocks
>>>
>>> to cross blocks and hence n
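Grant's sizing advice above boils down to simple arithmetic. Below is a self-contained sketch of why a 122 MB stripe with padding enabled never crosses a 128 MB block boundary; the two sizes come from his example, but the `paddedStart` helper is hypothetical, not ORC's actual writer code:

```java
public class StripePadding {
    static final long BLOCK = 128L * 1024 * 1024;   // HDFS block size
    static final long STRIPE = 122L * 1024 * 1024;  // stripe size, a few MB below the block

    // Hypothetical model of block padding: if the room left in the current
    // block is smaller than a stripe, pad forward to the next block boundary
    // so the stripe starts there and fits entirely inside one block.
    static long paddedStart(long offset) {
        long remaining = BLOCK - (offset % BLOCK);
        return remaining < STRIPE ? offset + remaining : offset;
    }

    public static void main(String[] args) {
        long offset = 0;
        for (int stripe = 0; stripe < 8; stripe++) {
            long start = paddedStart(offset);
            long end = start + STRIPE;
            // The 1:1 relationship: start and last byte land in the same block.
            if (start / BLOCK != (end - 1) / BLOCK) {
                throw new AssertionError("stripe " + stripe + " crosses a block");
            }
            offset = end;
        }
        System.out.println("all stripes fit in single blocks");
    }
}
```

The padding cost is at most 6 MB per 128 MB block (under 5%), which is the price of keeping every stripe local to one node.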
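Owen's ReaderOptions.range(offset, length) pairs naturally with per-block scheduling for Demai's MPP case. Here is a hypothetical sketch of carving a file into one byte range per HDFS block, so each worker co-located with a block can hand its pair to the reader; only the `ReaderOptions.range` name comes from the thread, and the `ranges` helper is made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeScheduler {
    static final long BLOCK = 128L * 1024 * 1024; // HDFS block size

    // One {offset, length} pair per HDFS block. With the 1:1 stripe/block
    // layout, the range reader then processes exactly the stripes that
    // start inside the worker's local block.
    static List<long[]> ranges(long fileLength) {
        List<long[]> out = new ArrayList<>();
        for (long off = 0; off < fileLength; off += BLOCK) {
            out.add(new long[]{off, Math.min(BLOCK, fileLength - off)});
        }
        return out;
    }

    public static void main(String[] args) {
        // A 300 MB file splits into two full blocks and a 44 MB tail.
        for (long[] r : ranges(300L * 1024 * 1024)) {
            System.out.println("offset=" + r[0] + " length=" + r[1]);
        }
    }
}
```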