Re: ORC file across multiple HDFS blocks

2015-04-28 Thread Demai Ni
Owen,

Cool. That is great. Thanks.

Demai

On Tue, Apr 28, 2015 at 11:10 AM, Owen O'Malley  wrote:

> You can also use the C++ reader to read a set of stripes. Look at the
> ReaderOptions.range(offset, length), which selects the range of stripes to
> process in terms of bytes.
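>
> A minimal sketch of the same idea from the Java side (assuming Hive
> 0.13's org.apache.hadoop.hive.ql.io.orc API; the path is a placeholder),
> enumerating stripes and then reading one byte range at a time:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hive.ql.io.orc.OrcFile;
> import org.apache.hadoop.hive.ql.io.orc.Reader;
> import org.apache.hadoop.hive.ql.io.orc.RecordReader;
> import org.apache.hadoop.hive.ql.io.orc.StripeInformation;
>
> Configuration conf = new Configuration();
> Path path = new Path("/data/file.orc");  // placeholder path
> Reader reader = OrcFile.createReader(path.getFileSystem(conf), path);
>
> // Each stripe reports its byte offset and length, so a scheduler can
> // hand individual stripes to the process running on the node that
> // hosts those bytes.
> for (StripeInformation stripe : reader.getStripes()) {
>   // Reads the stripes whose start falls inside this byte range;
>   // a null include array is assumed to select all columns.
>   RecordReader rows =
>       reader.rows(stripe.getOffset(), stripe.getLength(), null);
>   Object row = null;
>   while (rows.hasNext()) {
>     row = rows.next(row);  // process one row
>   }
>   rows.close();
> }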
>
> .. Owen
>
> On Tue, Apr 28, 2015 at 11:02 AM, Demai Ni  wrote:
>
>> Alan and Grant,
>>
>> Many thanks. Grant's comment is exactly on the point I am exploring.
>>
>> A bit of background: I am working on an MPP way to read ORC files
>> through this C++ API (https://github.com/hortonworks/orc) by Owen and
>> team. The MPP mechanism uses one (or several) independent processes per
>> HDFS node, each working like client code to read ORC file(s). Currently,
>> each process's assignment is scheduled at the ORC-file level, which runs
>> into the "loss of data locality" issue described by Grant. I didn't
>> realize that we can do the scheduling at stripe level. Good to know;
>> that surely makes sense.
>>
>> Demai
>>
>> On Tue, Apr 28, 2015 at 8:34 AM, Grant Overby (groverby) <
>> grove...@cisco.com> wrote:
>>
>>>  Expanding on Alan’s post:
>>>
>>>  Files are intended to span many blocks, and a single file may be read
>>> by many mappers. For a file to be read by many mappers, it goes through
>>> a process called input splitting, which splits the input around HDFS
>>> block boundaries.
>>>
>>>  If a unit of data within a file crosses an HDFS block boundary, a
>>> portion of that unit must be sent from the node holding the block/mapper
>>> of one portion to the node holding the block/mapper of the other. Take a
>>> CSV file, for example: there the unit of data is a line, and
>>> transferring part of a line between boxes is no big deal.
>>>
>>>  This changes a bit for ORC files, as the unit of data is a stripe. An
>>> ORC stripe is typically a few hundred MB. Without some additional logic,
>>> a substantial part of data locality would be lost; however, ORC has such
>>> logic: set the stripe size a few MB below the HDFS block size and enable
>>> padding to produce a 1:1 relationship between an ORC stripe and an HDFS
>>> block. How many stripes or blocks are "in" a single file is of no
>>> consequence so long as this 1:1 relationship is maintained.
>>>
>>>  Below is an example config for 128 MB HDFS blocks.
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.hive.ql.io.orc.OrcFile;
>>> import org.apache.hadoop.hive.ql.io.orc.Writer;
>>>
>>> Configuration writerConf = new Configuration();
>>> // ... other config ...
>>> OrcFile.WriterOptions writerOptions = OrcFile.writerOptions(writerConf);
>>> writerOptions.blockPadding(true);             // pad so stripes do not straddle blocks
>>> writerOptions.stripeSize(122 * 1024 * 1024);  // a few MB below the 128 MB block size
>>> // ... other options ...
>>> // assumes 'path' is the output Path for the file
>>> Writer writer = OrcFile.createWriter(path, writerOptions);
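>>>
>>> (Setting the stripe size a few MB below the 128 MB block leaves
>>> headroom so that, with padding enabled, a whole stripe always fits
>>> inside a single block.)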
>>>
>>>
>>>
Grant Overby
Software Engineer
Cisco.com <http://www.cisco.com/>
grove...@cisco.com
Mobile: 865 724 4910
>>>
>>>
>>>   From: Alan Gates 
>>> Reply-To: "user@hive.apache.org" 
>>> Date: Monday, April 27, 2015 at 2:05 PM
>>> To: "user@hive.apache.org" 
>>> Subject: Re: ORC file across multiple HDFS blocks
>>>
>>>  to cross blocks and hence n
>>>
>>
>>
>


ORC file across multiple HDFS blocks

2015-04-24 Thread Demai Ni
hi, Guys,

I am working on directly reading ORC files from an HDFS cluster, and hope
to leverage HDFS short-circuit local reads (
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html)
as much as possible.

According to the ORC design, each ORC file usually contains several
stripes, and each stripe defaults to 250MB for efficient reads from HDFS.
With that, an ORC file can easily reach GB size, consisting of several
HDFS blocks. There is a good chance that:
1) an ORC file spans several HDFS data nodes;
2) a stripe spans two HDFS blocks and lands on two different physical
nodes.

With this in mind, should I design my ORC files to
1) contain only one stripe?
2) ensure (either by a larger HDFS block size or a smaller stripe size)
that each ORC file contains only one HDFS block?

Does this look reasonable? Thanks.

Demai


load TPCH HBase tables through Hive

2015-03-02 Thread Demai Ni
hi, folks,

I am using the HBase integration feature of Hive (
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration) to load
the TPCH tables into HBase, with Hive 0.13 and HBase 0.98.6.

The load works well. However, as documented here:
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-KeyUniqueness,
the key-uniqueness requirement prevents me from loading all 'lineitem'
rows: the 'lineitem' table uses "L_ORDERKEY, L_LINENUMBER" as a compound
primary key, so if I map only 'L_ORDERKEY' to the HBase key (aka the row
key), many rows get overwritten.
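
To make the collision concrete, here is a minimal sketch with the HBase
0.98 client API (keys and values are illustrative) of what effectively
happens for two lineitem rows that share L_ORDERKEY:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "lineitem");

// Two lineitem rows with l_orderkey=1 but different l_linenumber values
// map to the same HBase row key when only L_ORDERKEY is the key.
Put lineOne = new Put(Bytes.toBytes("1"));
lineOne.add(Bytes.toBytes("l_linenumber"), Bytes.toBytes("val"), Bytes.toBytes("1"));
Put lineTwo = new Put(Bytes.toBytes("1"));
lineTwo.add(Bytes.toBytes("l_linenumber"), Bytes.toBytes("val"), Bytes.toBytes("2"));

table.put(lineOne);
table.put(lineTwo);  // same row key and column: this replaces lineOne's value
table.close();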

Any suggestions? Someone on this list must have gone through this
already. :-) Thanks.

BTW, here is my Hive DDL.

create table hbase_lineitem (
  l_orderkey bigint, l_partkey bigint, l_suppkey int, l_linenumber bigint,
  l_quantity double, l_extendedprice double, l_discount double,
  l_tax double, l_returnflag string, l_linestatus string,
  l_shipdate string, l_commitdate string, l_receiptdate string,
  l_shipinstruct string, l_shipmode string, l_comment string
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,l_partkey:val,l_suppkey:val,l_linenumber:val,l_quantity:val,l_extendedprice:val,l_discount:val,l_tax:val,l_returnflag:val,l_linestatus:val,l_shipdate:val,l_commitdate:val,l_receiptdate:val,l_shipinstruct:val,l_shipmode:val,l_comment:val")
TBLPROPERTIES ("hbase.table.name" = "lineitem");


insert overwrite table hbase_lineitem select * from lineitem;

Demai