https://pig.apache.org/docs/r0.11.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html

I'm not sure how Pig's HBaseStorage works. I suppose it reads all the data and then joins it as a usual dataset, so you would get serious HBase performance degradation during the read: a key-by-key read of the whole table.
1. So do the join in Pig.
2. First load the data from the HBase table, then operate on it. I don't see a case where you can use an HBase table directly in a join.
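To make point 2 concrete, here is a minimal Pig Latin sketch of "load from HBase, then join"; the table name `location`, the column family `info`, the qualifiers, and the HDFS path are assumptions for illustration, not from your setup:

```pig
-- Load the location table from HBase; '-loadKey true' makes the rowkey
-- (locationId) the first field of each tuple.
location = LOAD 'hbase://location'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:city info:country', '-loadKey true')
    AS (locationId:chararray, city:chararray, country:chararray);

-- Load the weblog data set from HDFS (tab-separated here).
weblog = LOAD '/data/weblog' USING PigStorage('\t')
    AS (locationId:chararray, url:chararray, ts:long);

-- The join happens in Pig, not in HBase: HBaseStorage has already
-- scanned the table into a regular relation by this point.
joined = JOIN weblog BY locationId, location BY locationId;
```

Note that the LOAD is a scan over the HBase table, not per-key gets driven by the join, which is why I expect the read-side degradation mentioned above.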
2014-09-28 17:02 GMT+04:00 Krishna Kalyan <[email protected]>:

> We actually have 2 data sets in HDFS: location (3-5 GB, approx 10 columns
> in each record) and weblog (2-3 TB, approx 50 columns in each record). We
> need to join the data sets on locationId, which is in both data sets.
>
> We have 2 options:
> 1. Keep both data sets in HDFS only and JOIN them on locationId, maybe
> using Pig.
> 2. Since the JOIN will be on locationId, which is the primary key of the
> location data set, store the location data set in HBase with locationId
> as the rowkey and then use a Pig query to join the weblog data set and
> the location table (using HBaseStorage).
>
> The reason to consider this idea is that reading data by key is faster in
> HBase. However, we are not sure whether, in a JOIN of the 2 data sets,
> Pig will internally pick individual location records by key, or read
> through the entire (or part of the) location table and then do the join.
> Based on this we can make the choice.
>
> We are free to use HDFS or HBase for any input or output data set. Please
> advise which option can give us better performance. Also, if possible,
> please point us to a good article on this.
>
>
> On Sun, Sep 28, 2014 at 5:51 PM, Serega Sheypak <[email protected]>
> wrote:
>
>> Store location to HDFS.
>> Store weblog to HDFS.
>> Join them.
>> Use the HBase bulk load tool to load the join result into HBase.
>>
>> What's the reason to keep the location dataset in HBase and weblogs in
>> HDFS?
>>
>> You can expect a data-load performance improvement. For me it takes a
>> few minutes to bulk load 500,000,000 records into a 10-node HBase
>> cluster with a pre-split table.
>>
>> 2014-09-28 16:04 GMT+04:00 Krishna Kalyan <[email protected]>:
>>
>>> Thanks Serega,
>>>
>>> Our use-case details:
>>> We have a location table which will be stored in HBase with locationId
>>> as the rowkey / join key.
>>> We intend to join this table with a transactional weblog file in HDFS
>>> (expected size around 2 TB).
>>> The joining query will be issued from Pig.
>>> Can we expect a performance improvement compared with a plain MapReduce
>>> approach?
>>>
>>> Regards,
>>> Krishna
>>>
>>> On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak <
>>> [email protected]> wrote:
>>>
>>>> It depends on the dataset sizes and the HBase workload. The best way
>>>> is to do the join in Pig, store the result, and then use the HBase
>>>> bulk load tool.
>>>> That's a general recommendation; I have no idea about your task
>>>> details.
>>>>
>>>> 2014-09-27 7:32 GMT+04:00 Krishna Kalyan <[email protected]>:
>>>>
>>>> > Hi,
>>>> > We have a use case that involves ETL on data coming from several
>>>> > different sources using Pig.
>>>> > We plan to store the final output table in HBase.
>>>> > What will be the performance impact if we do a join with an external
>>>> > CSV table using Pig?
>>>> >
>>>> > Regards,
>>>> > Krishna
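The HDFS-only pipeline recommended in the thread (store both data sets to HDFS, join, then bulk load) can be sketched in Pig Latin as follows; paths and schemas are hypothetical, and a plain hash join is shown because at 3-5 GB the location set is likely too big for a `'replicated'` (map-side) join, which requires the smaller relation to fit in memory:

```pig
-- Both inputs live in HDFS as tab-separated files (hypothetical paths).
weblog   = LOAD '/data/weblog'   USING PigStorage('\t')
    AS (locationId:chararray, url:chararray, ts:long);
location = LOAD '/data/location' USING PigStorage('\t')
    AS (locationId:chararray, city:chararray, country:chararray);

-- Standard reduce-side join on the shared key.
joined = JOIN weblog BY locationId, location BY locationId;

-- Persist the result; this output is the input to the HBase bulk load.
STORE joined INTO '/data/joined' USING PigStorage('\t');
```

The stored result can then be turned into HFiles and handed straight to the region servers with HBase's bulk load tools (for example `ImportTsv` with `-Dimporttsv.bulk.output`, followed by `completebulkload`), which bypasses the normal HBase write path entirely.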
