We actually have two data sets in HDFS: location (3-5 GB, approx. 10 columns
per record) and weblog (2-3 TB, approx. 50 columns per record). We need to
join them on locationId, which is present in both data sets.

We have two options:
1. Keep both data sets in HDFS only and JOIN them on locationId, maybe
using Pig.
2. Since the JOIN will be on locationId, which is the primary key of the
location data set, store the location data set in HBase with locationId as
the rowkey, and then use a Pig query to join the weblog data set with the
location table (using HBaseStorage).
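
For concreteness, the two options might look roughly like this in Pig. This
is only a sketch: the HDFS paths, the delimiter, the HBase table name
('location'), the column family ('info') and every field other than
locationId are placeholders, and only two columns per data set are shown.

```pig
-- Option 1: both data sets in HDFS (paths and schemas are hypothetical)
weblog   = LOAD '/data/weblog' USING PigStorage('\t')
           AS (locationId:chararray, url:chararray);
location = LOAD '/data/location' USING PigStorage('\t')
           AS (locationId:chararray, city:chararray);
joined_hdfs = JOIN weblog BY locationId, location BY locationId;

-- Option 2: location stored in HBase with locationId as the rowkey.
-- '-loadKey true' returns the rowkey as the first field of each tuple.
location_hb = LOAD 'hbase://location'
              USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
                  'info:city', '-loadKey true')
              AS (locationId:chararray, city:chararray);
joined_hbase = JOIN weblog BY locationId, location_hb BY locationId;
```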

The reason for considering this idea is that reading data by key is fast in
HBase. However, we are not sure whether, when joining the two data sets, Pig
will internally fetch each individual location record by key, or whether it
will read through all (or a range) of the location table and then do the
join. Our choice depends on this.

We are free to use HDFS or HBase for any input or output data set. Please
advise which option would give us better performance. Also, if possible,
please point us to a good article on this.


On Sun, Sep 28, 2014 at 5:51 PM, Serega Sheypak <serega.shey...@gmail.com>
wrote:

> store location to hdfs
> store weblog to hdfs
> join them
> use HBase bulk load tool to load join result to hbase.
>
> What's the reason to keep location dataset in hbase and weblogs in hdfs?
>
> You can expect a data load performance improvement. For me it takes a few
> minutes to bulk load 500.000.000 records into a 10-node HBase cluster with
> a pre-split table.
>
> 2014-09-28 16:04 GMT+04:00 Krishna Kalyan <krishnakaly...@gmail.com>:
>
>> Thanks Serega,
>>
>> Our use case details:
>> We have a location table which will be stored in HBase with locationID as
>> the rowkey / join key.
>> We intend to join this table with a transactional WebLog file in HDFS
>> (expected size is around 2 TB).
>> The join query will be run from Pig.
>> Can we expect a performance improvement compared with the MapReduce
>> approach?
>>
>> Regards,
>> Krishna
>>
>> On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak <serega.shey...@gmail.com>
>> wrote:
>>
>>> Depends on the dataset sizes and the HBase workload. The best way is to
>>> do the join in Pig, store the result, and then use the HBase bulk load
>>> tool.
>>> It's a general recommendation. I have no idea about your task details.
>>>
>>> 2014-09-27 7:32 GMT+04:00 Krishna Kalyan <krishnakaly...@gmail.com>:
>>>
>>> > Hi,
>>> > We have a use case that involves ETL on data coming from several
>>> > different sources using Pig.
>>> > We plan to store the final output table in HBase.
>>> > What will be the performance impact if we do a join with an external
>>> > CSV table using Pig?
>>> >
>>> > Regards,
>>> > Krishna
>>> >
>>>
>>
>>
>
