https://pig.apache.org/docs/r0.11.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html

I'm not sure how Pig's HBaseStorage works. I suppose it reads all the data and then joins it as a usual dataset, so you would get serious HBase performance degradation during the read: a key-by-key read of the whole table.
1. So do the join in Pig.
2. First load the data from the HBase table, then operate on it. I don't see a case where you can use an HBase table directly in a join.
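To make point 2 concrete, here is a minimal Pig Latin sketch of "load from HBase, then join"; the table name `location`, the column family `info`, the qualifiers, and the HDFS path are assumptions for illustration, not from your setup:

```pig
-- Load the location table from HBase; '-loadKey true' makes the rowkey
-- (locationId) the first field of each tuple.
location = LOAD 'hbase://location'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:city info:country', '-loadKey true')
    AS (locationId:chararray, city:chararray, country:chararray);

-- Load the weblog data set from HDFS (tab-separated here).
weblog = LOAD '/data/weblog' USING PigStorage('\t')
    AS (locationId:chararray, url:chararray, ts:long);

-- The join happens in Pig, not in HBase: HBaseStorage has already
-- scanned the table into a regular relation by this point.
joined = JOIN weblog BY locationId, location BY locationId;
```

Note that the LOAD is a scan over the HBase table, not per-key gets driven by the join, which is why I expect the read-side degradation mentioned above.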
2014-09-28 17:02 GMT+04:00 Krishna Kalyan <[email protected]>:

> We actually have 2 data sets in HDFS: location (3-5 GB, approx 10 columns
> in each record) and weblog (2-3 TB, approx 50 columns in each record). We
> need to join the data sets on locationId, which is in both data sets.
>
> We have 2 options:
> 1. Keep both data sets in HDFS only and JOIN them on locationId, maybe
> using Pig.
> 2. Since the JOIN will be on locationId, which is the primary key of the
> location data set, store the location data set in HBase with locationId
> as the rowkey and then use a Pig query to join the weblog data set and
> the location table (using HBaseStorage).
>
> The reason to consider this idea is that reading data by key is faster in
> HBase. However, we are not sure whether, in a JOIN of the 2 data sets,
> Pig will internally pick individual location records by key, or read
> through the entire (or part of the) location table and then do the join.
> Based on this we can make the choice.
>
> We are free to use HDFS or HBase for any input or output data set. Please
> advise which option can give us better performance. Also, if possible,
> please point us to a good article on this.
>
>
> On Sun, Sep 28, 2014 at 5:51 PM, Serega Sheypak <[email protected]>
> wrote:
>
>> Store location to HDFS.
>> Store weblog to HDFS.
>> Join them.
>> Use the HBase bulk load tool to load the join result into HBase.
>>
>> What's the reason to keep the location dataset in HBase and weblogs in
>> HDFS?
>>
>> You can expect a data-load performance improvement. For me it takes a
>> few minutes to bulk load 500,000,000 records into a 10-node HBase
>> cluster with a pre-split table.
>>
>> 2014-09-28 16:04 GMT+04:00 Krishna Kalyan <[email protected]>:
>>
>>> Thanks Serega,
>>>
>>> Our use-case details:
>>> We have a location table which will be stored in HBase with locationId
>>> as the rowkey / join key.
>>> We intend to join this table with a transactional weblog file in HDFS
>>> (expected size around 2 TB).
>>> The joining query will be issued from Pig.
>>> Can we expect a performance improvement compared with a plain MapReduce
>>> approach?
>>>
>>> Regards,
>>> Krishna
>>>
>>> On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak <
>>> [email protected]> wrote:
>>>
>>>> It depends on the dataset sizes and the HBase workload. The best way
>>>> is to do the join in Pig, store the result, and then use the HBase
>>>> bulk load tool.
>>>> That's a general recommendation; I have no idea about your task
>>>> details.
>>>>
>>>> 2014-09-27 7:32 GMT+04:00 Krishna Kalyan <[email protected]>:
>>>>
>>>> > Hi,
>>>> > We have a use case that involves ETL on data coming from several
>>>> > different sources using Pig.
>>>> > We plan to store the final output table in HBase.
>>>> > What will be the performance impact if we do a join with an external
>>>> > CSV table using Pig?
>>>> >
>>>> > Regards,
>>>> > Krishna
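The HDFS-only pipeline recommended in the thread (store both data sets to HDFS, join, then bulk load) can be sketched in Pig Latin as follows; paths and schemas are hypothetical, and a plain hash join is shown because at 3-5 GB the location set is likely too big for a `'replicated'` (map-side) join, which requires the smaller relation to fit in memory:

```pig
-- Both inputs live in HDFS as tab-separated files (hypothetical paths).
weblog   = LOAD '/data/weblog'   USING PigStorage('\t')
    AS (locationId:chararray, url:chararray, ts:long);
location = LOAD '/data/location' USING PigStorage('\t')
    AS (locationId:chararray, city:chararray, country:chararray);

-- Standard reduce-side join on the shared key.
joined = JOIN weblog BY locationId, location BY locationId;

-- Persist the result; this output is the input to the HBase bulk load.
STORE joined INTO '/data/joined' USING PigStorage('\t');
```

The stored result can then be turned into HFiles and handed straight to the region servers with HBase's bulk load tools (for example `ImportTsv` with `-Dimporttsv.bulk.output`, followed by `completebulkload`), which bypasses the normal HBase write path entirely.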
