Re: Hbase Timestamp Queries

2015-06-10 Thread Krishna Kalyan
Hi Talat,
That should work.
Another example would be something like the one below:

test = LOAD '$TEST'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf_data:name
cf_data:age', '-loadKey true -maxTimestamp $test_date')
as (rowkey, name, age); -- with -loadKey true, the row key comes first


On Wed, Jun 10, 2015 at 12:57 PM, Talat Uyarer ta...@uyarer.com wrote:

 Hi Ted Yu,

 I guess Krishna was referring to Pig's HBaseStorage class; I found this
 out by searching for the class on Google. I think I have found a solution
 for my problem: I can use the Scan.setTimeRange [0] method. If I want to
 get records with timestamps smaller than a given timestamp, I set
 minTimestamp to 0 and maxTimestamp to that timestamp. I believe this
 solves my problem.

 Thanks Ted and Krishna.

 [0]
 https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setTimeRange(long,%20long)
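
 For reference, a minimal sketch of that approach with the HBase Java
 client (the table name and the command-line argument below are
 hypothetical, not from this thread):

 import java.io.IOException;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.client.HTable;
 import org.apache.hadoop.hbase.client.Result;
 import org.apache.hadoop.hbase.client.ResultScanner;
 import org.apache.hadoop.hbase.client.Scan;
 import org.apache.hadoop.hbase.util.Bytes;

 public class TimeRangeScan {
     public static void main(String[] args) throws IOException {
         Configuration conf = HBaseConfiguration.create();
         HTable table = new HTable(conf, "test");     // hypothetical table
         long maxTimestamp = Long.parseLong(args[0]); // the cut-off point
         Scan scan = new Scan();
         // [0, maxTimestamp): only cells written before the cut-off match,
         // since the upper bound of setTimeRange is exclusive.
         scan.setTimeRange(0, maxTimestamp);
         ResultScanner scanner = table.getScanner(scan);
         try {
             for (Result result : scanner) {
                 System.out.println(Bytes.toString(result.getRow()));
             }
         } finally {
             scanner.close();
             table.close();
         }
     }
 }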

 2015-06-09 16:08 GMT+03:00 Ted Yu yuzhih...@gmail.com:
  Is the HBaseStorage class in the hbase repo?
 
  I can't find it in 0.98 or hbase-1 branches.
 
  Cheers
 
  On Tue, Jun 9, 2015 at 3:51 AM, Krishna Kalyan krishnakaly...@gmail.com
 
  wrote:
 
  Yes, you can. Have a look at the HBaseStorage class.
  On 9 Jun 2015 1:04 pm, Talat Uyarer ta...@uyarer.com wrote:
 
   Hi,
  
    Can I use HBase's timestamps to get rows greater/smaller than a given
    timestamp? Is that possible?
  
   Thanks
  
   --
   Talat UYARER
   Websitesi: http://talat.uyarer.com
   Twitter: http://twitter.com/talatuyarer
   Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
  
 



 --
 Talat UYARER
 Websitesi: http://talat.uyarer.com
 Twitter: http://twitter.com/talatuyarer
 Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304



Re: Hbase Timestamp Queries

2015-06-09 Thread Krishna Kalyan
Yes, you can. Have a look at the HBaseStorage class.
On 9 Jun 2015 1:04 pm, Talat Uyarer ta...@uyarer.com wrote:

 Hi,

 Can I use HBase's timestamps to get rows greater/smaller than a given
 timestamp? Is that possible?

 Thanks

 --
 Talat UYARER
 Websitesi: http://talat.uyarer.com
 Twitter: http://twitter.com/talatuyarer
 Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304



Re: Questions related to HBase general use

2015-05-13 Thread Krishna Kalyan
I know that BigInsights comes with BigSQL, which interacts with HBase as
well; have you considered that option?
We have a similar use case using BigInsights 2.1.2.


On Thu, May 14, 2015 at 4:56 AM, Nick Dimiduk ndimi...@gmail.com wrote:

 + Swarnim, who's an expert on HBase/Hive integration.

 Yes, snapshots may be interesting for you. I believe Hive can access HBase
 timestamps, exposed as a virtual column. It's assumed to be the same
 across the whole row, however, not per cell.

 On Sun, May 10, 2015 at 9:14 PM, Jerry He jerry...@gmail.com wrote:

  Hi, Yong
 
  You have a good understanding of the benefits of HBase already.
  Generally speaking, HBase is suitable for real-time read/write access to
  your big data set.
  Regarding the HBase performance evaluation tool: the 'read' test uses
  HBase 'get'. For 1m rows, the test issues 1m 'get' calls (and RPCs) to
  the server. The 'scan' test scans the table and transfers the rows to the
  client in batches (e.g. 100 rows at a time), so the whole test takes a
  shorter time to complete for the same number of rows.
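  As a rough illustration of that difference (the table name and key
  format here are hypothetical):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class GetVsScan {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "TestTable"); // hypothetical

          // 'read' style: one 'get' (one RPC) per row, so 1m rows means
          // roughly 1m round trips to the server.
          for (long i = 0; i < 1000000L; i++) {
              Get get = new Get(Bytes.toBytes(i));
              Result r = table.get(get);
          }

          // 'scan' style: rows are shipped to the client in batches.
          // setCaching(100) fetches 100 rows per RPC, so the same 1m rows
          // need only ~10k round trips.
          Scan scan = new Scan();
          scan.setCaching(100);
          ResultScanner scanner = table.getScanner(scan);
          for (Result r : scanner) {
              // process r
          }
          scanner.close();
          table.close();
      }
  }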
  The Hive/HBase integration, as you said, needs more consideration.
  1) Performance. Hive accesses HBase via the HBase client API, which
  involves going to the HBase server for all data access. This will slow
  things down. There are a couple of things you can explore, e.g. the
  Hive/HBase snapshot integration, which provides direct access to HBase
  hfiles.
  2) In your email, you are interested in HBase's capability of storing
  multiple versions of data. You need to consider whether Hive supports
  this HBase feature, i.e. gives you access to multiple versions. As far as
  I can remember, it does not, fully.
 
  Jerry
 
 
  On Thu, May 7, 2015 at 6:18 PM, java8964 java8...@hotmail.com wrote:
 
   Hi,
   I am kind of new to HBase. Currently our production runs IBM BigInsights
   V3, which comes with Hadoop 2.2 and HBase 0.96.0.
   We are mostly using HDFS and Hive/Pig for our big data project, and it
   works very well for our big datasets. Right now, we have one dataset
   that needs to be loaded from MySQL, about 100G, with a few GBs of
   changes daily. This is a very important slowly changing dimension
   dataset that we would like to keep in sync between MySQL and the big
   data platform.
   I am thinking of using HBase to store it, instead of refreshing the
   whole dataset in HDFS, because:
   1) HBase makes merging the changes very easy.
   2) HBase can store all the changes in history, as an out-of-the-box
   feature. We will replicate all the changes from the MySQL binlog level,
   and we could keep all changes (or a long history) in HBase, which can
   give us insight that cannot easily be obtained in HDFS.
   3) HBase gives us fast access to the data by key, for some cases.
   4) HBase is available out of the box.
   What I am not sure about is the Hive/HBase integration. Hive is the top
   tool in our environment. If one dataset is stored in HBase (even only
   about 100G for now), the join between it and the other big datasets in
   HDFS worries me. I have read quite a lot about Hive/HBase integration
   and feel that it is not really mature, as there are not many usage cases
   I can find online, especially on performance. Quite a few JIRAs related
   to making Hive use HBase efficiently in MR jobs are still pending.
   I want to know other people's experience of using HBase in this way. I
   understand HBase is not designed as a storage system for a data
   warehouse or an analytics engine, but the benefits of using HBase in
   this case still attract me. If my use of HBase is mostly reads or full
   scans of the data, how bad is it compared to HDFS in the same cluster?
   3x? 5x?
   To help me understand the read throughput of HBase, I used the HBase
   performance evaluation tool, but the output is quite confusing. I have 2
   clusters: one with 5 nodes (3 slaves), all running on VMs (each with 24G
   + 4 cores, so the cluster has 12 mappers + 6 reducers); the other a real
   cluster with 5 nodes (3 slaves) with 64G + 24 cores each (48 mapper
   slots + 24 reducer slots). Below is the result of running sequentialRead
   3 on the better cluster:
    15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
    15/05/07 17:26:50 INFO mapred.JobClient:   File System Counters
    15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_READ=546
    15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_WRITTEN=7425074
    15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_READ=2700
    15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_WRITTEN=405
    15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.JobCounter
    15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_MAPS=30
    15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_REDUCES=1
    15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=2905167
    15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=11340
    15/05/07 17:26:50 

Incorrect Dump using HBase Storage Class

2014-12-28 Thread Krishna Kalyan
Hi,
Happy holidays :).
I have 2 different Pig scripts with the statements below:
(1)
GeoRef_IP = LOAD '$TBL_GEOGRAPHY' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf_data:cq_geog_id
cf_data:cq_pc_sector cf_data:cq_district_code cf_data:cq_postal_town
cf_data:cq_postal_county cf_data:cq_mosaic_code cf_data:cq_mosaic_code_desc
cf_data:cq_mosaic_group cf_data:cq_sales_territory cf_data:cq_sales_area
cf_data:cq_sales_region cf_data:cq_dqtimestamp cf_data:cq_checkarray',
'-loadKey true');

and
(2)
GeoRef_IP = LOAD '$TBL_GEOGRAPHY' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf_data:cq_geog_id
cf_data:cq_pc_sector cf_data:cq_district_code cf_data:cq_postal_town
cf_data:cq_postal_county cf_data:cq_mosaic_code cf_data:cq_mosaic_code_desc
cf_data:cq_mosaic_group cf_data:cq_sales_territory cf_data:cq_sales_area
cf_data:cq_sales_region cf_data:cq_dqtimestamp cf_data:cq_checkarray',
'-loadKey true') as
(postcode,geog_id,pc_sector,district_code,postal_town,postal_county,mosaic_code,mosaic_code_desc,mosaic_group,sales_territory,sales_area,sales_region,dqtimestamp,checkarray);

The only difference is the AS clause.

Now, for example, a FOREACH projecting $0, $4, $5 followed by a DUMP gives
me different results for statements (1) and (2), where (1) is correct.

Has anyone faced this behavior before?

Regards,
Krishna


Re: Unable to fetch data from HBase.

2014-11-27 Thread Krishna Kalyan
Restart your ZooKeeper.
Restart your HBase.

This might be a short-term fix.

Thanks,
Krishna

On Thu, Nov 27, 2014 at 7:57 PM, dhamodharan.ramalin...@tcs.com wrote:

 Hi,

 The issue reported earlier is resolved; I now have a new issue.

 When I execute the list command from the hbase shell, I get the
 following error:

 Can't get master address from ZooKeeper; znode data == null

 Any help is greatly appreciated.

 Thanks & Regards
 Dhamodharan Ramalingam



 From:   Stack st...@duboce.net
 To: Hbase-User user@hbase.apache.org
 Date:   11/27/2014 06:58 PM
 Subject:Re: Unable to fetch data from HBase.
 Sent by:saint@gmail.com



 Is the hbase server running? Is it at localhost/127.0.0.1:60020? You
 are getting connection refused...
 St.Ack

 On Thu, Nov 27, 2014 at 5:13 AM, dhamodharan.ramalin...@tcs.com wrote:

  Hi,
 
  I am using Hadoop 2.5.1 and HBase 0.98.8-hadoop2 in stand-alone mode.
  When I use the following client-side code:
 
  public static void main(final String[] args) {
      HTableInterface table = null;
      try {
          final HBaseManager tableManager = HBaseManager.getInstance();
          table = tableManager.getHTable(HBaseConstants.TABLE_EMP_DETAILS);

          // Full table scan: iterate over every row in the employee table.
          final Scan scan = new Scan();
          final ResultScanner resultScanner = table.getScanner(scan);

          for (final Result result : resultScanner) {
              LOG.debug("The Employee id : "
                      + HBaseHelper.getValueFromResult(result,
                              HBaseConstants.COLUME_FAMILY,
                              HBaseConstants.EMP_ID));
              LOG.debug("The Employee Name : "
                      + HBaseHelper.getValueFromResult(result,
                              HBaseConstants.COLUME_FAMILY,
                              HBaseConstants.EMP_NAME));
          }

      } catch (final Exception ex) {
          // TODO Auto-generated catch block
      }
  }
 
  When I execute the code, I get the following error. I have the firewall
  disabled for the port, but I am still getting this error.
 
  27 Nov 2014 18:33:01,484 196074 [main] DEBUG org.apache.hadoop.ipc.RpcClient - Connecting to localhost/127.0.0.1:60020
  27 Nov 2014 18:33:02,484 197074 [main] DEBUG org.apache.hadoop.ipc.RpcClient - IPC Client (21276817) connection to localhost/127.0.0.1:60020 from 394728: closing ipc connection to localhost/127.0.0.1:60020: Connection refused: no further information
  java.net.ConnectException: Connection refused: no further information
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
      at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
      at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupConnection(RpcClient.java:578)
      at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:868)
      at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1543)
      at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1442)
      at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
      at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
      at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:30363)
      at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1546)
      at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:717)
      at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:715)
      at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:117)
      at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:721)
      at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1140)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1202)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1092)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1049)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:890)
      at 

Version in HBase

2014-11-11 Thread Krishna Kalyan
Hi,
Is it possible to do a
select * from table_name where version <= somedate; using the HBase APIs?
(Scanning for values where version <= somedate.)
Could you please point me to appropriate links on how to achieve this?


Regards,
Krishna


Re: Version in HBase

2014-11-11 Thread Krishna Kalyan
For example, for table 'test_table', the values inserted are:

Row1 - Val1 = t
Row1 - Val2 = t + 3
Row1 - Val3 = t + 5

Row2 - Val1 = t
Row2 - Val2 = t + 3
Row2 - Val3 = t + 5

on scan 'test_table' where version <= t + 4, it should return
Row1 - Val2 = t + 3
Row2 - Val2 = t + 3

How do I achieve timestamp-based scans?
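
(A sketch of one way to get that behaviour with the Java client: cap the
scan's time range at t + 4 and keep only the newest version inside it.
This assumes the column family keeps enough versions; the table name and
the command-line argument for t are illustrative.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class AsOfScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test_table");
        long t = Long.parseLong(args[0]); // base timestamp from the example

        Scan scan = new Scan();
        // The upper bound is exclusive, so use t + 4 + 1 to include cells at t + 4.
        scan.setTimeRange(0, t + 4 + 1);
        scan.setMaxVersions(1); // newest version within the range only

        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            // For the data above, Row1 and Row2 each come back with the
            // value written at t + 3 (Val2).
            System.out.println(result);
        }
        scanner.close();
        table.close();
    }
}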

Thanks and Regards,
Krishna




On Wed, Nov 12, 2014 at 10:56 AM, Krishna Kalyan krishnakaly...@gmail.com
wrote:

 Hi,
 Is it possible to do a
 select * from table_name where version <= somedate; using the HBase APIs?
 (Scanning for values where version <= somedate.)
 Could you please point me to appropriate links on how to achieve this?


 Regards,
 Krishna


Re: how to design hbase schema?

2014-11-02 Thread Krishna Kalyan
Some resources:
http://hbase.apache.org/book/schema.casestudies.html
http://www.slideshare.net/cloudera/5-h-base-schemahbasecon2012
http://www.evanconkle.com/2011/11/hbase-tutorial-creating-table/
http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies
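
One pattern that comes up repeatedly in those case studies, sketched with
the Java client (the table, family and key names below are all made up):
model a 1:n relationship by folding the parent key into the child's row
key, so related rows sort together and can be read with one range scan.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class OneToManyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable orders = new HTable(conf, "orders"); // hypothetical child table

        // 1:n (customer -> orders): parent id + separator + child id as row key.
        Put put = new Put(Bytes.toBytes("customer42|order0007"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("amount"), Bytes.toBytes("42.50"));
        orders.put(put);

        // All orders of one customer are then a single contiguous range scan
        // ('}' sorts just after '|', closing the key range for this parent).
        Scan scan = new Scan(Bytes.toBytes("customer42|"), Bytes.toBytes("customer42}"));
        ResultScanner scanner = orders.getScanner(scan);
        scanner.close();
        orders.close();
    }
}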

On Sun, Nov 2, 2014 at 6:53 PM, jackie jackiehbaseu...@126.com wrote:

 Hi!
   I have a data warehouse application (based on an Oracle database) and I
 want to move it to HBase. How do I design the HBase tables, especially for
 the 1:n or n:m relationships in the Oracle database?


  Thank you very much!


 jackie


Duplicate Value Inserts in HBase

2014-10-21 Thread Krishna Kalyan
Hi,
I have an HBase table which is populated from Pig using PigStorage.
While inserting, suppose I have a duplicate value for a rowkey.
Is there a way to prevent an update?
I want to maintain the version history for my values, which are unique.

Regards,
Krishna


Re: Duplicate Value Inserts in HBase

2014-10-21 Thread Krishna Kalyan
Thanks Jean,
If I put the same value into my table for a particular column of a rowkey,
I want HBase to reject this value and retain the old value with its old
timestamp. In other words, update only when the value changes.

Regards,
Krishna

On Tue, Oct 21, 2014 at 6:02 PM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 Hi Krishna,

 HBase will store them in the same row, same cell, but you will have 2
 versions. If you want to keep just one, set VERSIONS=1 on the table
 side and only one will be stored. Is that what you mean?

 JM

 2014-10-21 8:29 GMT-04:00 Krishna Kalyan krishnakaly...@gmail.com:

  Hi,
  I have an HBase table which is populated from Pig using PigStorage.
  While inserting, suppose I have a duplicate value for a rowkey.
  Is there a way to prevent an update?
  I want to maintain the version history for my values, which are unique.
 
  Regards,
  Krishna
 



Re: Duplicate Value Inserts in HBase

2014-10-21 Thread Krishna Kalyan
Thanks for your replies, Jean and Dhaval.

On Tue, Oct 21, 2014 at 6:57 PM, Dhaval Shah prince_mithi...@yahoo.co.in
wrote:

 You can achieve what you want using versions and some hackery with
 timestamps


 Sent from my T-Mobile 4G LTE Device


  Original message 
 From: Jean-Marc Spaggiari jean-m...@spaggiari.org
 Date:10/21/2014  9:02 AM  (GMT-05:00)
 To: user user@hbase.apache.org
 Cc:
 Subject: Re: Duplicate Value Inserts in HBase

 You can do check-and-puts to validate whether the value is already there,
 but it's slower...
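
 (A sketch of the read-then-compare version of that, with hypothetical
 table/column names. checkAndPut in 0.98 only checks for equality before
 writing, so the plain "write only when the value changed" flow below uses
 a Get followed by a conditional Put; it is not atomic, which is fine for
 a single-writer load job but not for concurrent writers.)

 import java.util.Arrays;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.client.Get;
 import org.apache.hadoop.hbase.client.HTable;
 import org.apache.hadoop.hbase.client.Put;
 import org.apache.hadoop.hbase.client.Result;
 import org.apache.hadoop.hbase.util.Bytes;

 public class PutIfChanged {
     static final byte[] CF = Bytes.toBytes("cf_data");   // hypothetical family
     static final byte[] COL = Bytes.toBytes("cq_value"); // hypothetical column

     // Writes newValue only when it differs from the stored value, so an
     // unchanged cell keeps its old timestamp and no extra version is added.
     static void putIfChanged(HTable table, byte[] row, byte[] newValue)
             throws Exception {
         Get get = new Get(row);
         get.addColumn(CF, COL);
         Result current = table.get(get);
         byte[] oldValue = current.getValue(CF, COL);
         if (oldValue != null && Arrays.equals(oldValue, newValue)) {
             return; // unchanged: skip the Put, keep the old timestamp
         }
         Put put = new Put(row);
         put.add(CF, COL, newValue);
         table.put(put);
     }

     public static void main(String[] args) throws Exception {
         Configuration conf = HBaseConfiguration.create();
         HTable table = new HTable(conf, "test_table"); // hypothetical
         putIfChanged(table, Bytes.toBytes("Row1"), Bytes.toBytes("Val1"));
         table.close();
     }
 }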

 2014-10-21 8:50 GMT-04:00 Krishna Kalyan krishnakaly...@gmail.com:

  Thanks Jean,
  If I put the same value into my table for a particular column of a
  rowkey, I want HBase to reject this value and retain the old value with
  its old timestamp. In other words, update only when the value changes.
 
  Regards,
  Krishna
 
  On Tue, Oct 21, 2014 at 6:02 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
   Hi Krishna,
  
   HBase will store them in the same row, same cell, but you will have 2
   versions. If you want to keep just one, set VERSIONS=1 on the table
   side and only one will be stored. Is that what you mean?
  
   JM
  
   2014-10-21 8:29 GMT-04:00 Krishna Kalyan krishnakaly...@gmail.com:
  
Hi,
I have an HBase table which is populated from Pig using PigStorage.
While inserting, suppose I have a duplicate value for a rowkey.
Is there a way to prevent an update?
I want to maintain the version history for my values, which are unique.
   
Regards,
Krishna
   
  
 



Re: Pig HBase integration

2014-09-29 Thread Krishna Kalyan
Thank you so much Serega.

Regards,
Krishna

On Sun, Sep 28, 2014 at 11:01 PM, Serega Sheypak serega.shey...@gmail.com
wrote:


 https://pig.apache.org/docs/r0.11.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html
 I'm not sure how Pig's HBaseStorage works. I suppose it would read all
 the data and then join it as a usual dataset, so you could see serious
 HBase performance degradation during the read: a key-by-key read of the
 whole table.
 1. So, do the join in Pig.
 2. You first load data from the HBase table and then operate on it. I
 don't see a case where you can use an HBase table directly in a join.


 2014-09-28 17:02 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:


 We actually have 2 data sets in HDFS: location (3-5 GB, approx 10 columns
 in each record) and weblog (2-3 TB, approx 50 columns in each record). We
 need to join the data sets on locationId, which is in both data sets.

 We have 2 options:
 1. Keep both data sets in HDFS only and JOIN them on locationId, perhaps
 using Pig.
 2. Since the JOIN is on locationId, which is the primary key for the
 location data set, store the location data set with locationId as the
 rowkey in HBase and then use a Pig query to join the weblog data set and
 the location table (using HBaseStorage).

 The reason for considering this idea is that reading data by key is fast
 in HBase; however, we are not sure whether, in a JOIN of the 2 data sets,
 Pig will internally pick individual location records by key or read
 through all (or batches of) records from the location table and then do
 the join. Based on this we can make the choice.

 We are free to use HDFS or HBase for any input or output data set; please
 advise which option can give us better performance. Also, if possible,
 please point us to a good article on this.


 On Sun, Sep 28, 2014 at 5:51 PM, Serega Sheypak serega.shey...@gmail.com
  wrote:

 store location to hdfs
 store weblog to hdfs
 join them
 use the HBase bulk load tool to load the join result into hbase.

 What's the reason to keep the location dataset in hbase and weblogs in hdfs?

 You can expect a data load performance improvement. For me it takes a few
 minutes to bulk load 500,000,000 records into a 10-node hbase cluster with
 a pre-split table.
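
 (For reference, the "bulk load tool" here is the HFileOutputFormat2 /
 LoadIncrementalHFiles pair from the HBase mapreduce package
 (HFileOutputFormat in older releases). A skeletal driver, assuming the
 join result sits in HDFS as CSV; the table name, column names and CSV
 layout are all made up:)

 import java.io.IOException;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.client.HTable;
 import org.apache.hadoop.hbase.client.Put;
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
 import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
 import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
 import org.apache.hadoop.hbase.util.Bytes;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class BulkLoadDriver {

     // Hypothetical mapper: turns one CSV line of the join result into a Put.
     public static class JoinResultMapper
             extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
         @Override
         protected void map(LongWritable key, Text line, Context ctx)
                 throws IOException, InterruptedException {
             String[] f = line.toString().split(",");
             Put put = new Put(Bytes.toBytes(f[0])); // locationId as row key
             put.add(Bytes.toBytes("cf"), Bytes.toBytes("payload"),
                     Bytes.toBytes(f[1]));
             ctx.write(new ImmutableBytesWritable(put.getRow()), put);
         }
     }

     public static void main(String[] args) throws Exception {
         Configuration conf = HBaseConfiguration.create();
         HTable table = new HTable(conf, "location"); // hypothetical table
         Job job = Job.getInstance(conf, "bulkload-join-result");
         job.setJarByClass(BulkLoadDriver.class);
         job.setMapperClass(JoinResultMapper.class);
         job.setMapOutputKeyClass(ImmutableBytesWritable.class);
         job.setMapOutputValueClass(Put.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));   // join result
         FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile staging
         // Wires in the reducer, partitioner and output format so that the
         // generated HFiles line up with the table's region boundaries.
         HFileOutputFormat2.configureIncrementalLoad(job, table);
         if (job.waitForCompletion(true)) {
             // Moves the finished HFiles into the regions, bypassing the
             // normal write path entirely.
             new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
         }
         table.close();
     }
 }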

 2014-09-28 16:04 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:

 Thanks Serega,

 Our use case details:
 We have a location table which will be stored in HBase with locationID
 as the rowkey / join key.
 We intend to join this table with a transactional weblog file in HDFS
 (expected size around 2 TB).
 The join query will be issued from Pig.
 Can we expect a performance improvement compared with the MapReduce
 approach?

 Regards,
 Krishna

 On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak 
 serega.shey...@gmail.com wrote:

 It depends on the dataset sizes and the HBase workload. The best way is to
 do the join in Pig, store it, and then use the HBase bulk load tool.
 That's a general recommendation; I have no idea about your task details.

 2014-09-27 7:32 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:

  Hi,
  We have a use case that involves ETL on data coming from several
  different sources using Pig.
  We plan to store the final output table in HBase.
  What will be the performance impact if we do a join with an external CSV
  table using Pig?
 
  Regards,
  Krishna
 


Re: Pig HBase integration

2014-09-28 Thread Krishna Kalyan
Thanks Serega,

Our use case details:
We have a location table which will be stored in HBase with locationID as
the rowkey / join key.
We intend to join this table with a transactional weblog file in HDFS
(expected size around 2 TB).
The join query will be issued from Pig.
Can we expect a performance improvement compared with the MapReduce
approach?

Regards,
Krishna

On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak serega.shey...@gmail.com
wrote:

 It depends on the dataset sizes and the HBase workload. The best way is to
 do the join in Pig, store it, and then use the HBase bulk load tool.
 That's a general recommendation; I have no idea about your task details.

 2014-09-27 7:32 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:

  Hi,
  We have a use case that involves ETL on data coming from several
  different sources using Pig.
  We plan to store the final output table in HBase.
  What will be the performance impact if we do a join with an external CSV
  table using Pig?
 
  Regards,
  Krishna
 



Re: Pig HBase integration

2014-09-28 Thread Krishna Kalyan
We actually have 2 data sets in HDFS: location (3-5 GB, approx 10 columns
in each record) and weblog (2-3 TB, approx 50 columns in each record). We
need to join the data sets on locationId, which is in both data sets.

We have 2 options:
1. Keep both data sets in HDFS only and JOIN them on locationId, perhaps
using Pig.
2. Since the JOIN is on locationId, which is the primary key for the
location data set, store the location data set with locationId as the
rowkey in HBase and then use a Pig query to join the weblog data set and
the location table (using HBaseStorage).

The reason for considering this idea is that reading data by key is fast
in HBase; however, we are not sure whether, in a JOIN of the 2 data sets,
Pig will internally pick individual location records by key or read
through all (or batches of) records from the location table and then do
the join. Based on this we can make the choice.

We are free to use HDFS or HBase for any input or output data set; please
advise which option can give us better performance. Also, if possible,
please point us to a good article on this.


On Sun, Sep 28, 2014 at 5:51 PM, Serega Sheypak serega.shey...@gmail.com
wrote:

 store location to hdfs
 store weblog to hdfs
 join them
 use the HBase bulk load tool to load the join result into hbase.

 What's the reason to keep the location dataset in hbase and weblogs in hdfs?

 You can expect a data load performance improvement. For me it takes a few
 minutes to bulk load 500,000,000 records into a 10-node hbase cluster with
 a pre-split table.

 2014-09-28 16:04 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:

 Thanks Serega,

 Our use case details:
 We have a location table which will be stored in HBase with locationID as
 the rowkey / join key.
 We intend to join this table with a transactional weblog file in HDFS
 (expected size around 2 TB).
 The join query will be issued from Pig.
 Can we expect a performance improvement compared with the MapReduce
 approach?

 Regards,
 Krishna

 On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak serega.shey...@gmail.com
  wrote:

 It depends on the dataset sizes and the HBase workload. The best way is to
 do the join in Pig, store it, and then use the HBase bulk load tool.
 That's a general recommendation; I have no idea about your task details.

 2014-09-27 7:32 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:

  Hi,
   We have a use case that involves ETL on data coming from several
   different sources using Pig.
   We plan to store the final output table in HBase.
   What will be the performance impact if we do a join with an external CSV
   table using Pig?
 
  Regards,
  Krishna
 


Pig HBase integration

2014-09-26 Thread Krishna Kalyan
Hi,
We have a use case that involves ETL on data coming from several different
sources using Pig.
We plan to store the final output table in HBase.
What will be the performance impact if we do a join with an external CSV
table using Pig?

Regards,
Krishna