Re: Approach: Incremental data load from HBASE

2017-01-06 Thread ayan guha
IMHO you should not "think" HBase in RDMBS terms, but you can use
ColumnFilters to filter out new records
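
A minimal sketch of that suggestion, assuming HBase 1.x, with
SingleColumnValueFilter standing in for the column filter; the column family
"cf" and a "last_update" qualifier holding ISO-format timestamps (so
lexicographic comparison matches chronological order) are hypothetical:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{CompareFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.util.Bytes

// persisted by the previous batch run (hypothetical watermark)
val lastWatermark = "2017-01-01 00:00:00"

val filter = new SingleColumnValueFilter(
  Bytes.toBytes("cf"),
  Bytes.toBytes("last_update"),
  CompareFilter.CompareOp.GREATER,   // HBase 1.x API; 2.x uses CompareOperator
  Bytes.toBytes(lastWatermark))
filter.setFilterIfMissing(true)      // skip rows that lack the column entirely

val scan = new Scan()
scan.setFilter(filter)               // the scan now returns only "new" rows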


-- 
Best Regards,
Ayan Guha


Re: Approach: Incremental data load from HBASE

2017-01-06 Thread Chetan Khatri
Ayan, thanks.
Correct, I am not thinking in RDBMS terms; I am wearing NoSQL glasses!


Re: Approach: Incremental data load from HBASE

2017-01-06 Thread Chetan Khatri
Hi Ayan,

By incremental load from HBase I mean: a weekly batch job takes rows from an
HBase table and dumps them out to Hive. The next time the job runs, it should
only pick up the newly arrived rows.

It is the same as using Sqoop for incremental load from an RDBMS to Hive with
the command below:

sqoop job --create myssb1 -- import \
  --connect jdbc:mysql://:/sakila \
  --username admin --password admin \
  --driver=com.mysql.jdbc.Driver \
  --query "SELECT address_id, address, district, city_id, postal_code,
      alast_update, cityid, city, country_id, clast_update
    FROM (SELECT a.address_id as address_id, a.address as address,
        a.district as district, a.city_id as city_id,
        a.postal_code as postal_code, a.last_update as alast_update,
        c.city_id as cityid, c.city as city, c.country_id as country_id,
        c.last_update as clast_update
      FROM sakila.address a
      INNER JOIN sakila.city c ON a.city_id = c.city_id) as sub
    WHERE $CONDITIONS" \
  --incremental lastmodified --check-column alast_update --last-value 1900-01-01 \
  --target-dir /user/cloudera/ssb7 \
  --hive-import --hive-table test.sakila -m 1 \
  --hive-drop-import-delims --map-column-java address=String

Probably I am looking for a tool from the HBase incubator family which does
the job for me; otherwise, an alternative approach is to read HBase tables
into an RDD and save the RDD to Hive, as in the sketch below.
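
A hedged sketch of that RDD-based alternative, assuming Spark 2.x with Hive
support and HBase 1.x client APIs; the table names, column family "cf",
qualifier "col", and the watermark handling are all hypothetical:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object HBaseToHiveIncremental {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hbase-to-hive-incremental")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Only scan cells written since the last run -- the HBase analogue of
    // Sqoop's --incremental lastmodified with --last-value.
    val lastRunTs = 0L  // load the real watermark from somewhere durable
    val scan = new Scan()
    scan.setTimeRange(lastRunTs, System.currentTimeMillis())

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table")
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    val rdd = spark.sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // Convert each Result to plain values, then append the new rows to Hive.
    val df = rdd.map { case (_, result) =>
      (Bytes.toString(result.getRow),
       Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))))
    }.toDF("rowkey", "col")

    df.write.mode("append").saveAsTable("test.my_hive_table")
  }
}

After a successful write, the job would persist the new watermark, which plays
the same role as Sqoop's --last-value.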

Thanks.


Re: Approach: Incremental data load from HBASE

2017-01-04 Thread ayan guha
Hi Chetan

What do you mean by incremental load from HBase? There is a timestamp
marker for each cell, but not at the row level.
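
For illustration, a small sketch of reading those per-cell timestamps with
the HBase 1.x client; the connection details, table name "my_table", and row
key "some-row" are hypothetical:

import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("my_table"))
val result = table.get(new Get(Bytes.toBytes("some-row")))

// every cell carries its own write timestamp; there is no single
// row-level "last modified" marker to check against
result.rawCells().foreach { cell =>
  val qualifier = Bytes.toString(CellUtil.cloneQualifier(cell))
  println(s"$qualifier -> ts=${cell.getTimestamp}")
}

table.close()
conn.close()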


-- 
Best Regards,
Ayan Guha


Re: Approach: Incremental data load from HBASE

2017-01-04 Thread Chetan Khatri
Ted Yu,

You understood it wrong: I said incremental load from HBase to Hive;
individually, you can call it incremental import from HBase.


Re: Approach: Incremental data load from HBASE

2016-12-23 Thread Chetan Khatri
Ted, correct. In my case I want incremental import from HBase and incremental
load to Hive. Both approaches discussed earlier, with indexing, seem accurate
to me. But just as Sqoop supports incremental import and load for an RDBMS, is
there any tool which supports incremental import from HBase?


Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
Incremental load traditionally means generating HFiles and
using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
data into HBase.

For your use case, the producer needs to find rows where the flag is 0 or 1.
After such rows are obtained, it is up to you how the result of processing
is delivered to HBase.
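
A sketch of such a flag scan, assuming HBase 1.x, a hypothetical column
family "cf" with a "flag" qualifier, and flag values stored as strings:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{CompareFilter, FilterList, PageFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.util.Bytes

def flagEquals(value: String) = new SingleColumnValueFilter(
  Bytes.toBytes("cf"), Bytes.toBytes("flag"),
  CompareFilter.CompareOp.EQUAL, Bytes.toBytes(value))

// flag == 0 OR flag == 1 ...
val flagIsZeroOrOne = new FilterList(FilterList.Operator.MUST_PASS_ONE,
  flagEquals("0"), flagEquals("1"))

// ... AND cap the batch at ~1000 rows (PageFilter applies per region
// server, so the client may see slightly more than 1000 rows)
val batch = new FilterList(FilterList.Operator.MUST_PASS_ALL,
  flagIsZeroOrOne, new PageFilter(1000))

val scan = new Scan()
scan.setFilter(batch)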

Cheers


Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Chetan Khatri
OK, sure, will ask.

But what would be a generic best-practice solution for incremental load from
HBase?


Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
I haven't used Gobblin.
You can consider asking the Gobblin mailing list about the first option.

The second option would work.


On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri 
wrote:

> Hello Guys,
>
> I would like to understand different approaches for distributed
> incremental load from HBase. Is there any *tool / incubator tool* which
> satisfies the requirement?
>
> *Approach 1:*
>
> Write a Kafka producer, manually maintain a column flag for events, and
> ingest with LinkedIn Gobblin to HDFS / S3.
>
> *Approach 2:*
>
> Run a scheduled Spark job: read from HBase, do transformations, and
> maintain the flag column at the HBase level.
>
> In both of the above approaches, I need to maintain column-level flags,
> such as 0 = default, 1 = sent, 2 = sent and acknowledged. So next time the
> producer will take another batch of 1000 rows where the flag is 0 or 1.
>
> I am looking for a best-practice approach with any distributed tool.
>
> Thanks.
>
> - Chetan Khatri
>
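
A minimal sketch of the flag handling in Approach 1 above, assuming HBase 1.x
and Kafka 0.10-era clients; the topic "events", table "my_table", column
family "cf", qualifier "flag", and broker address are all hypothetical:

import java.util.Properties

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("my_table"))
val (cf, flag) = (Bytes.toBytes("cf"), Bytes.toBytes("flag"))

// for one row of the scanned batch: publish it, then flip its flag
val rowKey = Bytes.toBytes("some-row")
producer.send(new ProducerRecord[String, String]("events", "some-row", "payload"))

// atomic 0 -> 1 transition: the put succeeds only if the flag is still 0,
// so a rerun does not resend rows that were already picked up
val put = new Put(rowKey)
put.addColumn(cf, flag, Bytes.toBytes("1"))
table.checkAndPut(rowKey, cf, flag, Bytes.toBytes("0"), put)

producer.close()
table.close()
conn.close()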