Ayan, thanks.
Correct, I am not thinking in RDBMS terms; I am wearing NoSQL glasses!


On Fri, Jan 6, 2017 at 3:23 PM, ayan guha <guha.a...@gmail.com> wrote:

> IMHO you should not "think" of HBase in RDBMS terms, but you can use
> ColumnFilters to filter for the new records
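The idea of scanning only new records can be sketched without a live cluster. In real HBase the Java client's Scan supports a time range (Scan.setTimeRange) and column filters; the in-memory table and function below are a hypothetical, minimal simulation of that selection logic:

```python
# Simulated HBase table: row key -> {column: (value, write_timestamp)}.
# This stands in for a real time-range Scan; all names are illustrative.
table = {
    "row1": {"cf:val": ("a", 100)},
    "row2": {"cf:val": ("b", 250)},
    "row3": {"cf:val": ("c", 400)},
}

def scan_new_rows(table, last_run_ts):
    """Return row keys holding at least one cell written after last_run_ts."""
    return sorted(
        key
        for key, cells in table.items()
        if any(ts > last_run_ts for _, ts in cells.values())
    )

print(scan_new_rows(table, 200))  # -> ['row2', 'row3']
```

Each batch run would record its own timestamp as the next run's `last_run_ts`, which is exactly the checkpoint role `--last-value` plays in Sqoop.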
>
> On Fri, Jan 6, 2017 at 7:22 PM, Chetan Khatri <chetan.opensou...@gmail.com
> > wrote:
>
>> Hi Ayan,
>>
>> By incremental load from HBase I mean: a weekly batch job takes rows from
>> an HBase table and dumps them out to Hive. The next time the job runs, it
>> should pick up only the newly arrived rows.
>>
>> This is the same as using Sqoop for an incremental load from an RDBMS to
>> Hive with the command below:
>>
>> sqoop job --create myssb1 -- import \
>>   --connect jdbc:mysql://<hostname>:<port>/sakila \
>>   --username admin --password admin \
>>   --driver=com.mysql.jdbc.Driver \
>>   --query "SELECT address_id, address, district, city_id, postal_code,
>>     alast_update, cityid, city, country_id, clast_update
>>     FROM (SELECT a.address_id as address_id, a.address as address,
>>       a.district as district, a.city_id as city_id,
>>       a.postal_code as postal_code, a.last_update as alast_update,
>>       c.city_id as cityid, c.city as city, c.country_id as country_id,
>>       c.last_update as clast_update
>>       FROM sakila.address a
>>       INNER JOIN sakila.city c ON a.city_id = c.city_id) as sub
>>     WHERE \$CONDITIONS" \
>>   --incremental lastmodified --check-column alast_update \
>>   --last-value 1900-01-01 \
>>   --target-dir /user/cloudera/ssb7 \
>>   --hive-import --hive-table test.sakila -m 1 \
>>   --hive-drop-import-delims \
>>   --map-column-java address=String
>>
>> I am probably looking for a tool from the HBase / incubator family that
>> does this job for me; an alternative approach would be to read the HBase
>> tables into an RDD and save the RDD to Hive.
>>
>> Thanks.
>>
>>
>> On Thu, Jan 5, 2017 at 2:02 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi Chetan
>>>
>>> What do you mean by incremental load from HBase? There is a timestamp
>>> marker for each cell, but not at the row level.
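The cell-versus-row distinction above can be made concrete: every cell carries its own timestamp, so a "row modified time" has to be derived, for example as the maximum over the row's cells. A small illustrative sketch with in-memory data (not a real HBase API):

```python
# Each cell is (value, timestamp); the row itself has no timestamp of its own.
row = {
    "cf:name": ("alice", 120),
    "cf:city": ("pune", 300),   # this cell was updated later than the others
}

def row_last_modified(row):
    """Derive a row-level marker as the newest cell timestamp in the row."""
    return max(ts for _, ts in row.values())

print(row_last_modified(row))  # -> 300
```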
>>>
>>> On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>>
>>>> Ted Yu,
>>>>
>>>> You understood wrong; I said incremental load from HBase to Hive, or,
>>>> taken on its own, incremental import from HBase.
>>>>
>>>> On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> Incremental load traditionally means generating HFiles and
>>>>> using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load
>>>>> the data into HBase.
>>>>>
>>>>> For your use case, the producer needs to find rows where the flag is 0
>>>>> or 1.
>>>>> After such rows are obtained, it is up to you how the result of
>>>>> processing is delivered to HBase.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
>>>>> chetan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> OK, sure, I will ask.
>>>>>>
>>>>>> But what would be a generic best-practice solution for incremental
>>>>>> load from HBase?
>>>>>>
>>>>>> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> I haven't used Gobblin.
>>>>>>> You can consider asking the Gobblin mailing list about the first
>>>>>>> option.
>>>>>>>
>>>>>>> The second option would work.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
>>>>>>> chetan.opensou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello Guys,
>>>>>>>>
>>>>>>>> I would like to understand the different approaches for distributed
>>>>>>>> incremental load from HBase. Is there any *tool / incubator tool*
>>>>>>>> that satisfies this requirement?
>>>>>>>>
>>>>>>>> *Approach 1:*
>>>>>>>>
>>>>>>>> Write a Kafka producer, manually maintain a column flag for events,
>>>>>>>> and ingest the data with LinkedIn Gobblin to HDFS / S3.
>>>>>>>>
>>>>>>>> *Approach 2:*
>>>>>>>>
>>>>>>>> Run a scheduled Spark job: read from HBase, do transformations, and
>>>>>>>> maintain the flag column at the HBase level.
>>>>>>>>
>>>>>>>> In both of the above approaches, I need to maintain column-level
>>>>>>>> flags, such as: 0 - default, 1 - sent, 2 - sent and acknowledged.
>>>>>>>> The next time the producer runs, it takes another batch of 1000 rows
>>>>>>>> where the flag is 0 or 1.
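The flag scheme described above (0 = default, 1 = sent, 2 = sent and acknowledged) can be sketched in a few lines. This is an in-memory illustration of the batching logic only, not a real HBase client call; the function and variable names are hypothetical:

```python
def next_batch(rows, batch_size=1000):
    """Pick the next batch of row keys whose flag is 0 or 1 (not acknowledged)."""
    pending = [key for key, flag in sorted(rows.items()) if flag in (0, 1)]
    return pending[:batch_size]

def mark_sent(rows, keys):
    """After sending a batch, promote its rows from flag 0 to flag 1."""
    for key in keys:
        if rows[key] == 0:
            rows[key] = 1

# rows maps row key -> flag value
rows = {"r1": 0, "r2": 1, "r3": 2, "r4": 0}
batch = next_batch(rows, batch_size=2)
print(batch)       # -> ['r1', 'r2']  (flag-2 rows are skipped)
mark_sent(rows, batch)
print(rows["r1"])  # -> 1
```

In either approach the flag column itself would live in HBase, so the update step is a put back to the table rather than an in-place dict write.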
>>>>>>>>
>>>>>>>> I am looking for a best-practice approach with any distributed tool.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> - Chetan Khatri
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
