IMHO you should not "think" HBase in RDBMS terms, but you can use column-value filters (e.g. SingleColumnValueFilter) to pick out the new records, as in the sketch below.
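For illustration, a minimal scan along those lines could look like this (Scala, HBase 1.x client API). The table name "events", the column family "d", the qualifier "flag", and the string-encoded flag values "0"/"1"/"2" are assumptions for the example, not anything from your schema:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

object ScanNewRows {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("events"))

    // Keep rows whose flag sorts below "2", i.e. "0" (new) or "1" (sent).
    val flagFilter = new SingleColumnValueFilter(
      Bytes.toBytes("d"), Bytes.toBytes("flag"),
      CompareOp.LESS, Bytes.toBytes("2"))
    // Rows that have no flag cell yet are still treated as new.
    flagFilter.setFilterIfMissing(false)

    val scan = new Scan()
    scan.setFilter(flagFilter)
    scan.setCaching(1000) // fetch 1000 rows per RPC, matching the batch size

    val scanner = table.getScanner(scan)
    try scanner.asScala.foreach(r => println(Bytes.toString(r.getRow)))
    finally { scanner.close(); table.close(); conn.close() }
  }
}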
On Fri, Jan 6, 2017 at 7:22 PM, Chetan Khatri <chetan.opensou...@gmail.com> wrote:

> Hi Ayan,
>
> By "incremental load from HBase" I mean: a weekly batch job takes rows
> from an HBase table and dumps them out to Hive, and the next time the
> job runs it picks up only the newly arrived rows.
>
> This is the same as using Sqoop for an incremental load from an RDBMS
> to Hive, with a command like the one below:
>
> sqoop job --create myssb1 -- import --connect \
>     jdbc:mysql://<hostname>:<port>/sakila \
>     --username admin --password admin \
>     --driver=com.mysql.jdbc.Driver \
>     --query "SELECT address_id, address, district, city_id, postal_code,
>       alast_update, cityid, city, country_id, clast_update
>       FROM (SELECT a.address_id as address_id, a.address as address,
>         a.district as district, a.city_id as city_id,
>         a.postal_code as postal_code, a.last_update as alast_update,
>         c.city_id as cityid, c.city as city, c.country_id as country_id,
>         c.last_update as clast_update
>         FROM sakila.address a INNER JOIN sakila.city c
>         ON a.city_id=c.city_id) as sub
>       WHERE $CONDITIONS" \
>     --incremental lastmodified --check-column alast_update \
>     --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 \
>     --hive-import --hive-table test.sakila -m 1 \
>     --hive-drop-import-delims --map-column-java address=String
>
> I am probably looking for a tool from the HBase / incubator family that
> does this job for me; alternatively, the same thing can be done by
> reading the HBase table into an RDD and saving the RDD to Hive (a rough
> sketch of that route appears below, after the thread).
>
> Thanks.
>
>
> On Thu, Jan 5, 2017 at 2:02 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi Chetan,
>>
>> What do you mean by incremental load from HBase? There is a timestamp
>> marker for each cell, but not at the row level.
>>
>> On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Ted Yu,
>>>
>>> You understood wrong: I said incremental load from HBase to Hive;
>>> put differently, you could call it incremental import from HBase.
>>>
>>> On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> "Incremental load" traditionally means generating HFiles and using
>>>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
>>>> data into HBase.
>>>>
>>>> For your use case, the producer needs to find rows where the flag is
>>>> 0 or 1. After such rows are obtained, it is up to you how the result
>>>> of processing is delivered to HBase.
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
>>>> chetan.opensou...@gmail.com> wrote:
>>>>
>>>>> OK, sure, will ask.
>>>>>
>>>>> But what would be a generic, best-practice solution for incremental
>>>>> load from HBase?
>>>>>
>>>>> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>>> I haven't used Gobblin. You can consider asking the Gobblin mailing
>>>>>> list about the first option.
>>>>>>
>>>>>> The second option would work.
>>>>>>
>>>>>>
>>>>>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
>>>>>> chetan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello guys,
>>>>>>>
>>>>>>> I would like to understand the different approaches for distributed
>>>>>>> incremental load from HBase. Is there any *tool / incubator tool*
>>>>>>> which satisfies the requirement?
>>>>>>>
>>>>>>> *Approach 1:*
>>>>>>>
>>>>>>> Write a Kafka producer, manually maintain a column flag for events,
>>>>>>> and ingest it with LinkedIn Gobblin to HDFS / S3.
>>>>>>>
>>>>>>> *Approach 2:*
>>>>>>>
>>>>>>> Run a scheduled Spark job - read from HBase, do the transformations,
>>>>>>> and maintain the flag column at the HBase level.
>>>>>>>
>>>>>>> In both approaches I need to maintain column-level flags, such as
>>>>>>> 0 = default, 1 = sent, 2 = sent and acknowledged, so that next
>>>>>>> time the producer takes the next batch of 1000 rows where the flag
>>>>>>> is 0 or 1.
>>>>>>>
>>>>>>> I am looking for a best-practice approach with any distributed
>>>>>>> tool.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> - Chetan Khatri
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>

--
Best Regards,
Ayan Guha
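For reference, the Spark route mentioned above (read the HBase table into
an RDD, save it to Hive, and maintain the flag column back in HBase) might
look roughly like the sketch below. It assumes Spark 2.x with Hive support
and the HBase 1.x APIs; the table name "events", column family "d",
qualifiers "flag" / "payload", and the Hive table "test.events_inc" are all
illustrative placeholders, not anything from the thread:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put, Result, Scan}
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object HBaseToHiveIncremental {
  val CF   = Bytes.toBytes("d")
  val FLAG = Bytes.toBytes("flag")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hbase-to-hive-incremental")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Push the flag predicate into the scan so only flag < "2"
    // (0 = new, 1 = sent) rows ever leave the region servers.
    val scan = new Scan()
    scan.setFilter(new SingleColumnValueFilter(
      CF, FLAG, CompareOp.LESS, Bytes.toBytes("2")))

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "events")
    hbaseConf.set(TableInputFormat.SCAN,
      TableMapReduceUtil.convertScanToString(scan))

    val rows = spark.sparkContext
      .newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
        classOf[ImmutableBytesWritable], classOf[Result])
      .map { case (_, r) =>
        (Bytes.toString(r.getRow),
         Bytes.toString(r.getValue(CF, Bytes.toBytes("payload"))))
      }
      .cache() // reused: once for the Hive write, once for the flag update

    // Dump this increment into Hive.
    rows.toDF("rowkey", "payload")
      .write.mode("append").saveAsTable("test.events_inc")

    // Mark the exported rows as sent (flag = 1) back in HBase.
    rows.foreachPartition { part =>
      val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("events"))
      part.foreach { case (key, _) =>
        val put = new Put(Bytes.toBytes(key))
        put.addColumn(CF, FLAG, Bytes.toBytes("1"))
        table.put(put)
      }
      table.close(); conn.close()
    }
    spark.stop()
  }
}

Note there is no transactionality between the Hive write and the flag
update, so a failed run can leave rows exported but still flagged 0; the
flag update is idempotent, so re-running the job is safe apart from the
duplicate append to Hive.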