Re: Approach: Incremental data load from HBASE
IMHO you should not "think" HBase in RDBMS terms, but you can use column filters to filter out new records.

On Fri, Jan 6, 2017 at 7:22 PM, Chetan Khatri wrote:

> Hi Ayan,
>
> By incremental load from HBase I mean that a weekly batch job takes rows
> from an HBase table and dumps them out to Hive; the next run of the job
> should pick up only the newly arrived rows.

--
Best Regards,
Ayan Guha
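A minimal sketch of that kind of incremental scan with the HBase 1.x Java client, assuming a hypothetical "events" table (the table name and watermark value are made up for illustration); here the "new records" criterion is the scan time range rather than a value filter:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    object IncrementalScan {
      def main(args: Array[String]): Unit = {
        val conf = HBaseConfiguration.create()
        val connection = ConnectionFactory.createConnection(conf)
        val table = connection.getTable(TableName.valueOf("events")) // hypothetical table

        // Only return cells written since the last batch run; the lower bound
        // would come from wherever the previous job recorded its watermark.
        val lastRunTs = 1483660800000L // hypothetical watermark (epoch millis)
        val scan = new Scan()
        scan.setTimeRange(lastRunTs, Long.MaxValue)
        scan.setCaching(1000) // rows fetched per RPC

        val scanner = table.getScanner(scan)
        try {
          scanner.asScala.foreach(r => println(Bytes.toString(r.getRow)))
        } finally {
          scanner.close()
          table.close()
          connection.close()
        }
      }
    }

A SingleColumnValueFilter on a flag column (as discussed further down the thread) would slot into the same scan via scan.setFilter.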
Re: Approach: Incremental data load from HBASE
Ayan, thanks. Correct, I am not thinking in RDBMS terms; I am wearing NoSQL glasses!

On Fri, Jan 6, 2017 at 3:23 PM, ayan guha wrote:

> IMHO you should not "think" HBase in RDBMS terms, but you can use column
> filters to filter out new records.
Re: Approach: Incremental data load from HBASE
Hi Ayan,

By incremental load from HBase I mean that a weekly batch job takes rows from an HBase table and dumps them out to Hive; the next time the job runs, it should pick up only the newly arrived rows.

This is the same as using Sqoop for an incremental load from an RDBMS to Hive, with a command like the one below:

sqoop job --create myssb1 -- import \
  --connect jdbc:mysql://:/sakila --username admin --password admin \
  --driver=com.mysql.jdbc.Driver \
  --query "SELECT address_id, address, district, city_id, postal_code,
           alast_update, cityid, city, country_id, clast_update
           FROM (SELECT a.address_id AS address_id, a.address AS address,
                        a.district AS district, a.city_id AS city_id,
                        a.postal_code AS postal_code,
                        a.last_update AS alast_update,
                        c.city_id AS cityid, c.city AS city,
                        c.country_id AS country_id,
                        c.last_update AS clast_update
                 FROM sakila.address a
                 INNER JOIN sakila.city c ON a.city_id = c.city_id) AS sub
           WHERE \$CONDITIONS" \
  --incremental lastmodified --check-column alast_update --last-value 1900-01-01 \
  --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila \
  -m 1 --hive-drop-import-delims --map-column-java address=String

Probably I am looking for a tool from the HBase / incubator family which does the job for me; alternatively, this can be done by reading HBase tables into an RDD and saving the RDD to Hive.

Thanks.

On Thu, Jan 5, 2017 at 2:02 AM, ayan guha wrote:

> Hi Chetan
>
> What do you mean by incremental load from HBase? There is a timestamp
> marker for each cell, but not at row level.
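For the "read HBase tables into an RDD and save to Hive" route, a rough sketch with Spark 2.x (Hive support enabled) and HBase's TableInputFormat might look like the following; the table name, column family, and Hive table are hypothetical, and the watermark handling is only indicative:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.sql.SparkSession

    object HBaseToHive {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hbase-incremental-to-hive")
          .enableHiveSupport()
          .getOrCreate()

        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "events") // hypothetical table
        // TableInputFormat honours a scan time range, so only cells written
        // after the last recorded watermark are read.
        hbaseConf.set(TableInputFormat.SCAN_TIMERANGE_START, "1483660800000")
        hbaseConf.set(TableInputFormat.SCAN_TIMERANGE_END, Long.MaxValue.toString)

        val rdd = spark.sparkContext.newAPIHadoopRDD(
          hbaseConf,
          classOf[TableInputFormat],
          classOf[ImmutableBytesWritable],
          classOf[Result])

        import spark.implicits._
        // Convert HBase Results to plain tuples before building the DataFrame.
        val rows = rdd.map { case (_, result) =>
          val rowKey = Bytes.toString(result.getRow)
          val payload = Bytes.toString(
            result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("payload"))) // hypothetical column
          (rowKey, payload)
        }.toDF("row_key", "payload")

        rows.write.mode("append").saveAsTable("test.events_inc") // appends the delta to Hive
      }
    }

Appending with saveAsTable assumes the Hive table's schema matches; a real job would also persist the new watermark after a successful write.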
Re: Approach: Incremental data load from HBASE
Hi Chetan

What do you mean by incremental load from HBase? There is a timestamp marker for each cell, but not at row level.

On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri wrote:

> Ted Yu,
>
> You understood wrong: I said incremental load from HBase to Hive; taken
> on its own, you can call it incremental import from HBase.

--
Best Regards,
Ayan Guha
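To illustrate that point with a sketch (HBase 1.x client, made-up table and row key): each cell carries its own write timestamp, so a "row modified time" has to be derived, e.g. as the maximum timestamp across the row's cells.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes

    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("events")) // hypothetical table

    val result = table.get(new Get(Bytes.toBytes("row-001"))) // hypothetical row key
    // rawCells() exposes one timestamp per (column, version), not one per row.
    val rowModifiedAt = result.rawCells().map(_.getTimestamp).max
    println(s"latest cell timestamp for row-001: $rowModifiedAt")

    table.close()
    connection.close()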
Re: Approach: Incremental data load from HBASE
Ted Yu,

You understood wrong: I said incremental load from HBase to Hive; taken on its own, you can call it incremental import from HBase.

On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu wrote:

> Incremental load traditionally means generating HFiles and using
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the data
> into HBase.
Re: Approach: Incremental data load from HBASE
Ted, correct. In my case I want incremental import from HBase and incremental load to Hive. Both of the approaches discussed earlier, with indexing, seem accurate to me. But just as Sqoop supports incremental import and load for an RDBMS, is there any tool which supports incremental import from HBase?

On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu wrote:

> Incremental load traditionally means generating HFiles and using
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the data
> into HBase.
Re: Approach: Incremental data load from HBASE
Incremental load traditionally means generating HFiles and using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the data into HBase.

For your use case, the producer needs to find rows where the flag is 0 or 1. After such rows are obtained, it is up to you how the result of processing is delivered to HBase.

Cheers

On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri wrote:

> Ok, sure, I will ask.
>
> But what would be a generic best-practice solution for incremental load
> from HBase?
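A sketch of that "find rows where the flag is 0 or 1" scan with the HBase 1.x client, assuming a hypothetical meta:flag column holding the integer status codes from earlier in the thread:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
    import org.apache.hadoop.hbase.filter.{CompareFilter, FilterList, SingleColumnValueFilter}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("events")) // hypothetical table

    // Match rows whose flag cell is 0 (pending) OR 1 (sent but unacknowledged).
    def flagIs(v: Int): SingleColumnValueFilter = {
      val f = new SingleColumnValueFilter(
        Bytes.toBytes("meta"), Bytes.toBytes("flag"), // hypothetical column
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes(v))
      f.setFilterIfMissing(true) // skip rows that have no flag cell at all
      f
    }
    val filter = new FilterList(FilterList.Operator.MUST_PASS_ONE, flagIs(0), flagIs(1))

    val scanner = table.getScanner(new Scan().setFilter(filter))
    try scanner.asScala.foreach(r => println(Bytes.toString(r.getRow)))
    finally { scanner.close(); table.close(); connection.close() }

Capping the batch at 1000 rows could be approximated by adding a PageFilter to the FilterList, with the caveat that PageFilter limits rows per region server rather than globally.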
Re: Approach: Incremental data load from HBASE
Ok, sure, I will ask.

But what would be a generic best-practice solution for incremental load from HBase?

On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu wrote:

> I haven't used Gobblin.
> You can consider asking the Gobblin mailing list about the first option.
>
> The second option would work.
Re: Approach: Incremental data load from HBASE
I haven't used Gobblin.
You can consider asking the Gobblin mailing list about the first option.

The second option would work.

On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri wrote:

> Hello guys,
>
> I would like to understand different approaches for distributed
> incremental load from HBase. Is there any *tool / incubator tool* which
> satisfies the requirement?
>
> *Approach 1:*
>
> Write a Kafka producer, manually maintain a column flag for events, and
> ingest it with LinkedIn Gobblin to HDFS / S3.
>
> *Approach 2:*
>
> Run a scheduled Spark job: read from HBase, do transformations, and
> maintain the flag column at the HBase level.
>
> In both approaches above, I need to maintain column-level flags, such as
> 0 - default, 1 - sent, 2 - sent and acknowledged. So the next time, the
> producer will take another batch of 1000 rows where the flag is 0 or 1.
>
> I am looking for a best-practice approach with any distributed tool.
>
> Thanks.
>
> - Chetan Khatri
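One way the flag maintenance common to both approaches could look, assuming the same hypothetical meta:flag column: after a batch has been shipped downstream, the producer flips the flag from 0 (pending) to 1 (sent), and an acknowledgement path would later set it to 2.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("events")) // hypothetical table

    val sentRowKeys: Seq[String] = Seq("row-001", "row-002") // keys from the processed batch
    val puts = sentRowKeys.map { key =>
      val put = new Put(Bytes.toBytes(key))
      // Mark the row as sent; an ack handler would later write 2 here.
      put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("flag"), Bytes.toBytes(1))
      put
    }
    table.put(puts.asJava) // one round of batched writes

    table.close()
    connection.close()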