Re: Adding features to HBase Input Operators in Malhar-contrib

Bhupesh Chawda Fri, 11 Mar 2016 03:13:50 -0800

Hi All,

In the current design of HBase input and output operators, the row key is
hard-coded to be of String type.
I foresee the following issue:


   - In case of numeric keys which are cast to String, *incremental
   read* is problematic. For example, after reading key = 9, we may not be
   able to read any record with, say, key = 8888: even though numerically
   8888 > 9, lexicographically "9" > "8888".
   - This is the case only when data is being written to HBase and
   read from it simultaneously.
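
To make the ordering problem concrete, here is a minimal sketch that uses only
the JDK (no HBase or Malhar classes). The class and method names are made up
for illustration. It assumes row keys are compared as unsigned bytes in
lexicographic order, which is how HBase sorts rows, and it assumes
non-negative keys for the fixed-width encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RowKeyOrdering {
    // A numeric key stored as a String row key: the UTF-8 bytes of its decimal text.
    static byte[] asStringKey(long key) {
        return Long.toString(key).getBytes(StandardCharsets.UTF_8);
    }

    // The same key stored as 8 big-endian bytes (assumes key >= 0).
    static byte[] asLongKey(long key) {
        return ByteBuffer.allocate(Long.BYTES).putLong(key).array();
    }

    // Unsigned lexicographic byte comparison (requires Java 9+).
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        // As Strings, "9" sorts after "8888", so a scan resuming after 9 skips 8888.
        System.out.println("string: 9 after 8888? "
            + (compare(asStringKey(9), asStringKey(8888)) > 0));
        // As fixed-width longs, 9 sorts before 8888, matching numeric order.
        System.out.println("long:   9 before 8888? "
            + (compare(asLongKey(9), asLongKey(8888)) < 0));
    }
}
```

With the fixed-width encoding, byte order and numeric order agree for
non-negative keys, so "resume after the last key read" behaves as expected.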

My suggestion is to parametrize the type of row key in the HBase input and
output operators, and let the user instantiate the required type for row
key. We can have default implementations for String and/or Long. By
parametrizing the row key type, the user can even use complex row keys
which are a combination of multiple fields.
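
A rough sketch of what the parametrization could look like is below. None of
these names exist in Malhar today: RowKeyCodec, the STRING and LONG defaults,
and the EventKey composite are all hypothetical, and the sketch uses only the
JDK. The idea is that the operators would take a codec for type K instead of
assuming String.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RowKeyCodecs {
    // Hypothetical interface: converts a typed row key to and from its byte form.
    interface RowKeyCodec<K> {
        byte[] toBytes(K key);
        K fromBytes(byte[] bytes);
    }

    // Default for String keys: plain UTF-8 bytes.
    static final RowKeyCodec<String> STRING = new RowKeyCodec<String>() {
        public byte[] toBytes(String key) { return key.getBytes(StandardCharsets.UTF_8); }
        public String fromBytes(byte[] bytes) { return new String(bytes, StandardCharsets.UTF_8); }
    };

    // Default for Long keys: 8 big-endian bytes.
    static final RowKeyCodec<Long> LONG = new RowKeyCodec<Long>() {
        public byte[] toBytes(Long key) { return ByteBuffer.allocate(Long.BYTES).putLong(key).array(); }
        public Long fromBytes(byte[] bytes) { return ByteBuffer.wrap(bytes).getLong(); }
    };

    // Example of a user-defined composite key made of two fields.
    static final class EventKey {
        final long timestamp;
        final String source;
        EventKey(long timestamp, String source) { this.timestamp = timestamp; this.source = source; }
    }

    // Composite codec: fixed-width timestamp followed by the UTF-8 source name.
    static final RowKeyCodec<EventKey> EVENT = new RowKeyCodec<EventKey>() {
        public byte[] toBytes(EventKey key) {
            byte[] src = key.source.getBytes(StandardCharsets.UTF_8);
            return ByteBuffer.allocate(Long.BYTES + src.length).putLong(key.timestamp).put(src).array();
        }
        public EventKey fromBytes(byte[] bytes) {
            long ts = ByteBuffer.wrap(bytes).getLong();
            String src = new String(bytes, Long.BYTES, bytes.length - Long.BYTES, StandardCharsets.UTF_8);
            return new EventKey(ts, src);
        }
    };

    public static void main(String[] args) {
        System.out.println(LONG.fromBytes(LONG.toBytes(8888L)));
        EventKey k = EVENT.fromBytes(EVENT.toBytes(new EventKey(42L, "sensor-a")));
        System.out.println(k.timestamp + " " + k.source);
    }
}
```

An operator parametrized this way would only ever call toBytes() and
fromBytes(), so the last-read key can be kept in its natural type.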

Thoughts?

PS: I understand that there is a performance concern in making a
monotonically increasing key as the row key. Given that, how do we address
the incremental read scenario?

Thanks

-Bhupesh

On Wed, Dec 30, 2015 at 7:49 PM, Sandeep Deshmukh <[email protected]>
wrote:

> Looks fine to me.
>
> Regards,
> Sandeep
>
> On Wed, Dec 30, 2015 at 7:34 PM, Bhupesh Chawda <[email protected]>
> wrote:
>
> > Here is the final hierarchy I am considering:
> >
> > HBaseInputOperator - Takes care of HBaseStore and its connection. Got rid
> > of HBaseOperatorBase.
> >     HBaseScanOperator - Takes care of scanning the table in a
> > non-blocking manner. Exposes operationScan() and getTuple() as before.
> >         HBasePOJOInputOperator - Implements operationScan() and
> > getTuple() and outputs a POJO on the output port.
> >
> > Comments?
> >
> > -Bhupesh
> >
> >
> > On Wed, Dec 30, 2015 at 2:52 PM, Bhupesh Chawda <[email protected]
> >
> > wrote:
> >
> > > The class HBaseInputOperator seems to be quite old. HBaseStore seems
> > > to have all the functionality provided by HBaseInputOperator and even
> > > more (including Kerberos authentication).
> > >
> > > It would be a good idea to avoid the usage of HBaseInputOperator going
> > > forward and use HBaseStore instead.
> > >
> > > I will also work on abstracting out the HBase input functionality in
> > > the HBaseInputOperator, which can be extended by concrete
> > > implementations.
> > >
> > > -Bhupesh
> > >
> > > On Wed, Dec 23, 2015 at 7:47 PM, Bhupesh Chawda <
> [email protected]
> > >
> > > wrote:
> > >
> > >> Thanks for the inputs.
> > >> As an input operator, I am targeting just the Scan operation. Get
> > >> operation may be supported better as a generic operator (like a query
> > >> operator) which I can take up later.
> > >>
> > >> -Bhupesh
> > >>
> > >> On Tue, Dec 22, 2015 at 3:48 PM, Mohit Jotwani <[email protected]
> >
> > >> wrote:
> > >>
> > >>> +1
> > >>>
> > >>> Regards,
> > >>> Mohit
> > >>>
> > >>> On Tue, Dec 22, 2015 at 11:21 AM, Chinmay Kolhatkar <
> > >>> [email protected]
> > >>> > wrote:
> > >>>
> > >>> > +1 for above.
> > >>> > I see that there is HbaseGetOperator, but it is abstract and I
> > >>> > cannot find any concrete implementation of it.
> > >>> > Are you going to implement that too?
> > >>> >
> > >>> > Maybe the concrete implementation of HbaseGetOperator should have
> > this.
> > >>> >
> > >>> > Also, I want to mention one thing about scan from my previous
> > >>> experience of
> > >>> > Hbase. The Hbase client is synchronous.
> > >>> > This means that when you fire a scan call, the function blocks
> > >>> > until a certain number of records are received at the client end.
> > >>> > This causes a lot of problems in the current thread, as it might
> > >>> > get blocked for a long period of time.
> > >>> > Plus, there is always network-related latency to add to the
> > >>> > problem.
> > >>> >
> > >>> > Usually the way to deal with this is to fire scan-like queries on a
> > >>> > separate thread and then consume the results in the main thread.
> > >>> >
> > >>> > Please take care of this scenario while implementing the scan
> > >>> > operator.
> > >>> >
> > >>> > -Chinmay.
> > >>> >
> > >>> >
> > >>> > ~ Chinmay.
> > >>> >
> > >>> > On Tue, Dec 22, 2015 at 11:08 AM, Sandeep Deshmukh <
> > >>> > [email protected]>
> > >>> > wrote:
> > >>> >
> > >>> > > +1 for this Bhupesh.
> > >>> > >
> > >>> > > Additionally, I would suggest adding support for:
> > >>> > > 1. Point query
> > >>> > > 2. Returning any row version
> > >>> > >
> > >>> > > The above two are key features of HBase and should be supported.
> > >>> > >
> > >>> > > Regards,
> > >>> > > Sandeep
> > >>> > >
> > >>> > > On Fri, Dec 18, 2015 at 4:39 PM, Bhupesh Chawda <
> > >>> [email protected]
> > >>> > >
> > >>> > > wrote:
> > >>> > >
> > >>> > > > Hi All,
> > >>> > > >
> > >>> > > > The current HBasePOJOInputOperator has the following limitations:
> > >>> > > >
> > >>> > > >    1. It does not let us specify a set of "column family: column"
> > >>> > > >    pairs and fetch data only for these columns.
> > >>> > > >    2. The output format is currently only a POJO. We need other
> > >>> > > >    output formats that support the "columnFamily:column"
> > >>> > > >    representation. Map / CSV are some of the options.
> > >>> > > >    3. It does not allow specifying an "end row-key" to stop
> > >>> > > >    scanning a table.
> > >>> > > >    4. It exposes no metrics.
> > >>> > > >
> > >>> > > > I am planning to add the above functionality to the HBase Input
> > >>> > > > operators. These features may go into the HBaseScanOperator /
> > >>> > > > HBasePOJOInputOperator.
> > >>> > > >
> > >>> > > > Please let me know your comments.
> > >>> > > >
> > >>> > > > Thanks.
> > >>> > > >
> > >>> > > > Bhupesh
> > >>> > > >
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>
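
For reference, the pattern Chinmay describes in the quoted thread (run the
blocking scan on a separate thread and consume results in the main thread)
can be sketched roughly as below. This is JDK-only and illustrative: the
class and method names are hypothetical, and a plain Iterator stands in for
whatever blocking scanner the real operator would wrap. The bounded queue
keeps the scan thread from running arbitrarily far ahead of the consumer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BackgroundScanSketch {
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1024);
    private final ExecutorService scanThread = Executors.newSingleThreadExecutor();
    private volatile boolean done = false;

    // Drain the (possibly blocking) source on a separate thread.
    void start(Iterator<String> blockingSource) {
        scanThread.execute(() -> {
            try {
                while (blockingSource.hasNext()) {
                    // put() blocks only the scan thread when the buffer is full.
                    buffer.put(blockingSource.next());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done = true;
            }
        });
    }

    // Called from the consuming thread; never blocks. Returns null if nothing is ready.
    String pollTuple() {
        return buffer.poll();
    }

    // True once the source is exhausted and every buffered row has been consumed.
    boolean finished() {
        return done && buffer.isEmpty();
    }

    void stop() {
        scanThread.shutdownNow();
    }

    // Helper for the demo: consume everything the source produces, in order.
    static List<String> drainAll(Iterator<String> source) {
        BackgroundScanSketch scan = new BackgroundScanSketch();
        scan.start(source);
        List<String> out = new ArrayList<>();
        while (!scan.finished()) {
            String tuple = scan.pollTuple();
            if (tuple != null) {
                out.add(tuple);
            } else {
                Thread.yield(); // nothing ready yet; do not block the consumer
            }
        }
        scan.stop();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drainAll(Arrays.asList("row1", "row2", "row3").iterator()));
    }
}
```

In an operator, the polling loop would be replaced by a non-blocking
pollTuple() call per emit cycle, so a slow scan never stalls the operator
thread.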
