On Fri, Jun 23, 2017 at 12:06 PM, Stack <[email protected]> wrote:

> On Wed, Jun 21, 2017 at 9:31 AM, Sean Busbey <[email protected]> wrote:
>
>> 2) What Spark version(s) do we care about?
>> ...
>>
>> What version(s) do we want to handle and thus encourage our downstream
>> folks to use?
>>
> ..
>
>> Personally, I think I favor option b for simplicity, though I don't
>> care for more possible delay in getting stuff out in branch-1.
>> Probably option a would be best for our downstreamers.
>>
>
> Let's do option b.) well. If demand and contribs, let's consider adding
> 1.6 support.
>
This is the chorus I'm hearing. :)

>> 6) What about the SHC project?
>>
>> In case you didn’t see the excellent talk at HBaseCon from Weiqing
>> Yang, she’s been maintaining a high quality integration library
>> between HBase and Spark.
>>
>> HBaseCon West 2017 slides: https://s.apache.org/IQMA
>> Blog: https://s.apache.org/m1bc
>> Repo: https://github.com/hortonworks-spark/shc
>>
>> I’d love to see us encourage the SHC devs to fold their work into
>> participation in our wider community. Before approaching them about
>> that, I think we need to make sure we share goals and can give them
>> reasonable expectations about release cadence (which probably means
>> making it into branch-1).
>>
>
> I pinged Weiqing; my guess is she has an opinion on your swath here.
>

Been going for a few days here and no obvious points of contention; would
love to get her take on things.

>> Right now, I’d only consider the things that have made it to our docs
>> to be “done”. Here’s the relevant section of the ref guide:
>>
>> http://hbase.apache.org/book.html#spark
>>
>> Comparing our current offering and the above, I’d say the big gaps
>> between our offering and the SHC project are:
>>
>> * Avro serialization (we have this implemented but documentation is
>> limited to an example in the section on SparkSQL support)
>> * Composite keys (as mentioned above, we have a start to this)
>> * More robust handling of delegation tokens, i.e. in presence of
>> multiple secure clusters
>> * Handling of Phoenix encoded data
>>
>> Are these all things we’d want available to our downstream folks?
>>
>
> I don't know enough about the integration but is the 'handling of Phoenix
> encoded data' about mapping spark types to a serialization in hbase? If
> not, where is the need for seamless transforms between spark types and a
> natural hbase serialization listed? We need this IIRC.
>

It's a subtask, really. We already have a pluggable system for mapping
between spark types and a couple of serialization options (the docs need
improvement?). SHC has its own pluggable system with the addition of a
Phoenix encoding. The set seems like the most likely out-of-the-box
formats folks might have something in. (Maybe Kite? I think it's different
than the rest.) Or are you saying we can just map all of it to the
hbase-common "types" and then do the pluggable part under it?
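For folks following along, here's roughly what that pluggable layer looks
like from the SHC side, going off their README. A hedged sketch: the table
name and column layout are made up, and I'm assuming the Phoenix coder is
selected by spelling "tableCoder":"Phoenix" in the catalog JSON (their
published examples use "PrimitiveType").

```scala
// Sketch based on the SHC README (https://github.com/hortonworks-spark/shc).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Catalog JSON mapping Spark SQL columns onto HBase cells. The "tableCoder"
// field is where the pluggable encoding surfaces; swapping "PrimitiveType"
// for "Phoenix" is (as I understand it) how you read Phoenix-encoded data.
// "shcExampleTable" and the columns below are hypothetical.
val catalog = s"""{
  |"table":{"namespace":"default", "name":"shcExampleTable",
  |         "tableCoder":"Phoenix"},
  |"rowkey":"key",
  |"columns":{
    |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"cf1", "col":"col1", "type":"int"}
  |}
|}""".stripMargin

val spark = SparkSession.builder().appName("shc-sketch").getOrCreate()

// Reads come back as an ordinary DataFrame; the coder handles the
// byte[] <-> Spark type transforms underneath.
val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
```

The point being: the encoding choice is a one-field knob in the catalog
rather than anything baked into the read path, which is why I think folding
a Phoenix coder into our own pluggable system is a subtask rather than a
redesign.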
