I've been doing a bit of work on this old project. Added a doc of notes that came of revisiting this effort to the HBASE-18405 issue. My first step is moving the hbase-spark* modules out of the hbase core repo to sit under hbase-connectors (HBASE-21443). Moving out of core will help w/ some of the items raised above.
Will be back.
S

On Wed, Jun 21, 2017 at 9:31 AM Sean Busbey <bus...@apache.org> wrote:

> Hi Folks!
>
> We've had integration with Apache Spark lingering in trunk for quite some time, and I'd like us to push towards firming it up. I'm going to try to cover a lot of ground below, so feel free to respond to just pieces and I'll write up a doc on things afterwards.
>
> For background, the hbase-spark module currently exists in trunk, branch-2, and the 2.0.0-alpha-1 release. Importantly, it has been in no "ready to use" release so far. It's been in master for ~2 years and has had a total of nearly 70 incremental changes. Right now it shows up in Stack's excellent state of 2.0 doc ( https://s.apache.org/1mB4 ) as a nice-to-have. I'd like to get some consensus on either getting it into release trains or officially moving it out of scope for 2.0.
>
> ----
>
> 1) Branch-1 releases
>
> In July 2015 we started tracking what kind of polish was needed for this code to make it into our downstream-facing release lines in HBASE-14160. Personally, I think if the module isn't ready for a branch-1 release then it shouldn't be in a branch-2 release either.
>
> The only things still tracked as required are some form of published API docs (HBASE-17766) and an IT that we can run (HBASE-18175). Our Yi Liang has been working on both of these, and I think we have a good start on them.
>
> Is there anything else we ought to be tracking here? I notice the umbrella "make the connector better" issue (HBASE-14789) has only composite row key support still open (HBASE-15335). It looks like that work stalled out last summer after an admirable effort by our Zhan Zhang. Can this wait for a future minor release?
>
> Personally, I'd like to see HBASE-17766 and HBASE-18175 closed out and then our existing support backported to branch-1 in time for whenever we get HBase 1.4 started.
>
> 2) What Spark version(s) do we care about?
>
> The hbase-spark module originally started with support for Spark 1.3. It currently sits at supporting just 1.6. Our Ted Yu has been dutifully trying to find consensus on how we handle Spark 2.0 over in HBASE-16179 for nearly a year.
>
> AFAICT the Spark community has no more notion of what version(s) their downstream users are relying on than we do. It appears that Spark 1.6 will be their last 1.y release, and at least the dev community is largely moving on to 2.y releases now.
>
> What version(s) do we want to handle and thus encourage our downstream folks to use?
>
> Just as a point of reference, Spark 1.6 doesn't have any proper handling of delegation tokens, and our current do-it-ourselves workaround breaks in the presence of the support introduced in Spark 2.
>
> The way I see it, the options are a) ship both 1.6 and 2.y support, b) ship just 2.y support, c) ship 1.6 in branch-1 and ship 2.y in branch-2. Does anyone have preferences here?
>
> Personally, I think I favor option b for simplicity, though I don't care for more possible delay in getting stuff out in branch-1. Probably option a would be best for our downstreamers.
>
> Related: while we've been going around on HBASE-16179, the Apache Spark community started shipping 2.1 releases and is now in the process of finalizing 2.2. Do we need to do anything different for these versions?
>
> Spark's versioning policy suggests "not unless we want to support newer APIs or use alpha stuff". But I don't have practical experience with how this plays out yet.
>
> http://spark.apache.org/versioning-policy.html
>
>
> 3) What scala version(s) do we care about?
>
> For those who aren't aware, Scala compatibility is a nightmare. Since Scala is still the primary language for implementation of Spark jobs, we have to care more about this than I'd like.
> (the only way out, I think, would be to implement our integration entirely in some other JVM language)
>
> The short version is that each minor version of scala (we care about) is mutually incompatible with all others. Right now both Spark 1.6 and Spark 2.y work with each of Scala 2.10 and 2.11. There's talk of adding support for Scala 2.12, but it will not happen until after Spark 2.2.
>
> (for those looking for a thread on Scala versions in Spark, I think this is the most recent: https://s.apache.org/IW4D )
>
> Personally, I think we serve our downstreamers best when we ship artifacts that work with each of the scala versions a given version of Spark supports. It's painful to have to do something like upgrade your scala version just because the storage layer you want to use requires a particular version. It's also painful to have to rebuild artifacts because that layer only offers support for the scala version you like as a DIY option.
>
> The happy part of this situation is that the problem, as exposed to us, is at a byte code level and not a source issue. So probably we can support multiple scala versions just by rebuilding the same source against different library versions.
>
> 4) Packaging all this probably will be a pain no matter what we do
>
> One of the key points of contention on HBASE-16179 is around module layout given X versions of Spark and Y versions of Scala.
>
> As things are in master and branch-2 now, we support exactly Spark 1.6 on Scala 2.10. It would certainly be easiest to continue to just pick one Spark X and one Scala Y. Ted can correct me, but I believe the most recent state of HBASE-16179 does the full enumeration but only places a single artifact in the assembly (thus making that combination the blessed default). Now that we have precedent for client-specific libraries in the assembly (i.e.
> the jruby libs are kept off to the side and only included in classpaths that need them, like the shell), I think we could do a better job of making sure libraries are deployed regardless of which spark and scala combination is present on a cluster.
>
> As a downstream user, I would want to make sure I can add a dependency to my maven project that will work for my particular spark/scala choice. I definitely don't want to have to run my own nexus instance so that I can build my own hbase-spark client module reliably.
>
> As a release manager, I don't want to have to run O(X * Y) builds just so we get the right set of maven artifacts.
>
> All of these personal opinions stated, what do others think?
>
>
> 5) Do we have the right collection of Spark API(s)?
>
> Spark has a large variety of APIs for interacting with data. Here are pointers to the big ones.
>
> RDDs (essentially in-memory tabular data):
> https://spark.apache.org/docs/latest/programming-guide.html
>
> Streaming (essentially a series of the above over time):
> https://spark.apache.org/docs/latest/streaming-programming-guide.html
>
> Datasets/Dataframes (sql-oriented structured data processing that exposes computation info to the storage layer):
> https://spark.apache.org/docs/latest/sql-programming-guide.html
>
> Structured Streaming (essentially a series of the above over time):
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>
> Right now we have support for the first three, more or less. Structured Streaming is alpha as of Spark 2.1 and is expected to be GA for Spark 2.2.
>
> Going forward, do we want our plan to be robust support for all of these APIs? Would we be better off focusing solely on the newer bits like dataframes?
>
> 6) What about the SHC project?
>
> In case you didn't see the excellent talk at HBaseCon from Weiqing Yang: she's been maintaining a high quality integration library between HBase and Spark.
>
> HBaseCon West 2017 slides: https://s.apache.org/IQMA
> Blog: https://s.apache.org/m1bc
> Repo: https://github.com/hortonworks-spark/shc
>
> I'd love to see us encourage the SHC devs to fold their work into participation in our wider community. Before approaching them about that, I think we need to make sure we share goals and can give them reasonable expectations about release cadence (which probably means making it into branch-1).
>
> Right now, I'd only consider the things that have made it to our docs to be "done". Here's the relevant section of the ref guide:
>
> http://hbase.apache.org/book.html#spark
>
> Comparing our current offering and the above, I'd say the big gaps between our offering and the SHC project are:
>
> * Avro serialization (we have this implemented, but documentation is limited to an example in the section on SparkSQL support)
> * Composite keys (as mentioned above, we have a start on this)
> * More robust handling of delegation tokens, i.e. in the presence of multiple secure clusters
> * Handling of Phoenix-encoded data
>
> Are these all things we'd want available to our downstream folks?
>
> Personally, I think we'd serve our downstream folks well by closing all of these gaps. I don't think they ought to be blockers on getting our integration into releases; at first glance, none of them look like they'd present compatibility issues.
>
> We'd need to figure out what to do about the phoenix encoding bit, dependency-wise. Ideally we'd get the phoenix folks to isolate their data encoding into a standalone artifact. I'm not sure how much effort that will be, but I'd be happy to take the suggestion over to them.
>
> ---
>
> Thanks to everyone who made it all the way down here. That's the end of what I could think of after reflecting on this for a couple of days (thanks to our Mike Drob for bearing the brunt of my in-progress ramblings).
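To make the packaging concern in (4) concrete: Spark itself handles the Scala-incompatibility problem by suffixing each artifactId with the Scala binary version (e.g. spark-core_2.10 vs spark-core_2.11), so a downstream Maven user simply picks the artifact matching their Scala choice. A sketch of what the same convention could look like here; the hbase-spark_2.11 artifactId is hypothetical, as no such suffixed artifact had been published at the time of this thread:

```xml
<!-- Hypothetical sketch: scala-version-suffixed artifact, following
     Spark's own naming convention. Not a published HBase artifact. -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-spark_2.11</artifactId>
  <version>${hbase.version}</version>
</dependency>
```

This is what makes option (a)'s "full enumeration" painful for release managers: each supported Spark/Scala pair means another rebuild of the same source against different library versions, exactly as described in (3).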
>
> I know this is a wide variety of things; again, feel free to just respond in pieces to the sections that strike your fancy. I'll make sure we have a doc with a good summary of whatever consensus we reach and post it here, on the website, and/or in JIRA once we've had a while for folks to contribute.
>
> -busbey
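As background on the composite row key gap (HBASE-15335) mentioned in sections (1) and (6): the idea is to pack several typed fields into a single row key so that HBase's unsigned lexicographic byte ordering still sorts usefully, which in turn lets a SQL filter on the leading fields be pushed down as a row-key range scan. A minimal, self-contained sketch of the encoding trick using plain java.nio rather than any actual hbase-spark API; all names here are illustrative:

```java
import java.nio.ByteBuffer;

// Illustrative sketch (not hbase-spark API): pack fixed-width fields into
// one row key so that unsigned lexicographic byte order sorts rows by
// (metricId, timestamp).
public class CompositeKeySketch {

    // Flip the sign bit so signed values compare correctly when their
    // bytes are compared as unsigned, which is how HBase orders row keys.
    static long orderPreserving(long v) {
        return v ^ Long.MIN_VALUE;
    }

    static byte[] encode(int metricId, long timestamp) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Long.BYTES);
        buf.putInt(metricId ^ Integer.MIN_VALUE); // same sign-flip for ints
        buf.putLong(orderPreserving(timestamp));
        return buf.array();
    }

    // Unsigned lexicographic comparison, mirroring HBase's row key order.
    static int compare(byte[] x, byte[] y) {
        for (int i = 0; i < Math.min(x.length, y.length); i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) return d;
        }
        return x.length - y.length;
    }
}
```

With this layout, a query predicate like "metricId = 42" maps onto the contiguous key range from encode(42, Long.MIN_VALUE) to encode(42, Long.MAX_VALUE), so the storage layer can serve it with a single range scan instead of a full table scan. That range mapping is the computation info the Dataframes API in (5) can expose to the storage layer.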