I've been doing a bit of work on this old project. Added a doc of notes that came of revisiting this effort to the HBASE-18405 issue. My first step is moving the hbase-spark* modules out of the hbase core repo to sit under hbase-connectors (HBASE-21443). Moving out of core will help w/ some of the items raised above.
Will be back.
S

On Wed, Jun 21, 2017 at 9:31 AM Sean Busbey <bus...@apache.org> wrote:

> Hi Folks!
>
> We've had integration with Apache Spark lingering in trunk for quite some time, and I'd like us to push towards firming it up. I'm going to try to cover a lot of ground below, so feel free to respond to just pieces and I'll write up a doc on things afterwards.
>
> For background, the hbase-spark module currently exists in trunk, branch-2, and the 2.0.0-alpha-1 release. Importantly, it has been in no "ready to use" release so far. It's been in master for ~2 years and has had a total of nearly 70 incremental changes. Right now it shows up in Stack's excellent state of 2.0 doc ( https://s.apache.org/1mB4 ) as a nice-to-have. I'd like to get some consensus on either getting it into release trains or officially moving it out of scope for 2.0.
>
> ----
>
> 1) Branch-1 releases
>
> In July 2015 we started tracking what kind of polish was needed for this code to make it into our downstream-facing release lines in HBASE-14160. Personally, I think if the module isn't ready for a branch-1 release then it shouldn't be in a branch-2 release either.
>
> The only things still tracked as required are some form of published API docs (HBASE-17766) and an IT that we can run (HBASE-18175). Our Yi Liang has been working on both of these, and I think we have a good start on them.
>
> Is there anything else we ought to be tracking here? I notice the umbrella "make the connector better" issue (HBASE-14789) has only composite row key support still open (HBASE-15335). It looks like that work stalled out last summer after an admirable effort by our Zhan Zhang. Can this wait for a future minor release?
>
> Personally, I'd like to see HBASE-17766 and HBASE-18175 closed out and then our existing support backported to branch-1 in time for whenever we get HBase 1.4 started.
>
> 2) What Spark version(s) do we care about?
>
> The hbase-spark module originally started with support for Spark 1.3. It currently sits at supporting just 1.6. Our Ted Yu has been dutifully trying to find consensus on how we handle Spark 2.0 over in HBASE-16179 for nearly a year.
>
> AFAICT the Spark community has no more notion of what version(s) their downstream users are relying on than we do. It appears that Spark 1.6 will be their last 1.y release, and at least the dev community is largely moving on to 2.y releases now.
>
> What version(s) do we want to handle and thus encourage our downstream folks to use?
>
> Just as a point of reference, Spark 1.6 doesn't have any proper handling of delegation tokens, and our current do-it-ourselves workaround breaks in the presence of the support introduced in Spark 2.
>
> The way I see it, the options are a) ship both 1.6 and 2.y support, b) ship just 2.y support, c) ship 1.6 in branch-1 and ship 2.y in branch-2. Does anyone have preferences here?
>
> Personally, I think I favor option b for simplicity, though I don't care for more possible delay in getting stuff out in branch-1. Probably option a would be best for our downstreamers.
>
> Related: while we've been going around on HBASE-16179, the Apache Spark community started shipping 2.1 releases and is now in the process of finalizing 2.2. Do we need to do anything different for these versions?
>
> Spark's versioning policy suggests "not unless we want to support newer APIs or use alpha stuff". But I don't have practical experience with how this plays out yet.
>
> http://spark.apache.org/versioning-policy.html
>
>
> 3) What scala version(s) do we care about?
>
> For those who aren't aware, Scala compatibility is a nightmare. Since Scala is still the primary language for implementation of Spark jobs, we have to care more about this than I'd like.
> (the only way out, I think, would be to implement our integration entirely in some other JVM language)
>
> The short version is that each minor version of scala (we care about) is mutually incompatible with all others. Right now both Spark 1.6 and Spark 2.y work with each of Scala 2.10 and 2.11. There's talk of adding support for Scala 2.12, but it will not happen until after Spark 2.2.
>
> (for those looking for a thread on Scala versions in Spark, I think this is the most recent: https://s.apache.org/IW4D )
>
> Personally, I think we serve our downstreamers best when we ship artifacts that work with each of the scala versions a given version of Spark supports. It's painful to have to do something like upgrade your scala version just because the storage layer you want to use requires a particular version. It's also painful to have to rebuild artifacts because that layer only offers support for the scala version you like as a DIY option.
>
> The happy part of this situation is that the problem, as exposed to us, is at a byte code level and not a source issue. So probably we can support multiple scala versions just by rebuilding the same source against different library versions.
>
> 4) Packaging all this probably will be a pain no matter what we do
>
> One of the key points of contention on HBASE-16179 is around module layout given X versions of Spark and Y versions of Scala.
>
> As things are in master and branch-2 now, we support exactly Spark 1.6 on Scala 2.10. It would certainly be easiest to continue to just pick one Spark X and one Scala Y. Ted can correct me, but I believe the most recent state of HBASE-16179 does the full enumeration but only places a single artifact in the assembly (thus making that combination the blessed default). Now that we have precedent for client-specific libraries in the assembly (i.e.
> the jruby libs are kept off to the side and only included in classpaths that need them, like the shell), I think we could do a better job of making sure libraries are deployed regardless of which spark and scala combination is present on a cluster.
>
> As a downstream user, I would want to make sure I can add a dependency to my maven project that will work for my particular spark/scala choice. I definitely don't want to have to run my own nexus instance so that I can build my own hbase-spark client module reliably.
>
> As a release manager, I don't want to have to run O(X * Y) builds just so we get the right set of maven artifacts.
>
> All of these personal opinions stated, what do others think?
>
>
> 5) Do we have the right collection of Spark API(s)?
>
> Spark has a large variety of APIs for interacting with data. Here are pointers to the big ones.
>
> RDDs (essentially in-memory tabular data):
> https://spark.apache.org/docs/latest/programming-guide.html
>
> Streaming (essentially a series of the above over time):
> https://spark.apache.org/docs/latest/streaming-programming-guide.html
>
> Datasets/Dataframes (sql-oriented structured data processing that exposes computation info to the storage layer):
> https://spark.apache.org/docs/latest/sql-programming-guide.html
>
> Structured Streaming (essentially a series of the above over time):
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>
> Right now we have support for the first three, more or less. Structured Streaming is alpha as of Spark 2.1 and is expected to be GA for Spark 2.2.
>
> Going forward, do we want our plan to be robust support for all of these APIs? Would we be better off focusing solely on the newer bits like dataframes?
>
> 6) What about the SHC project?
>
> In case you didn't see the excellent talk at HBaseCon from Weiqing Yang: she's been maintaining a high quality integration library between HBase and Spark.
>
> HBaseCon West 2017 slides: https://s.apache.org/IQMA
> Blog: https://s.apache.org/m1bc
> Repo: https://github.com/hortonworks-spark/shc
>
> I'd love to see us encourage the SHC devs to fold their work into participation in our wider community. Before approaching them about that, I think we need to make sure we share goals and can give them reasonable expectations about release cadence (which probably means making it into branch-1).
>
> Right now, I'd only consider the things that have made it to our docs to be "done". Here's the relevant section of the ref guide:
>
> http://hbase.apache.org/book.html#spark
>
> Comparing our current offering and the above, I'd say the big gaps between our offering and the SHC project are:
>
> * Avro serialization (we have this implemented, but documentation is limited to an example in the section on SparkSQL support)
> * Composite keys (as mentioned above, we have a start on this)
> * More robust handling of delegation tokens, i.e. in the presence of multiple secure clusters
> * Handling of Phoenix-encoded data
>
> Are these all things we'd want available to our downstream folks?
>
> Personally, I think we'd serve our downstream folks well by closing all of these gaps. I don't think they ought to be blockers on getting our integration into releases; at first glance, none of them look like they'd present compatibility issues.
>
> We'd need to figure out what to do about the phoenix encoding bit, dependency-wise. Ideally we'd get the phoenix folks to isolate their data encoding into a standalone artifact. I'm not sure how much effort that will be, but I'd be happy to take the suggestion over to them.
>
> ---
>
> Thanks to everyone who made it all the way down here. That's the end of what I could think of after reflecting on this for a couple of days (thanks to our Mike Drob for bearing the brunt of my in-progress ramblings).
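To make the packaging concern in (4) concrete: Spark itself handles the Scala-incompatibility problem by suffixing each artifactId with the Scala binary version (e.g. spark-core_2.10 vs spark-core_2.11), so a downstream Maven user simply picks the artifact matching their Scala choice. A sketch of what the same convention could look like here; the hbase-spark_2.11 artifactId is hypothetical, as no such suffixed artifact had been published at the time of this thread:

```xml
<!-- Hypothetical sketch: scala-version-suffixed artifact, following
     Spark's own naming convention. Not a published HBase artifact. -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-spark_2.11</artifactId>
  <version>${hbase.version}</version>
</dependency>
```

This is what makes option (a)'s "full enumeration" painful for release managers: each supported Spark/Scala pair means another rebuild of the same source against different library versions, exactly as described in (3).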
>
> I know this is a wide variety of things; again, feel free to just respond in pieces to the sections that strike your fancy. I'll make sure we have a doc with a good summary of whatever consensus we reach and post it here, on the website, and/or in JIRA once we've had a while for folks to contribute.
>
> -busbey
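As background on the composite row key gap (HBASE-15335) mentioned in sections (1) and (6): the idea is to pack several typed fields into a single row key so that HBase's unsigned lexicographic byte ordering still sorts usefully, which in turn lets a SQL filter on the leading fields be pushed down as a row-key range scan. A minimal, self-contained sketch of the encoding trick using plain java.nio rather than any actual hbase-spark API; all names here are illustrative:

```java
import java.nio.ByteBuffer;

// Illustrative sketch (not hbase-spark API): pack fixed-width fields into
// one row key so that unsigned lexicographic byte order sorts rows by
// (metricId, timestamp).
public class CompositeKeySketch {

    // Flip the sign bit so signed values compare correctly when their
    // bytes are compared as unsigned, which is how HBase orders row keys.
    static long orderPreserving(long v) {
        return v ^ Long.MIN_VALUE;
    }

    static byte[] encode(int metricId, long timestamp) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Long.BYTES);
        buf.putInt(metricId ^ Integer.MIN_VALUE); // same sign-flip for ints
        buf.putLong(orderPreserving(timestamp));
        return buf.array();
    }

    // Unsigned lexicographic comparison, mirroring HBase's row key order.
    static int compare(byte[] x, byte[] y) {
        for (int i = 0; i < Math.min(x.length, y.length); i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) return d;
        }
        return x.length - y.length;
    }
}
```

With this layout, a query predicate like "metricId = 42" maps onto the contiguous key range from encode(42, Long.MIN_VALUE) to encode(42, Long.MAX_VALUE), so the storage layer can serve it with a single range scan instead of a full table scan. That range mapping is the computation info the Dataframes API in (5) can expose to the storage layer.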