Thanks for writing this up, Stack. Very nice read.
Snipping out some subjects to reply to directly:
On 8/19/18 6:48 AM, Stack wrote:
> GITHUB
> Can hbase adopt the github dev flow? Support PRs?
> It's a case of just starting the discussion on the dev list?
> Do we lose review/commentary information if we go the github route? Brief
> overview of what is possible w/ the new gitbox repos follows, ultimately
> answering that no, there should be no loss (github comments show up as jira
> comments).
> Most have github but not apache accounts. PRs are easier. Could encourage
> more contribution, lower the barrier to contrib.
This is something the PMC should take to heart. If we are excluding
contributions because of how we choose to accept them, we're limiting our
own growth. Do we have technical reasons (e.g. PreCommit) for which we
cannot accept PRs, or is it just because "we do patches because we do
patches"?
> PERF ACROSS VERSIONS
> Lucene has a perf curve on its home page with markings for when large
> features arrived and when releases were cut, so you can see increases or
> decreases in perf.
> There was a big slowdown going from 0.98 to 1.1.2 hbase.
> We talked about doing such a perf curve on the hbase home page. Would be a
> big project. Asked if anyone is interested?
> Perhaps a dedicated cluster up on Apache. We could do a whip-around to pay
> for it.
In theory, it's a great idea, but I clench up thinking about how to
actually get it off the ground. Not a reason to not discuss it, but I
think this is something where we would have to really think about how we
implement it (*and* extract value from it). My understanding is that
Mike M. continues to care for and feed the Lucene impl himself, but this
may be incorrect.
> Another attendee uses Apache Drill, but it's tough when it comes to types.
> Next we went over backburner items mentioned on the previous day, starting
> with SQL-like access.
> What about lightweight SQL support?
> At Huawei... they have a project going for lightweight SQL support in hbase
> based on calcite.
> For big queries, they'd go to sparksql.
> Did you look at phoenix?
> Phoenix is complicated, difficult. Calcite migration not done in Phoenix
> (Sparksql is not calcite-based).
> An interesting idea about a facade query analyzer making a transfer to
> sparksql if it's a big query. Would need stats.
For those who didn't know, there were efforts started around Calcite in
Phoenix (our Rajeshbabu and Ankit did some work, Maryanne Xu and James
Taylor, too. Maybe others). This work stalled just due to the sheer
quantity of work required, not necessarily for technical reasons.
I would love to see it re-invigorated. A hybrid sql system which could
automatically choose Phoenix JDBC or SparkSQL (or any other related SQL
systems -- Hive, Impala, etc) would be boss.
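To make the facade idea concrete, here's a minimal sketch. Everything in it is hypothetical (the class names, the idea of a pluggable row estimator standing in for real table stats, the arbitrary threshold): the point is only that, given a cardinality estimate, a thin layer can hand the query to whichever engine fits.

```java
import java.util.function.ToLongFunction;

// Hypothetical sketch of a facade that routes a SQL query to Phoenix JDBC
// or SparkSQL based on an estimated result size. The estimator stands in
// for real table/column stats; the threshold is made up.
public class QueryRouter {
    public enum Engine { PHOENIX, SPARKSQL }

    private final ToLongFunction<String> rowEstimator; // stats-lookup stand-in
    private final long sparkThreshold;

    public QueryRouter(ToLongFunction<String> rowEstimator, long sparkThreshold) {
        this.rowEstimator = rowEstimator;
        this.sparkThreshold = sparkThreshold;
    }

    // Small interactive queries stay on Phoenix; big scans/joins go to SparkSQL.
    public Engine route(String sql) {
        long estimatedRows = rowEstimator.applyAsLong(sql);
        return estimatedRows > sparkThreshold ? Engine.SPARKSQL : Engine.PHOENIX;
    }
}
```

The hard part, as noted, is the stats: without a decent estimate of result size, the router above has nothing to decide on.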
I do feel obligated to point out that the jump from a SQL implementation
that supports table scans to what Phoenix does now is quite huge. For
folks that don't require secondary indexing or a full SQL specification,
I'm quite sure you can do something in many fewer lines of code than
Phoenix with Calcite :)
> Talk to the phoenix project about generating a lightweight artifact. We
> could help with the build. One nice idea was building with a cut-down
> grammar, one that removed all the "big stuff" and problematic bits. Could
> return to the user a nice "not supported" if they try to do a 10Bx10B join.
A couple of things to unpack here:
* Phoenix client JAR size is one tangible thing that can be further
pruned (it was recently pruned already) based on what is possible given
the Hadoop and HBase artifacts
* Is standardizing on Phoenix's table structure and type serialization a
good idea? (I think "yes"). I would believe that this is something that
we can spin out into a separate Phoenix module "easily"
* I think having the ability in Phoenix to give a "soft" error (msft
clippy-esque: are you sure you want to run that query?) is good, but
it's hard to discern these except in the egregious cases. It's a hard
line to toe between what is "user learning" and the "platform helping".
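As a strawman for the cut-down grammar idea, a sketch (entirely hypothetical, including the keyword list): the lightweight client supports only simple scans, and anything "big" gets a clean "not supported" up front rather than a runaway query.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical cut-down grammar gate: reject the "big stuff" keywords with
// a friendly message. A real implementation would prune the parser grammar
// itself; this keyword scan is only illustrative.
public class LightweightGrammar {
    private static final List<String> UNSUPPORTED =
        List.of(" JOIN ", " GROUP BY ", " UNION ", " INTERSECT ");

    public static String check(String sql) {
        String normalized = " " + sql.toUpperCase(Locale.ROOT) + " ";
        for (String kw : UNSUPPORTED) {
            if (normalized.contains(kw)) {
                return "not supported in the lightweight client: "
                    + kw.trim().toLowerCase(Locale.ROOT);
            }
        }
        return "ok";
    }
}
```

The hard part called out above still applies: outside the egregious cases, telling "user learning" apart from "platform helping" isn't something a keyword list can do.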
All of these would be good discussion to spin off to dev@phoenix.
> SPARK
> Be able to scan hfiles directly. Work to transfer to parquet for spark to
> query.
> One attendee is using replication for streaming out to parquet, then
> having spark go against that. Talk of compacting into parquet, then having
> spark query the parquet files and, for the difference between now and the
> last compaction, go to the hbase api.
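That compact-then-delta pattern boils down to splitting a query's time range at the last compaction point: everything at or before it is served from the Parquet copy, everything after it from the HBase API. A rough sketch of that split (all names made up):

```java
// Hypothetical time-range split for the Parquet + HBase hybrid read:
// rows compacted into Parquet cover (-inf, lastCompactionTs]; anything
// newer has to come from HBase directly.
public class HybridReadPlanner {
    public static final class Plan {
        public final boolean readParquet; // scan the compacted Parquet files
        public final boolean readHBase;   // scan HBase for the fresh delta
        public final long hbaseFromTs;    // start ts of the HBase portion
        Plan(boolean readParquet, boolean readHBase, long hbaseFromTs) {
            this.readParquet = readParquet;
            this.readHBase = readHBase;
            this.hbaseFromTs = hbaseFromTs;
        }
    }

    public static Plan plan(long queryStartTs, long queryEndTs, long lastCompactionTs) {
        boolean parquet = queryStartTs <= lastCompactionTs;
        boolean hbase = queryEndTs > lastCompactionTs;
        long from = Math.max(queryStartTs, lastCompactionTs + 1);
        return new Plan(parquet, hbase, from);
    }
}
```

The fiddly bits a real system has to handle, which this glosses over, are deletes and updates that land after compaction but touch rows already in Parquet.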
Accumulo made a "client-facing" API around its "HFile" a while back that
I think turned out fairly nice:
https://accumulo.apache.org/1.8/apidocs/org/apache/accumulo/core/client/rfile/package-summary.html.
Having a stable API around HFiles that is client-facing would better
enable JVM-backed libraries to work with HFiles directly. (AFAIK, HFile
is all private now)
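Just to make the shape of such a thing concrete, here's a purely hypothetical sketch of what a client-facing surface could look like, loosely following Accumulo's RFile builder pattern. Nothing here exists in HBase today; every name is invented for illustration.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical client-facing HFile read API, modeled on Accumulo's RFile
// builder style. This is a sketch of a possible public surface, not a
// description of any existing HBase API.
public interface HFileScannerBuilder {
    HFileScannerBuilder from(String... hfilePaths);          // which files to read
    HFileScannerBuilder withRange(byte[] startRow, byte[] stopRow);
    Iterator<Map.Entry<byte[], byte[]>> build();             // key/value iteration
}
```

The value of a stable surface like this is exactly what the Accumulo one bought them: JVM-backed tooling can read the files directly without depending on internals that are free to change release to release.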