Thanks for writing this up, Stack. Very nice read.

Snipping out some subjects to reply to directly:

On 8/19/18 6:48 AM, Stack wrote:

GITHUB
Can hbase adopt the github dev flow? Support PRs?
Is it just a case of starting the discussion on the dev list?
Do we lose review/commentary information if we go the github route? A brief
overview of what is possible w/ the new gitbox repos follows, ultimately
answering no: there should be no loss (github comments show up as jira
comments).
Most have github but not apache accounts. PRs are easier. Could encourage
more contribution, lower the barrier to contrib.

This is something the PMC should take to heart. If we are excluding contributions because of how we choose to accept them, we're limiting our own growth. Do we have technical reasons (e.g. PreCommit) for which we cannot accept PRs, or is it just "we do patches because we do patches"?


PERF ACROSS VERSIONS
Lucene has a perf curve on its home page with markings for when large
features arrived and when releases were cut, so you can see increases and
decreases in perf.
There was a big slowdown going from hbase 0.98 to 1.1.2.
We talked about doing such a perf curve on the hbase home page. Would be a
big project. Asked if anyone was interested.
Perhaps a dedicated cluster up on Apache. We could do a whip-around to pay
for it.

In theory, it's a great idea, but I clench up thinking about how to actually get it off the ground. Not a reason not to discuss it, but I think this is something we would have to think hard about how to implement (*and* how to extract value from). My understanding is that Mike M. continues to care for and feed the Lucene impl himself, but this may be incorrect.
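
FWIW, the measurement kernel for one data point on such a curve is tiny -- the real work is the dedicated cluster, the scheduling, and the publishing. A minimal sketch (the table name and fixed workload are made up; a real run would more likely drive YCSB or "hbase pe"):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.VersionInfo;

// Emits one (version, puts/sec) data point. Assumes a table 'perf_test'
// with family 'f' already exists; the fixed workload keeps runs comparable
// across hbase versions.
public class PerfPoint {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    int rows = 100_000;
    try (Connection conn = ConnectionFactory.createConnection(conf);
        Table table = conn.getTable(TableName.valueOf("perf_test"))) {
      long start = System.nanoTime();
      for (int i = 0; i < rows; i++) {
        Put put = new Put(Bytes.toBytes(String.format("row%09d", i)));
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
        table.put(put);
      }
      double secs = (System.nanoTime() - start) / 1e9;
      System.out.printf("%s: %.0f puts/sec%n", VersionInfo.getVersion(), rows / secs);
    }
  }
}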

Another attendee uses Apache Drill, but it's tough when it comes to types.
Next we went over backburner items mentioned on the previous day, starting
with SQL-like access.
What about lightweight SQL support?
At Huawei... they have a project going for lightweight SQL support in hbase
based on calcite.
For big queries, they'd go to sparksql.
Did you look at phoenix?
Phoenix is complicated, difficult. Calcite migration not done in Phoenix
(sparksql is not calcite-based).

> An interesting idea about a facade query analyzer that transfers big
> queries to sparksql. Would need stats.

For those who didn't know, there were efforts started around Calcite in Phoenix (our Rajeshbabu and Ankit did some work; Maryann Xue and James Taylor, too; maybe others). This work stalled just due to the sheer quantity of work required, not necessarily for technical reasons.

I would love to see it re-invigorated. A hybrid SQL system which could automatically choose Phoenix JDBC or SparkSQL (or any other related SQL system -- Hive, Impala, etc.) would be boss.
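
To make the facade idea concrete: the dispatch itself could be thin; all the hard parts hide in the stats. A hand-wavy sketch (the threshold, estimateRows(), and the SparkSQL hand-off are all stand-ins I've invented, not real APIs -- only the JDBC calls and the stock phoenix URL form are real):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

// Hypothetical facade: small queries go to Phoenix over JDBC, big ones get
// handed off to SparkSQL.
public class HybridSqlFacade {
  private static final long BIG_QUERY_ROWS = 100_000_000L; // arbitrary cutoff

  public ResultSet execute(String sql) throws Exception {
    if (estimateRows(sql) <= BIG_QUERY_ROWS) {
      // Low-latency path: Phoenix JDBC (connection kept open for the caller).
      Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3");
      return conn.createStatement().executeQuery(sql);
    }
    // Heavy path: hand the query to sparksql (Livy, a thrift server, etc.);
    // elided so the sketch stays JDBC-only.
    throw new UnsupportedOperationException("route to sparksql");
  }

  private long estimateRows(String sql) {
    return 0L; // placeholder: a real impl needs stats (e.g. Phoenix guideposts)
  }
}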

I do feel obligated to point out that the jump from a SQL implementation that supports table scans to what Phoenix does now is quite huge. For folks who don't require secondary indexing or a full SQL specification, I'm quite sure you can do something in many fewer lines of code than Phoenix with Calcite :)

Talk to the phoenix project about generating a lightweight artifact. We could
help with the build. One nice idea was building with a cut-down grammar, one
that removed all the "big stuff" and problematic parts. Could return to the
user a nice "not supported" if they try to do a 10Bx10B join.

A couple of things to unpack here:

* Phoenix client JAR size is one tangible thing that can be further pruned (it was recently pruned already) based on what is possible given the Hadoop and HBase artifacts

* Is standardizing on Phoenix's table structure and type serialization a good idea? (I think "yes"). I believe this is something we can spin out into a separate Phoenix module "easily"

* I think having the ability in Phoenix to give a "soft" error (msft clippy-esque: are you sure you want to run that query?) is good, but it's hard to discern these except in the egregious cases. It's a fine line to walk between "user learning" and "the platform helping". A rough sketch of such a guard follows this list.
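
Rough sketch of the guard I have in mind. Everything here is invented for illustration (the class, the threshold, where the estimates come from); a real check would live in the query planner and key off stats:

// Hypothetical pre-execution "are you sure?" check. Cardinality estimates
// are passed in here; in practice they'd come from stats.
final class SoftQueryGuard {
  private static final long WARN_JOIN_CARDINALITY = 1_000_000_000L; // arbitrary

  /** Returns a warning for egregious plans, or null if the plan looks sane. */
  static String check(long leftRows, long rightRows) {
    // Compare via division so a 10Bx10B estimate can't overflow a long.
    if (leftRows > 0 && rightRows > WARN_JOIN_CARDINALITY / leftRows) {
      return "Estimated join cardinality exceeds " + WARN_JOIN_CARDINALITY
          + " rows; not supported here -- consider sparksql instead.";
    }
    return null;
  }
}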

All of these would be good discussions to spin off to dev@phoenix.

SPARK
Be able to scan hfiles directly. Work to transfer to parquet for spark to
query.
One attendee is using replication for streaming out to parquet, then having
spark go against that. Talk of compacting into parquet, then having spark
query the parquet files and, for the difference between now and the last
compaction, go to the hbase api.
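
That last pattern is neat. The "difference between now and last compaction" piece is just a time-range scan on the hbase side; a minimal sketch (the table name and how the compaction timestamp gets recorded are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

// Scans only cells written since the last compaction-to-parquet run.
public class DeltaScan {
  public static void main(String[] args) throws Exception {
    long lastCompactionTs = Long.parseLong(args[0]); // recorded by the compaction job
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
        Table table = conn.getTable(TableName.valueOf("events"))) {
      Scan scan = new Scan();
      scan.setTimeRange(lastCompactionTs, System.currentTimeMillis());
      try (ResultScanner rs = table.getScanner(scan)) {
        for (Result r : rs) {
          // These rows are the "delta"; union them with the parquet snapshot.
          System.out.println(r);
        }
      }
    }
  }
}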

Accumulo made a "client-facing" API around its "HFile" a while back that I think turned out fairly nice: https://accumulo.apache.org/1.8/apidocs/org/apache/accumulo/core/client/rfile/package-summary.html.

Having a stable API around HFiles that is client-facing would better enable JVM-backed libraries to work with HFiles directly. (AFAIK, HFile is all private now)
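
For reference, reading through that Accumulo API needs no live cluster and looks roughly like this (the path is made up); a client-facing HFile API could expose a similar surface:

import java.util.Map.Entry;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.rfile.RFile;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Reads an RFile straight off the filesystem via the public client API.
public class ReadRFile {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Scanner scanner = RFile.newScanner()
        .from("hdfs:///accumulo/tables/1/default_tablet/F0000abc.rf")
        .withFileSystem(fs)
        .build();
    try {
      for (Entry<Key, Value> entry : scanner) {
        System.out.println(entry.getKey() + " -> " + entry.getValue());
      }
    } finally {
      scanner.close();
    }
  }
}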
