Thanks for writing this up, Stack. Very nice read.
Snipping out some subjects to reply to directly:
On 8/19/18 6:48 AM, Stack wrote:
> GITHUB
> Can hbase adopt the github dev flow? Support PRs?
> It's a case of just starting the discussion on the dev list?
> Do we lose review/commentary information if we go the github route? Brief
> overview of what is possible w/ the new gitbox repos follows, ultimately
> answering that no, there should be no loss (github comments show up as jira
> comments).
> Most have github but not apache accounts. PRs are easier. Could encourage
> more contribution, lower the barrier to contrib.
This is something the PMC should take to heart. If we are excluding
contributions because of how we choose to accept them, we're limiting our
own growth. Do we have technical reasons (e.g. PreCommit) for which we
cannot accept PRs, or is it just because "we do patches because we do
patches"?
> PERF ACROSS VERSIONS
> Lucene has a perf curve on its home page with markings for when large
> features arrived and when releases were cut, so you can see increases or
> decreases in perf.
> There was a big slowdown going from 0.98 to 1.1.2 hbase.
> We talked about doing such a perf curve on the hbase home page. Would be a
> big project. Asked if anyone is interested?
> Perhaps a dedicated cluster up on Apache. We could do a whip-around to pay
> for it.
In theory, it's a great idea, but I clench up thinking about how to
actually get it off the ground. Not a reason to not discuss it, but I
think this is something where we would have to really think about how we
implement it (*and* extract value from it). My understanding is that
Mike M. continues to care for and feed the Lucene impl himself, but this
may be incorrect.
> Another attendee uses Apache Drill, but it's tough when it comes to types.
> Next we went over backburner items mentioned on the previous day, starting
> with SQL-like access.
> What about lightweight SQL support?
> At Huawei... they have a project going for lightweight SQL support in hbase
> based on calcite.
> For big queries, they'd go to sparksql.
> Did you look at phoenix?
> Phoenix is complicated, difficult. Calcite migration not done in Phoenix
> (Sparksql is not calcite-based).
> An interesting idea about a facade query analyzer making a transfer to
> sparksql if it's a big query. Would need stats.
For those who didn't know, there were efforts started around Calcite in
Phoenix (our Rajeshbabu and Ankit did some work, Maryanne Xu and James
Taylor, too. Maybe others). This work stalled just due to the sheer
quantity of work required, not necessarily for technical reasons.
I would love to see it re-invigorated. A hybrid sql system which could
automatically choose Phoenix JDBC or SparkSQL (or any other related SQL
systems -- Hive, Impala, etc) would be boss.
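To make the facade idea concrete, here's a minimal sketch. Everything in it is hypothetical (the class names, the idea of a pluggable row estimator standing in for real table stats, the arbitrary threshold): the point is only that, given a cardinality estimate, a thin layer can hand the query to whichever engine fits.

```java
import java.util.function.ToLongFunction;

// Hypothetical sketch of a facade that routes a SQL query to Phoenix JDBC
// or SparkSQL based on an estimated result size. The estimator stands in
// for real table/column stats; the threshold is made up.
public class QueryRouter {
    public enum Engine { PHOENIX, SPARKSQL }

    private final ToLongFunction<String> rowEstimator; // stats-lookup stand-in
    private final long sparkThreshold;

    public QueryRouter(ToLongFunction<String> rowEstimator, long sparkThreshold) {
        this.rowEstimator = rowEstimator;
        this.sparkThreshold = sparkThreshold;
    }

    // Small interactive queries stay on Phoenix; big scans/joins go to SparkSQL.
    public Engine route(String sql) {
        long estimatedRows = rowEstimator.applyAsLong(sql);
        return estimatedRows > sparkThreshold ? Engine.SPARKSQL : Engine.PHOENIX;
    }
}
```

The hard part, as noted, is the stats: without a decent estimate of result size, the router above has nothing to decide on.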
I do feel obligated to point out that the jump from a SQL implementation
that supports table scans to what Phoenix does now is quite huge. For
folks that don't require secondary indexing or a full SQL specification,
I'm quite sure you can do something in many fewer lines of code than
Phoenix with Calcite :)
> Talk to the phoenix project about generating a lightweight artifact. We
> could help with the build. One nice idea was building with a cut-down
> grammar, one that removed all the "big stuff" and problematic bits. Could
> return to the user a nice "not supported" if they try to do a 10Bx10B join.
A couple of things to unpack here:
* Phoenix client JAR size is one tangible thing that can be further
pruned (it was recently pruned already) based on what is possible given
the Hadoop and HBase artifacts
* Is standardizing on Phoenix's table structure and type serialization a
good idea? (I think "yes"). I would believe that this is something that
we can spin out into a separate Phoenix module "easily"
* I think having the ability in Phoenix to give a "soft" error (msft
clippy-esque: are you sure you want to run that query?) is good, but
it's hard to discern these except in the egregious cases. It's a hard
line to toe between what is "user learning" and the "platform helping".
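As a strawman for the cut-down grammar idea, a sketch (entirely hypothetical, including the keyword list): the lightweight client supports only simple scans, and anything "big" gets a clean "not supported" up front rather than a runaway query.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical cut-down grammar gate: reject the "big stuff" keywords with
// a friendly message. A real implementation would prune the parser grammar
// itself; this keyword scan is only illustrative.
public class LightweightGrammar {
    private static final List<String> UNSUPPORTED =
        List.of(" JOIN ", " GROUP BY ", " UNION ", " INTERSECT ");

    public static String check(String sql) {
        String normalized = " " + sql.toUpperCase(Locale.ROOT) + " ";
        for (String kw : UNSUPPORTED) {
            if (normalized.contains(kw)) {
                return "not supported in the lightweight client: "
                    + kw.trim().toLowerCase(Locale.ROOT);
            }
        }
        return "ok";
    }
}
```

The hard part called out above still applies: outside the egregious cases, telling "user learning" apart from "platform helping" isn't something a keyword list can do.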
All of these would be good discussion to spin off to dev@phoenix.
> SPARK
> Be able to scan hfiles directly. Work to transfer to parquet for spark to
> query.
> One attendee is using replication for streaming out to parquet, then
> having spark go against that. Talk of compacting into parquet, then having
> spark query the parquet files and, for the difference between now and the
> last compaction, go to the hbase api.
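That compact-then-delta pattern boils down to splitting a query's time range at the last compaction point: everything at or before it is served from the Parquet copy, everything after it from the HBase API. A rough sketch of that split (all names made up):

```java
// Hypothetical time-range split for the Parquet + HBase hybrid read:
// rows compacted into Parquet cover (-inf, lastCompactionTs]; anything
// newer has to come from HBase directly.
public class HybridReadPlanner {
    public static final class Plan {
        public final boolean readParquet; // scan the compacted Parquet files
        public final boolean readHBase;   // scan HBase for the fresh delta
        public final long hbaseFromTs;    // start ts of the HBase portion
        Plan(boolean readParquet, boolean readHBase, long hbaseFromTs) {
            this.readParquet = readParquet;
            this.readHBase = readHBase;
            this.hbaseFromTs = hbaseFromTs;
        }
    }

    public static Plan plan(long queryStartTs, long queryEndTs, long lastCompactionTs) {
        boolean parquet = queryStartTs <= lastCompactionTs;
        boolean hbase = queryEndTs > lastCompactionTs;
        long from = Math.max(queryStartTs, lastCompactionTs + 1);
        return new Plan(parquet, hbase, from);
    }
}
```

The fiddly bits a real system has to handle, which this glosses over, are deletes and updates that land after compaction but touch rows already in Parquet.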
Accumulo made a "client-facing" API around its "HFile" a while back that
I think turned out fairly nice:
https://accumulo.apache.org/1.8/apidocs/org/apache/accumulo/core/client/rfile/package-summary.html.
Having a stable API around HFiles that is client-facing would better
enable JVM-backed libraries to work with HFiles directly. (AFAIK, HFile
is all private now)
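Just to make the shape of such a thing concrete, here's a purely hypothetical sketch of what a client-facing surface could look like, loosely following Accumulo's RFile builder pattern. Nothing here exists in HBase today; every name is invented for illustration.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical client-facing HFile read API, modeled on Accumulo's RFile
// builder style. This is a sketch of a possible public surface, not a
// description of any existing HBase API.
public interface HFileScannerBuilder {
    HFileScannerBuilder from(String... hfilePaths);          // which files to read
    HFileScannerBuilder withRange(byte[] startRow, byte[] stopRow);
    Iterator<Map.Entry<byte[], byte[]>> build();             // key/value iteration
}
```

The value of a stable surface like this is exactly what the Accumulo one bought them: JVM-backed tooling can read the files directly without depending on internals that are free to change release to release.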