RE: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

Yan Zhou.sc Tue, 11 Aug 2015 18:07:41 -0700

No, the Astro bulk loader does not use its own shuffle. But its map- and 
reduce-side processing is somewhat different from that of HBase’s bulk loader, 
which I believe is used by many HBase apps.

From: Ted Malaska [mailto:[email protected]]
Sent: Wednesday, August 12, 2015 8:56 AM
To: Yan Zhou.sc
Cc: [email protected]; Ted Yu; Bing Xiao (Bing); user
Subject: RE: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

The bulk load code is in HBASE-14150 if you are interested. Let me know how it 
can be made faster.

It's just a Spark shuffle and writing HFiles. Unless Astro wrote its own 
shuffle, the times should be very close.
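
The "shuffle, then write HFiles" pattern mentioned above can be sketched in 
plain Python. This is only an illustration of the general idea, not the 
HBASE-14150 or Astro code: the function name plan_bulk_load and the sample 
split keys are hypothetical. A real Spark job would do the partition-and-sort 
step in the shuffle itself (for example with 
repartitionAndSortWithinPartitions) and then write one sorted HFile per region.

```python
from bisect import bisect_right


def plan_bulk_load(rows, split_keys):
    """Group (row_key, value) pairs by target region and sort each group.

    `split_keys` is the sorted list of region start keys (hypothetical input);
    region i holds keys in [split_keys[i-1], split_keys[i]).
    """
    regions = [[] for _ in range(len(split_keys) + 1)]
    for key, value in rows:
        # bisect_right sends a key equal to a split key to the region
        # that starts at that split key.
        regions[bisect_right(split_keys, key)].append((key, value))
    # HFiles must be written in row-key order, so sort within each region.
    return [sorted(region, key=lambda kv: kv[0]) for region in regions]


rows = [("m", "1"), ("a", "2"), ("z", "3"), ("b", "4")]
for region in plan_bulk_load(rows, ["g", "t"]):
    print(region)
```

With the sample split keys "g" and "t", rows "a" and "b" land in the first 
region, "m" in the second, and "z" in the third, each group sorted by key.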
On Aug 11, 2015 8:49 PM, "Yan Zhou.sc" 
<[email protected]> wrote:
Ted,

Thanks for pointing out more details of HBase-14181. I am afraid I may still 
need to learn more before I can make very accurate and pointed comments.

As for filter pushdown, Astro has a powerful approach that breaks down 
arbitrarily complex logical expressions composed of AND/OR/IN/NOT to generate 
partition-specific predicates to be pushed down to HBase. This may not be a 
significant performance improvement if the filter logic is simple and/or the 
processing is IO-bound, but could be so for online ad-hoc analysis.
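
The idea of partition-specific predicates can be illustrated with a small 
Python sketch. It is not Astro's implementation: the tuple encoding of 
predicates, the function name specialize, and the assumption that each 
partition holds integer row keys in a half-open range [lo, hi) are all 
hypothetical and chosen only for illustration. The sketch partially evaluates 
a boolean expression against a partition's key range and returns True, False, 
or a smaller residual predicate.

```python
def specialize(pred, lo, hi):
    """Partially evaluate `pred` for a partition with row keys in [lo, hi).

    Returns True (every key matches), False (no key matches, so the
    partition can be skipped), or a residual predicate to push down.
    """
    kind = pred[0]
    if kind == "cmp":
        _, op, v = pred
        if op == "<":
            if hi <= v:
                return True
            if lo >= v:
                return False
        elif op == ">":
            if lo > v:
                return True
            if hi <= v:
                return False
        elif op == "=":
            if v < lo or v >= hi:
                return False
        else:
            raise ValueError(f"unsupported operator: {op}")
        return pred  # undecided for this range: keep as residual
    if kind == "in":
        kept = [v for v in pred[1] if lo <= v < hi]
        return ("in", kept) if kept else False
    if kind == "not":
        inner = specialize(pred[1], lo, hi)
        if inner is True:
            return False
        if inner is False:
            return True
        return ("not", inner)
    if kind in ("and", "or"):
        left = specialize(pred[1], lo, hi)
        right = specialize(pred[2], lo, hi)
        absorbing = kind == "or"  # True absorbs OR, False absorbs AND
        if left is absorbing or right is absorbing:
            return absorbing
        # A remaining constant is the identity element and can be dropped.
        if isinstance(left, bool):
            return right
        if isinstance(right, bool):
            return left
        return (kind, left, right)
    raise ValueError(f"unknown predicate kind: {kind}")


# key > 100 AND NOT (key IN (150, 250))
pred = ("and", ("cmp", ">", 100), ("not", ("in", [150, 250])))
for lo, hi in [(0, 100), (100, 200), (200, 300), (300, 400)]:
    print((lo, hi), specialize(pred, lo, hi))
```

Under these assumptions the first partition specializes to False and can be 
skipped, the last specializes to True and needs no filter at all, and the two 
in between receive smaller residual predicates.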

For UDFs, Astro supports them both inside and outside of the HBase custom filter.

For secondary indexes, Astro does not support them now. With probable support 
by HBase in the future (thanks to Ted Yu’s comments a while ago), we could add 
this support along with its specific optimizations.

For bulk load, Astro has a much faster way to load the tabular data, we believe.

Right now, Astro’s filter pushdown goes through HBase built-in filters and a 
custom filter.

As for HBase-14181, I see some overlap with Astro. Both depend on Spark SQL, 
both support the Spark DataFrame as an access interface, and both support 
predicate pushdown.
Astro is not designed for MR (or Spark’s equivalent) access, though.

If HBase-14181 is aiming for access to HBase data through a subset of 
DataFrame functionality such as filter, projection, and other map-side ops, 
would it be feasible to decouple it from Spark?
My understanding is that 14181 does not run the Spark execution engine at all, 
but will make use of Spark DataFrame semantics and/or logical planning to pass 
a logical (sub-)plan to HBase. If true, it might be desirable to support 
DataFrame directly in HBase.

Thanks,

From: Ted Malaska [mailto:[email protected]]
Sent: Wednesday, August 12, 2015 7:28 AM
To: Yan Zhou.sc
Cc: user; [email protected]; Bing Xiao (Bing); Ted Yu
Subject: RE: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

Hey Yan,

I've been the one building out this Spark functionality in HBase, so maybe I 
can help clarify.

The hbase-spark module is just focused on making Spark integration with HBase 
easy and out of the box for both Spark and Spark Streaming.

I, and I believe the HBase team, have no desire to build a SQL engine in HBase. 
This JIRA comes the closest to that line. The main thing here is filter 
pushdown logic for basic SQL operations like =, >, and <. User-defined 
functions and secondary indexes are not in my scope.
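
As a rough illustration of what pushdown for basic comparisons can mean, the 
sketch below folds AND-ed =, >, and < conditions on a row key into a single 
scan range. It is not the hbase-spark module's code: the function name 
scan_bounds, the (operator, value) encoding, and the assumption of integer row 
keys are hypothetical.

```python
def scan_bounds(conjuncts):
    """Fold AND-ed row-key comparisons into (start, stop) scan bounds.

    `start` is inclusive, `stop` is exclusive, and None means unbounded.
    Row keys are assumed to be integers for this illustration.
    """
    start, stop = None, None
    for op, value in conjuncts:
        if op == "=":
            lo, hi = value, value + 1
        elif op == ">":
            lo, hi = value + 1, None
        elif op == "<":
            lo, hi = None, value
        else:
            raise ValueError(f"unsupported operator: {op}")
        # Keep the tightest bound seen so far on each side.
        if lo is not None and (start is None or lo > start):
            start = lo
        if hi is not None and (stop is None or hi < stop):
            stop = hi
    return start, stop


print(scan_bounds([(">", 10), ("<", 20)]))
print(scan_bounds([(">", 10), ("=", 15), ("<", 100)]))
```

Here "key > 10 AND key < 20" becomes the range (11, 20), and adding "key = 15" 
narrows it to (15, 16), so the store only has to scan the matching key range 
instead of filtering every row.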

Another main goal of the hbase-spark module is to allow a user to do with 
Spark/HBase anything they do with MR/HBase now. Things like bulk load.

Let me know if you have any questions.

Ted Malaska
On Aug 11, 2015 7:13 PM, "Yan Zhou.sc" 
<[email protected]> wrote:
We have not “formally” published any numbers yet. A good reference is the slide 
deck we posted for the meetup in March or, better yet, interested parties can 
run performance comparisons themselves for now.

As for the current status of Astro, we have been focusing on fixing bugs (a 
UDF-related bug in some coprocessor/custom filter combos) and adding support 
for querying string columns in HBase as integers from Astro.

Thanks,

From: Ted Yu [mailto:[email protected]]
Sent: Wednesday, August 12, 2015 7:02 AM
To: Yan Zhou.sc
Cc: Bing Xiao (Bing); [email protected]; [email protected]
Subject: Re: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

Yan:
Where can I find performance numbers for Astro (it's close to the middle of 
August)?

Cheers

On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc 
<[email protected]> wrote:
Finally I can take a look at HBASE-14181 now. Unfortunately, no design doc is 
mentioned. Superficially it is very similar to Astro, with the difference that 
it is part of the HBase client library, while Astro works as a Spark package 
and so will evolve and function more closely with Spark SQL/DataFrame than 
with HBase.

In terms of architecture, my take is that it is loosely-coupled query engines 
on top of a KV store vs. an array of query engines supported by, and packaged 
as part of, a KV store.

Functionality-wise the two could be close but Astro also supports Python as a 
result of tight integration with Spark.
It will be interesting to see performance comparisons when HBase-14181 is ready.

Thanks,

From: Ted Yu [mailto:[email protected]]
Sent: Tuesday, August 11, 2015 3:28 PM
To: Yan Zhou.sc
Cc: Bing Xiao (Bing); [email protected]; [email protected]
Subject: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"

HBase will not have a query engine.

It will provide better support to query engines.

Cheers

On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc 
<[email protected]> wrote:
Ted,

I’m in China now, and seem to have difficulty accessing Apache Jira. Anyway, 
it appears to me that 
HBASE-14181<https://issues.apache.org/jira/browse/HBASE-14181> attempts to 
support Spark DataFrame inside HBase.
If true, one question I have is whether HBase is intended to have a built-in 
query engine or not, or whether it will stick with the current approach as a 
k-v store with some built-in processing capabilities in the form of 
coprocessors, custom filters, etc., which allows for loosely-coupled query 
engines built on top of it.

Thanks,

From: Ted Yu [mailto:[email protected]]
Sent: August 11, 2015 8:54
To: Bing Xiao (Bing)
Cc: [email protected]; [email protected]; Yan Zhou.sc
Subject: Re: Package Release Announcement: Spark SQL on HBase "Astro"

Yan / Bing:
Mind taking a look at 
HBASE-14181<https://issues.apache.org/jira/browse/HBASE-14181> 'Add Spark 
DataFrame DataSource to HBase-Spark Module' ?

Thanks

On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) 
<[email protected]> wrote:
We are happy to announce the availability of the Spark SQL on HBase 1.0.0 
release.  http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
The main features in this package, dubbed “Astro”, include:

• Systematic and powerful handling of data pruning and intelligent scan, based 
  on a partial evaluation technique

• HBase pushdown capabilities like custom filters and coprocessors to support 
  ultra-low-latency processing

• SQL and Data Frame support

• More SQL capabilities made possible (secondary index, bloom filter, primary 
  key, bulk load, update)

• Joins with data from other sources

• Python/Java/Scala support

• Support for the latest Spark 1.4.0 release

The tests by the Huawei team and community contributors covered the following 
areas: bulk load; projection pruning; partition pruning; partial evaluation; 
code generation; coprocessor; custom filtering; DML; complex filtering on keys 
and non-keys; join/union with non-HBase data; Data Frame; multi-column family 
test. We will post the test results, including performance tests, in the middle 
of August.

You are very welcome to try out or deploy the package, and to help improve the 
integration tests with various combinations of the settings, extensive Data 
Frame tests, complex join/union tests, and extensive performance tests. Please 
use the “Issues” and “Pull Requests” links on the package homepage if you want 
to report bugs, improvements, or feature requests.

Special thanks to project owner and technical leader Yan Zhou, the Huawei 
global team, community contributors, and Databricks. Databricks has been 
providing great assistance from the design to the release.

“Astro”, the Spark SQL on HBase package, will be useful for ultra-low-latency 
query and analytics of large-scale data sets in vertical enterprises. We will 
continue to work with the community to develop new features and improve the 
code base. Your comments and suggestions are greatly appreciated.

Yan Zhou / Bing Xiao
Huawei Big Data team
