The bulk load code is in HBASE-14150, if you are interested. Let me know how it can be made faster.
It's just a Spark shuffle and writing HFiles. Unless Astro wrote its own shuffle, the times should be very close.

On Aug 11, 2015 8:49 PM, "Yan Zhou.sc" <yan.zhou...@huawei.com> wrote:

> Ted,
>
> Thanks for pointing out more details of HBASE-14181. I am afraid I may still need to learn more before I can make very accurate and pointed comments.
>
> As for filter pushdown, Astro has a powerful approach that breaks down arbitrarily complex logic expressions composed of AND/OR/IN/NOT to generate partition-specific predicates to be pushed down to HBase. This may not be a significant performance improvement if the filter logic is simple and/or the processing is IO-bound, but it could be for online ad-hoc analysis.
>
> For UDFs, Astro supports them both inside and outside of HBase custom filters.
>
> For secondary indexes, Astro does not support them now. With probable support by HBase in the future (thanks to Ted Yu’s comments a while ago), we could add this support along with its specific optimizations.
>
> For bulk load, Astro has a much faster way to load tabular data, we believe.
>
> Right now, Astro’s filter pushdown is through HBase built-in filters and custom filters.
>
> As for HBASE-14181, I see some overlap with Astro. Both have dependencies on Spark SQL, both support the Spark DataFrame as an access interface, and both support predicate pushdown.
>
> Astro is not designed for MR (or Spark’s equivalent) access, though.
>
> If HBASE-14181 is shooting for access to HBase data through a subset of DataFrame functionality like filter, projection, and other map-side ops, would it be feasible to decouple it from Spark?
>
> My understanding is that 14181 does not run the Spark execution engine at all, but will make use of Spark DataFrame semantics and/or logical planning to pass a logical (sub-)plan to HBase. If true, it might be desirable to directly support DataFrame in HBase.
>
> Thanks,
>
> *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
> *Sent:* Wednesday, August 12, 2015 7:28 AM
> *To:* Yan Zhou.sc
> *Cc:* user; dev@spark.apache.org; Bing Xiao (Bing); Ted Yu
> *Subject:* RE: RE: RE: Package Release Annoucement: Spark SQL on HBase "Astro"
>
> Hey Yan,
>
> I've been the one building out this Spark functionality in HBase, so maybe I can help clarify.
>
> The hbase-spark module is just focused on making Spark integration with HBase easy and out of the box for both Spark and Spark Streaming.
>
> I, and I believe the HBase team, have no desire to build a SQL engine in HBase. This JIRA comes the closest to that line. The main thing here is filter pushdown logic for basic SQL operations like =, >, and <. User-defined functions and secondary indexes are not in my scope.
>
> Another main goal of the hbase-spark module is to allow a user to do with Spark/HBase anything they do with MR/HBase now. Things like bulk load.
>
> Let me know if you have any questions.
>
> Ted Malaska
>
> On Aug 11, 2015 7:13 PM, "Yan Zhou.sc" <yan.zhou...@huawei.com> wrote:
>
> We have not “formally” published any numbers yet. A good reference is a slide deck we posted for the meetup in March, or better yet, for interested parties to run performance comparisons by themselves for now.
>
> As for the status quo of Astro, we have been focusing on fixing bugs (a UDF-related bug in some coprocessor/custom filter combos) and adding support for querying string columns in HBase as integers from Astro.
>
> Thanks,
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Wednesday, August 12, 2015 7:02 AM
> *To:* Yan Zhou.sc
> *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
> *Subject:* Re: RE: RE: Package Release Annoucement: Spark SQL on HBase "Astro"
>
> Yan:
>
> Where can I find performance numbers for Astro (it's close to the middle of August)?
>
> Cheers
>
> On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
>
> Finally I can take a look at HBASE-14181 now. Unfortunately, no design doc is mentioned. Superficially it is very similar to Astro, the difference being that it is part of the HBase client library, while Astro works as a Spark package and so will evolve and function more closely with Spark SQL/DataFrame than with HBase.
>
> In terms of architecture, my take is loosely-coupled query engines on top of a KV store vs. an array of query engines supported by, and packaged as part of, a KV store.
>
> Functionality-wise the two could be close, but Astro also supports Python as a result of its tight integration with Spark.
>
> It will be interesting to see performance comparisons when HBASE-14181 is ready.
>
> Thanks,
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Tuesday, August 11, 2015 3:28 PM
> *To:* Yan Zhou.sc
> *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
> *Subject:* Re: RE: Package Release Annoucement: Spark SQL on HBase "Astro"
>
> HBase will not have a query engine.
>
> It will provide better support to query engines.
>
> Cheers
>
> On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
>
> Ted,
>
> I’m in China now, and seem to have difficulty accessing Apache JIRA. Anyway, it appears to me that HBASE-14181 <https://issues.apache.org/jira/browse/HBASE-14181> attempts to support Spark DataFrame inside HBase.
>
> If true, one question for me is whether HBase is intended to have a built-in query engine or not, or whether it will stick with the current way, as a k-v store with some built-in processing capabilities in the form of coprocessors, custom filters, etc., which allows for loosely-coupled query engines built on top of it.
>
> Thanks,
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com <yuzhih...@gmail.com>]
> *Sent:* August 11, 2015 8:54
> *To:* Bing Xiao (Bing)
> *Cc:* dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
> *Subject:* Re: Package Release Annoucement: Spark SQL on HBase "Astro"
>
> Yan / Bing:
>
> Mind taking a look at HBASE-14181 <https://issues.apache.org/jira/browse/HBASE-14181> 'Add Spark DataFrame DataSource to HBase-Spark Module'?
>
> Thanks
>
> On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) <bing.x...@huawei.com> wrote:
>
> We are happy to announce the availability of the Spark SQL on HBase 1.0.0 release.
> http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
>
> The main features in this package, dubbed “Astro”, include:
>
> · Systematic and powerful handling of data pruning and intelligent scan, based on the partial evaluation technique
> · HBase pushdown capabilities like custom filters and coprocessors to support ultra low latency processing
> · SQL and DataFrame support
> · More SQL capabilities made possible (secondary index, bloom filter, primary key, bulk load, update)
> · Joins with data from other sources
> · Python/Java/Scala support
> · Support for the latest Spark 1.4.0 release
>
> The tests by the Huawei team and community contributors covered the following areas: bulk load; projection pruning; partition pruning; partial evaluation; code generation; coprocessor; custom filtering; DML; complex filtering on keys and non-keys; join/union with non-HBase data; DataFrame; multi-column family test. We will post the test results, including performance tests, in the middle of August.
>
> You are very welcome to try out or deploy the package, and to help improve the integration tests with various combinations of the settings, extensive DataFrame tests, complex join/union tests and extensive performance tests.
> Please use the “Issues” and “Pull Requests” links on the package homepage if you want to report bugs or make improvement or feature requests.
>
> Special thanks to project owner and technical leader Yan Zhou, the Huawei global team, community contributors and Databricks. Databricks has been providing great assistance from the design to the release.
>
> “Astro”, the Spark SQL on HBase package, will be useful for ultra low latency query and analytics of large scale data sets in vertical enterprises. We will continue to work with the community to develop new features and improve the code base. Your comments and suggestions are greatly appreciated.
>
> Yan Zhou / Bing Xiao
>
> Huawei Big Data team
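
For readers following the filter pushdown discussion in this thread (breaking AND/OR/IN/NOT expressions down into partition-specific predicates), here is a minimal Python sketch of the general idea of partial evaluation against a partition's key range. It is not Astro's code and not the HBASE-14181 implementation. It assumes integer row keys and non-empty half-open partition ranges `[lo, hi)`, and every name in it (`Cmp`, `In`, `Not`, `And`, `Or`, `partial_eval`, `plan`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Cmp:
    op: str      # one of "<", "<=", ">", ">=", "="
    value: int


@dataclass(frozen=True)
class In:
    values: Tuple[int, ...]


@dataclass(frozen=True)
class Not:
    child: "Expr"


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


Expr = Union[bool, Cmp, In, Not, And, Or]


def partial_eval(expr, lo, hi):
    """Simplify expr for integer row keys in the non-empty range [lo, hi).

    Returns True (every key matches), False (no key matches), or a
    smaller residual expression that still has to be checked per row.
    """
    if isinstance(expr, bool):
        return expr
    if isinstance(expr, Cmp):
        v, last = expr.value, hi - 1   # last is the largest key in the range
        always = {"<": last < v, "<=": last <= v, ">": lo > v,
                  ">=": lo >= v, "=": lo == v == last}[expr.op]
        never = {"<": lo >= v, "<=": lo > v, ">": last <= v,
                 ">=": last < v, "=": v < lo or v > last}[expr.op]
        return True if always else False if never else expr
    if isinstance(expr, In):
        # Keep only the listed values that can occur in this partition.
        kept = tuple(sorted({v for v in expr.values if lo <= v < hi}))
        if not kept:
            return False
        return True if len(kept) == hi - lo else In(kept)
    if isinstance(expr, Not):
        child = partial_eval(expr.child, lo, hi)
        return (not child) if isinstance(child, bool) else Not(child)
    left = partial_eval(expr.left, lo, hi)
    right = partial_eval(expr.right, lo, hi)
    if isinstance(expr, And):
        if left is False or right is False:
            return False
        if left is True:
            return right
        return left if right is True else And(left, right)
    # Remaining case: Or
    if left is True or right is True:
        return True
    if left is False:
        return right
    return left if right is False else Or(left, right)


def plan(expr, partitions):
    """Return {partition: residual} for partitions that still need a scan."""
    result = {}
    for lo, hi in partitions:
        residual = partial_eval(expr, lo, hi)
        if residual is not False:      # False means the partition is pruned
            result[(lo, hi)] = residual
    return result


# (key >= 10 AND key < 25) OR key IN (42, 99)
predicate = Or(And(Cmp(">=", 10), Cmp("<", 25)), In((42, 99)))
partitions = [(0, 10), (10, 20), (20, 30), (30, 50), (50, 100)]
for part, residual in plan(predicate, partitions).items():
    print(part, residual)
```

With these example partitions, the first one is pruned entirely, the second needs no filter at all (the residual is `True`), and the remaining ones each get a smaller predicate than the original. A real implementation would translate each residual into HBase filters and would have to handle byte-array keys and composite keys, which this sketch leaves out.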