There are a number of ways to bulk load.

There are bulk put, partition bulk put, MR bulk load, and now HBASE-14150,
which is Spark shuffle bulk load.

Let me know if I have missed a bulk loading option.  All of these are
possible with the new hbase-spark module.
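The shuffle-based option can be pictured with a small sketch. This is a
hypothetical, simplified illustration (the function and variable names are
mine, not the hbase-spark API): rows are shuffled so that each partition
holds the data for one region, sorted by row key, ready to be written out as
HFiles.

```python
from bisect import bisect_right

def assign_region(row_key, region_starts):
    # region_starts is sorted; a row belongs to the region whose start
    # key is the greatest one <= row_key
    return bisect_right(region_starts, row_key) - 1

def shuffle_for_bulk_load(rows, region_starts):
    # group rows by target region (the "shuffle"), then sort each group
    # by row key, as required before HFiles can be written
    partitions = {i: [] for i in range(len(region_starts))}
    for key, value in rows:
        partitions[assign_region(key, region_starts)].append((key, value))
    return {i: sorted(part) for i, part in partitions.items()}

region_starts = ["", "g", "p"]   # three regions: [-inf, g), [g, p), [p, +inf)
rows = [("zebra", 1), ("apple", 2), ("mango", 3), ("kiwi", 4)]
for region, sorted_rows in shuffle_for_bulk_load(rows, region_starts).items():
    print(region, sorted_rows)
```

The per-region sorting is what lets the resulting HFiles be handed directly
to the region servers, bypassing the normal write path.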

As for the filter push-down discussion in the earlier email: you will note
in HBASE-14181 that the filter push-down will also limit the scan range, or
drop the scan altogether in favor of Gets.
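To illustrate the scan-narrowing idea, here is a hypothetical sketch (the
names are mine, not the HBASE-14181 code): an equality predicate on the row
key becomes a point Get, a range predicate bounds the scan's start/stop
rows, and anything else falls back to a full scan with a server-side filter.

```python
def plan_scan(predicate):
    # predicate is None, ('=', key), ('between', lo, hi), or something else
    if predicate is None:
        return ("full_scan",)
    op = predicate[0]
    if op == "=":
        # equality on the row key: drop the scan entirely in favor of a Get
        return ("get", predicate[1])
    if op == "between":
        # range on the row key: limit the scan's start/stop rows
        return ("scan", predicate[1], predicate[2])
    # anything else: full scan with the predicate applied server-side
    return ("full_scan_with_filter", predicate)

print(plan_scan(("=", "row-42")))        # a point Get
print(plan_scan(("between", "a", "m")))  # a bounded scan
```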

Ted Malaska
On Aug 11, 2015 9:06 PM, "Yan Zhou.sc" <yan.zhou...@huawei.com> wrote:

> No, Astro's bulk loader does not use its own shuffle. But its
> map/reduce-side processing is somewhat different from that of HBase’s bulk
> loader, which is used by many HBase apps, I believe.
>
>
>
> *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
> *Sent:* Wednesday, August 12, 2015 8:56 AM
> *To:* Yan Zhou.sc
> *Cc:* dev@spark.apache.org; Ted Yu; Bing Xiao (Bing); user
> *Subject:* RE: Package Release Announcement: Spark SQL on HBase
> "Astro"
>
>
>
> The bulk load code is in HBASE-14150 if you are interested.  Let me know
> how it can be made faster.
>
> It's just a Spark shuffle and writing HFiles.  Unless Astro wrote its own
> shuffle, the times should be very close.
>
> On Aug 11, 2015 8:49 PM, "Yan Zhou.sc" <yan.zhou...@huawei.com> wrote:
>
> Ted,
>
>
>
> Thanks for pointing out more details of HBase-14181. I am afraid I may
> still need to learn more before I can make very accurate and pointed
> comments.
>
>
>
> As for filter push-down, Astro has a powerful approach that breaks down
> arbitrarily complex logical expressions composed of AND/OR/IN/NOT to
> generate partition-specific predicates to be pushed down to HBase. This
> may not be a significant performance improvement if the filter logic is
> simple and/or the processing is IO-bound, but it could be for online
> ad-hoc analysis.
>
>
>
> For UDFs, Astro supports them both inside and outside of HBase custom
> filters.
>
>
>
> For secondary index, Astro does not support it now. With probable support
> by HBase in the future (thanks to Ted Yu’s comments a while ago), we could
> add this support along with its specific optimizations.
>
>
>
> For bulk load, we believe Astro has a much faster way to load tabular
> data.
>
>
>
> Right now, Astro’s filter pushdown is through HBase built-in filters and
> custom filters.
>
>
>
> As for HBASE-14181, I see some overlaps with Astro. Both have dependencies
> on Spark SQL, both support Spark DataFrame as an access interface, and
> both support predicate pushdown.
>
> Astro is not designed for MR (or Spark’s equivalent) access though.
>
>
>
> If HBase-14181 is shooting for access to HBase data through a subset of
> DataFrame functionalities like filter, projection, and other map-side ops,
> would it be feasible to decouple it from Spark?
>
> My understanding is that 14181 does not run the Spark execution engine at
> all, but will make use of Spark DataFrame semantics and/or logical
> planning to pass a logical (sub-)plan to HBase. If true, it might be
> desirable to directly support DataFrame in HBase.
>
>
>
> Thanks,
>
>
>
>
>
> *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
> *Sent:* Wednesday, August 12, 2015 7:28 AM
> *To:* Yan Zhou.sc
> *Cc:* user; dev@spark.apache.org; Bing Xiao (Bing); Ted Yu
> *Subject:* RE: Package Release Announcement: Spark SQL on HBase
> "Astro"
>
>
>
> Hey Yan,
>
> I've been the one building out this Spark functionality in HBase, so maybe
> I can help clarify.
>
> The hbase-spark module is just focused on making Spark integration with
> HBase easy and out of the box, for both Spark and Spark Streaming.
>
> Neither I, nor I believe the HBase team, has any desire to build a SQL
> engine in HBase.  This JIRA comes the closest to that line.  The main
> thing here is filter push-down logic for basic SQL operations like =, >,
> and <.  User-defined functions and secondary indexes are not in my scope.
>
> Another main goal of the hbase-spark module is to allow a user to do
> anything they did with MR/HBase, now with Spark/HBase.  Things like bulk
> load.
>
> Let me know if you have any questions.
>
> Ted Malaska
>
> On Aug 11, 2015 7:13 PM, "Yan Zhou.sc" <yan.zhou...@huawei.com> wrote:
>
> We have not “formally” published any numbers yet. A good reference is the
> slide deck we posted for the meetup in March; or, better yet, interested
> parties can run performance comparisons by themselves for now.
>
>
>
> As for the status quo of Astro, we have been focusing on fixing bugs (a
> UDF-related bug in some coprocessor/custom filter combos), and adding
> support for querying string columns in HBase as integers from Astro.
>
>
>
> Thanks,
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Wednesday, August 12, 2015 7:02 AM
> *To:* Yan Zhou.sc
> *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
> *Subject:* Re: Package Release Announcement: Spark SQL on HBase
> "Astro"
>
>
>
> Yan:
>
> Where can I find performance numbers for Astro (it's close to middle of
> August) ?
>
>
>
> Cheers
>
>
>
> On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc <yan.zhou...@huawei.com>
> wrote:
>
> Finally I can take a look at HBASE-14181 now. Unfortunately there is no
> design doc mentioned. Superficially it is very similar to Astro, with the
> difference that this is part of the HBase client library, while Astro
> works as a Spark package and so will evolve and function more closely
> with Spark SQL/DataFrame instead of HBase.
>
>
>
> In terms of architecture, my take is loosely-coupled query engines on top
> of a KV store vs. an array of query engines supported by, and packaged as
> part of, a KV store.
>
>
>
> Functionality-wise the two could be close, but Astro also supports Python
> as a result of its tight integration with Spark.
>
> It will be interesting to see performance comparisons when HBase-14181 is
> ready.
>
>
>
> Thanks,
>
>
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Tuesday, August 11, 2015 3:28 PM
> *To:* Yan Zhou.sc
> *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
> *Subject:* Re: Package Release Announcement: Spark SQL on HBase "Astro"
>
>
>
> HBase will not have a query engine.
>
>
>
> It will provide better support to query engines.
>
>
>
> Cheers
>
>
> On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
>
> Ted,
>
>
>
> I’m in China now, and seem to be experiencing difficulty accessing Apache
> JIRA. Anyway, it appears to me that HBASE-14181
> <https://issues.apache.org/jira/browse/HBASE-14181> attempts to support
> Spark DataFrame inside HBase.
>
> If true, one question to me is whether HBase is intended to have a
> built-in query engine or not, or whether it will stick with the current
> approach: a k-v store with some built-in processing capabilities in the
> forms of coprocessors, custom filters, etc., which allows for
> loosely-coupled query engines built on top of it.
>
>
>
> Thanks,
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com <yuzhih...@gmail.com>]
> *Sent:* August 11, 2015 8:54
> *To:* Bing Xiao (Bing)
> *Cc:* dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
> *Subject:* Re: Package Release Announcement: Spark SQL on HBase "Astro"
>
>
>
> Yan / Bing:
>
> Mind taking a look at HBASE-14181
> <https://issues.apache.org/jira/browse/HBASE-14181> 'Add Spark DataFrame
> DataSource to HBase-Spark Module' ?
>
>
>
> Thanks
>
>
>
> On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) <bing.x...@huawei.com>
> wrote:
>
> We are happy to announce the availability of the Spark SQL on HBase 1.0.0
> release.
> http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
>
> The main features in this package, dubbed “Astro”, include:
>
> ·         Systematic and powerful handling of data pruning and
> intelligent scan, based on partial evaluation technique
>
> ·         HBase pushdown capabilities like custom filters and coprocessor
> to support ultra low latency processing
>
> ·         SQL and DataFrame support
>
> ·         More SQL capabilities made possible (Secondary index, bloom
> filter, Primary Key, Bulk load, Update)
>
> ·         Joins with data from other sources
>
> ·         Python/Java/Scala support
>
> ·         Support for the latest Spark 1.4.0 release
>
>
>
> The tests by the Huawei team and community contributors covered these
> areas: bulk load; projection pruning; partition pruning; partial
> evaluation; code generation; coprocessor; custom filtering; DML; complex
> filtering on keys and non-keys; join/union with non-HBase data; DataFrame;
> multi-column family tests.  We will post the test results, including
> performance tests, in the middle of August.
>
> You are very welcome to try out or deploy the package, and to help improve
> the integration tests with various combinations of the settings, extensive
> DataFrame tests, complex join/union tests and extensive performance tests.
> Please use the “Issues” and “Pull Requests” links at this package's
> homepage if you want to report bugs, improvements or feature requests.
>
> Special thanks to project owner and technical leader Yan Zhou, the Huawei
> global team, community contributors and Databricks.  Databricks has been
> providing great assistance from the design to the release.
>
> “Astro”, the Spark SQL on HBase package, will be useful for ultra-low
> latency query and analytics of large-scale data sets in vertical
> enterprises.  We will continue to work with the community to develop new
> features and improve the code base.  Your comments and suggestions are
> greatly appreciated.
>
>
>
> Yan Zhou / Bing Xiao
>
> Huawei Big Data team
>
>
>
>
>
>
>
