[jira] [Created] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

Sean Zhong (JIRA) Mon, 22 Aug 2016 08:20:36 -0700

Sean Zhong created SPARK-17187:
----------------------------------

             Summary: Support using arbitrary Java object as internal 
aggregation buffer object
                 Key: SPARK-17187
                 URL: https://issues.apache.org/jira/browse/SPARK-17187
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Sean Zhong



*Background*

For aggregation functions like sum and count, Spark-Sql internally use an 
aggregation buffer to store the intermediate aggregation result for all 
aggregation functions. Each aggregation function will occupy a section of 
aggregation buffer.

*Problem*

Currently, Spark-sql only allows a small set of Spark-Sql supported storage 
data types stored in aggregation buffer, which is not very convenient or 
performant, there are several typical cases like:

1. If the aggregation has a complex model CountMinSketch, it is not very easy 
to convert the complex model so that it can be stored with limited Spark-sql 
supported data types.
2. It is hard to reuse aggregation class definition defined in existing 
libraries like algebird.
3. It may introduces heavy serialization/deserialization cost when converting a 
domain model to Spark sql supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/de-serialization for each call of update or merge.

*Proposal*

We propose:
1. Introduces a TypedImperativeAggregate which allows using arbitrary java 
object as aggregation buffer, with requirements like:
    a. It is flexible enough that the API allows using any java object as 
aggregation buffer, so that it is easier to integrate with existing Monoid 
libraries like algebird.
    b. We don't need to call serialize/deserialize for each call of 
update/merge. Instead, only a few serialization/deserialization operations are 
needed. This is to guarantee the performance.
2.  Refactors `TypedAggregateExpression` to use this new interface, to get 
higher performance.
3. Implements Appro-Percentile and other aggregation functions which has a 
complex aggregation object with this new interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

Reply via email to