[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

Sean Zhong (JIRA) Mon, 22 Aug 2016 08:22:29 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sean Zhong updated SPARK-17187:
-------------------------------
    Description: 
*Background*

For aggregation functions like sum and count, Spark-Sql internally use an 
aggregation buffer to store the intermediate aggregation result for all 
aggregation functions. Each aggregation function will occupy a section of 
aggregation buffer.

*Problem*

Currently, Spark-sql only allows a small set of Spark-Sql supported storage 
data types stored in aggregation buffer, which is not very convenient or 
performant, there are several typical cases like:

1. If the aggregation has a complex model CountMinSketch, it is not very easy 
to convert the complex model so that it can be stored with limited Spark-sql 
supported data types.

2. It is hard to reuse aggregation class definition defined in existing 
libraries like algebird.

3. It may introduces heavy serialization/deserialization cost when converting a 
domain model to Spark sql supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/de-serialization for each call of update or merge.

*Proposal*

We propose:

1. Introduces a TypedImperativeAggregate which allows using arbitrary java 
object as aggregation buffer, with requirements like:
     -  It is flexible enough that the API allows using any java object as 
aggregation buffer, so that it is easier to integrate with existing Monoid 
libraries like algebird.
     -  We don't need to call serialize/deserialize for each call of 
update/merge. Instead, only a few serialization/deserialization operations are 
needed. This is to guarantee the    performance.

2.  Refactors `TypedAggregateExpression` to use this new interface, to get 
higher performance.

3. Implements Appro-Percentile and other aggregation functions which has a 
complex aggregation object with this new interface.

  was:
*Background*

For aggregation functions like sum and count, Spark-Sql internally use an 
aggregation buffer to store the intermediate aggregation result for all 
aggregation functions. Each aggregation function will occupy a section of 
aggregation buffer.

*Problem*

Currently, Spark-sql only allows a small set of Spark-Sql supported storage 
data types stored in aggregation buffer, which is not very convenient or 
performant, there are several typical cases like:

1. If the aggregation has a complex model CountMinSketch, it is not very easy 
to convert the complex model so that it can be stored with limited Spark-sql 
supported data types.

2. It is hard to reuse aggregation class definition defined in existing 
libraries like algebird.

3. It may introduces heavy serialization/deserialization cost when converting a 
domain model to Spark sql supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/de-serialization for each call of update or merge.

*Proposal*

We propose:

1. Introduces a TypedImperativeAggregate which allows using arbitrary java 
object as aggregation buffer, with requirements like:
     -  It is flexible enough that the API allows using any java object as 
aggregation buffer, so that it is easier to integrate with existing Monoid 
libraries like algebird.
     -  We don't need to call serialize/deserialize for each call of 
update/merge. Instead, only a few serialization/deserialization operations are 
needed. This is to guarantee the    performance.
2.  Refactors `TypedAggregateExpression` to use this new interface, to get 
higher performance.
3. Implements Appro-Percentile and other aggregation functions which has a 
complex aggregation object with this new interface.


> Support using arbitrary Java object as internal aggregation buffer object
> -------------------------------------------------------------------------
>
>                 Key: SPARK-17187
>                 URL: https://issues.apache.org/jira/browse/SPARK-17187
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Sean Zhong
>
> *Background*
> For aggregation functions like sum and count, Spark-Sql internally use an 
> aggregation buffer to store the intermediate aggregation result for all 
> aggregation functions. Each aggregation function will occupy a section of 
> aggregation buffer.
> *Problem*
> Currently, Spark-sql only allows a small set of Spark-Sql supported storage 
> data types stored in aggregation buffer, which is not very convenient or 
> performant, there are several typical cases like:
> 1. If the aggregation has a complex model CountMinSketch, it is not very easy 
> to convert the complex model so that it can be stored with limited Spark-sql 
> supported data types.
> 2. It is hard to reuse aggregation class definition defined in existing 
> libraries like algebird.
> 3. It may introduces heavy serialization/deserialization cost when converting 
> a domain model to Spark sql supported data type. For example, the current 
> implementation of `TypedAggregateExpression` requires 
> serialization/de-serialization for each call of update or merge.
> *Proposal*
> We propose:
> 1. Introduces a TypedImperativeAggregate which allows using arbitrary java 
> object as aggregation buffer, with requirements like:
>      -  It is flexible enough that the API allows using any java object as 
> aggregation buffer, so that it is easier to integrate with existing Monoid 
> libraries like algebird.
>      -  We don't need to call serialize/deserialize for each call of 
> update/merge. Instead, only a few serialization/deserialization operations 
> are needed. This is to guarantee the    performance.
> 2.  Refactors `TypedAggregateExpression` to use this new interface, to get 
> higher performance.
> 3. Implements Appro-Percentile and other aggregation functions which has a 
> complex aggregation object with this new interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

Reply via email to