[GitHub] spark pull request #14753: [SPARK-17187][SQL] Supports using arbitrary Java ...

yhuai Mon, 22 Aug 2016 16:14:10 -0700

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14753#discussion_r75776503
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala ---
    @@ -389,3 +389,148 @@ abstract class DeclarativeAggregate
         def right: AttributeReference = inputAggBufferAttributes(aggBufferAttributes.indexOf(a))
       }
     }
    +
    +/**
    + * Aggregation function which allows **arbitrary** user-defined java object to be used as internal
    + * aggregation buffer object.
    + *
    + * {{{
    + *                aggregation buffer for normal aggregation function `avg`
    + *                    |
    + *                    v
    + *                  +--------------+---------------+-----------------------------------+
    + *                  |  sum1 (Long) | count1 (Long) | generic user-defined java objects |
    + *                  +--------------+---------------+-----------------------------------+
    + *                                                     ^
    + *                                                     |
    + *                    Aggregation buffer object for `TypedImperativeAggregate` aggregation function
    + * }}}
    + *
    + * Work flow (Partial mode aggregate at Mapper side, and Final mode aggregate at Reducer side):
    + *
    + * Stage 1: Partial aggregate at Mapper side:
    + *
    + *  1. The framework calls `createAggregationBuffer(): T` to create an empty internal aggregation
    + *     buffer object.
    + *  2. Upon each input row, the framework calls
    + *     `update(buffer: T, input: InternalRow): Unit` to update the aggregation buffer object T.
    + *  3. After processing all rows of current group (group by key), the framework will serialize
    + *     aggregation buffer object T to SparkSQL internally supported underlying storage format, and
    + *     persist the serializable format to disk if needed.
    + *  4. The framework moves on to next group, until all groups have been processed.
    + *
    + * Shuffling exchange data to Reducer tasks...
    + *
    + * Stage 2: Final mode aggregate at Reducer side:
    + *
    + *  1. The framework calls `createAggregationBuffer(): T` to create an empty internal aggregation
    + *     buffer object (type T) for merging.
    + *  2. For each aggregation output of Stage 1, The framework de-serializes the storage
    + *     format and generates one input aggregation object (type T).
    + *  3. For each input aggregation object, the framework calls `merge(buffer: T, input: T): Unit`
    + *     to merge the input aggregation object into aggregation buffer object.
    + *  4. After processing all input aggregation objects of current group (group by key), the framework
    + *     calls method `eval(buffer: T)` to generate the final output for this group.
    + *  5. The framework moves on to next group, until all groups have been processed.
    + */
    +abstract class TypedImperativeAggregate[T >: Null] extends ImperativeAggregate {
    +
    +  /**
    +   * Spark Sql type of user-defined aggregation buffer object. It needs to be an `UserDefinedType`
    +   * so that the framework knows how to serialize the aggregation buffer object to Spark sql
    +   * internally supported storage format.
    +   */
    +  def aggregationBufferType: UserDefinedType[T]
    --- End diff --
    
    Let's not use UDT.
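
For readers following the quoted doc comment, the two-stage work flow it describes can be sketched as a small standalone toy with no Spark dependency. Everything here is hypothetical and for illustration only: `ToyTypedAggregate`, `SumCount`, `ToyAvg` and `Lifecycle` are invented names, plain `Long` values stand in for `InternalRow`, and the explicit `serialize`/`deserialize` pair that writes a byte array is an assumption about how a buffer could cross the shuffle. It is neither the API in this diff nor something the review comment proposes.

```scala
import java.nio.ByteBuffer

// Hypothetical stand-in for the PR's TypedImperativeAggregate; not Spark's API.
trait ToyTypedAggregate[T, In, Out] {
  def createAggregationBuffer(): T
  def update(buffer: T, input: In): Unit
  def merge(buffer: T, input: T): Unit
  def eval(buffer: T): Out
  // Assumed for illustration: an explicit byte-array storage format.
  def serialize(buffer: T): Array[Byte]
  def deserialize(bytes: Array[Byte]): T
}

// An arbitrary mutable object used as the aggregation buffer.
final class SumCount(var sum: Long, var count: Long)

object ToyAvg extends ToyTypedAggregate[SumCount, Long, Double] {
  def createAggregationBuffer(): SumCount = new SumCount(0L, 0L)

  def update(buffer: SumCount, input: Long): Unit = {
    buffer.sum += input
    buffer.count += 1
  }

  def merge(buffer: SumCount, input: SumCount): Unit = {
    buffer.sum += input.sum
    buffer.count += input.count
  }

  def eval(buffer: SumCount): Double =
    if (buffer.count == 0) Double.NaN else buffer.sum.toDouble / buffer.count

  // Two 8-byte longs: sum first, then count.
  def serialize(buffer: SumCount): Array[Byte] =
    ByteBuffer.allocate(16).putLong(buffer.sum).putLong(buffer.count).array()

  def deserialize(bytes: Array[Byte]): SumCount = {
    val in = ByteBuffer.wrap(bytes)
    new SumCount(in.getLong(), in.getLong())
  }
}

object Lifecycle {
  // Stage 1 (mapper side): create a buffer, update it per row, then serialize it.
  def partial(rows: Seq[Long]): Array[Byte] = {
    val buffer = ToyAvg.createAggregationBuffer()
    rows.foreach(row => ToyAvg.update(buffer, row))
    ToyAvg.serialize(buffer)
  }

  // Stage 2 (reducer side): create a buffer, merge each deserialized partial, then eval.
  def finalAgg(partials: Seq[Array[Byte]]): Double = {
    val buffer = ToyAvg.createAggregationBuffer()
    partials.foreach(p => ToyAvg.merge(buffer, ToyAvg.deserialize(p)))
    ToyAvg.eval(buffer)
  }
}
```

With this toy, `Lifecycle.finalAgg(Seq(Lifecycle.partial(Seq(1L, 2L, 3L)), Lifecycle.partial(Seq(4L, 5L))))` merges a (sum 6, count 3) buffer with a (sum 9, count 2) buffer and evaluates to 3.0.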


