[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-11-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17949:

Labels: releasenotes  (was: )

> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
>  Labels: releasenotes
> Fix For: 2.2.0
>
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> # For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> # For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce a JVM object based hash aggregate operator that 
> can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceeds a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-11-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17949:

Target Version/s: 2.2.0  (was: 2.1.0)

> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> # For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> # For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce a JVM object based hash aggregate operator that 
> can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceeds a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-10-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17949:
---
Description: 
The new Tungsten execution engine has very robust memory management and speed 
for simple data types. It does, however, suffer from the following:

# For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
fairly expensive to fit into the Tungsten internal format.
# For aggregate functions that require complex intermediate data structures, 
Unsafe (on raw bytes) is not a good programming abstraction due to the lack of 
structs.

The idea here is to introduce an JVM object based hash aggregate operator that 
can support the aforementioned use cases. This operator, however, should limit 
its memory usage to avoid putting too much pressure on GC, e.g. falling back to 
sort-based aggregate as soon the number of objects exceed a very low threshold.

Internally at Databricks we prototyped a version of this for a customer POC and 
have observed substantial speed-ups over existing Spark.



  was:
The new Tungsten execution engine has very robust memory management and speed 
for simple data types. It does, however, suffer from the following:

1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is 
fairly expensive to fit into the Tungsten internal format.

2. For aggregate functions that require complex intermediate data structures, 
Unsafe (on raw bytes) is not a good programming abstraction due to the lack of 
structs.

The idea here is to introduce an JVM object based hash aggregate operator that 
can support the aforementioned use cases. This operator, however, should limit 
its memory usage to avoid putting too much pressure on GC, e.g. falling back to 
sort-based aggregate as soon the number of objects exceed a very low threshold.

Internally at Databricks we prototyped a version of this for a customer POC and 
have observed substantial speed-ups over existing Spark.




> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> # For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> # For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce an JVM object based hash aggregate operator 
> that can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceed a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-10-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17949:
---
Description: 
The new Tungsten execution engine has very robust memory management and speed 
for simple data types. It does, however, suffer from the following:

# For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
fairly expensive to fit into the Tungsten internal format.
# For aggregate functions that require complex intermediate data structures, 
Unsafe (on raw bytes) is not a good programming abstraction due to the lack of 
structs.

The idea here is to introduce a JVM object based hash aggregate operator that 
can support the aforementioned use cases. This operator, however, should limit 
its memory usage to avoid putting too much pressure on GC, e.g. falling back to 
sort-based aggregate as soon the number of objects exceeds a very low threshold.

Internally at Databricks we prototyped a version of this for a customer POC and 
have observed substantial speed-ups over existing Spark.



  was:
The new Tungsten execution engine has very robust memory management and speed 
for simple data types. It does, however, suffer from the following:

# For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
fairly expensive to fit into the Tungsten internal format.
# For aggregate functions that require complex intermediate data structures, 
Unsafe (on raw bytes) is not a good programming abstraction due to the lack of 
structs.

The idea here is to introduce an JVM object based hash aggregate operator that 
can support the aforementioned use cases. This operator, however, should limit 
its memory usage to avoid putting too much pressure on GC, e.g. falling back to 
sort-based aggregate as soon the number of objects exceed a very low threshold.

Internally at Databricks we prototyped a version of this for a customer POC and 
have observed substantial speed-ups over existing Spark.




> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> # For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> # For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce a JVM object based hash aggregate operator that 
> can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceeds a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-10-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17949:
---
Attachment: [Design Doc] Support for Arbitrary Aggregation States.pdf

> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> 1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> 2. For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce an JVM object based hash aggregate operator 
> that can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceed a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-10-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17949:
---
Attachment: (was: [Design Doc] Support for Arbitrary Aggregation 
States.pdf)

> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> 1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> 2. For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce an JVM object based hash aggregate operator 
> that can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceed a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-10-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17949:
---
Attachment: [Design Doc] Support for Arbitrary Aggregation States.pdf

> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> 1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> 2. For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce an JVM object based hash aggregate operator 
> that can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceed a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org