[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122291#comment-15122291 ]

Herman van Hovell commented on SPARK-9301:

You could implement this as an {{ImperativeAggregate}}: make sure it does not support partial aggregation (override {{supportsPartial}}) and maintain the state in the class itself. Look at {{org.apache.spark.sql.hive.HiveUDAFFunction}} for an example. It won't be quick, but it should work (as long as the size of the collection doesn't cause OOMEs).

> collect_set and collect_list aggregate functions
>
> Key: SPARK-9301
> URL: https://issues.apache.org/jira/browse/SPARK-9301
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Yin Huai
> Assignee: Nick Buroojy
> Priority: Critical
> Fix For: 1.6.0
>
> A short introduction on how to build aggregate functions based on our new
> interface can be found at
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
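The idea in the comment above (keep the collected values in the aggregate instance itself and skip partial aggregation) can be sketched in plain Scala. This is only an illustration under stated assumptions: the `NonPartialAggregate` trait, the `CollectList` and `CollectSet` classes, and `collectListByKey` are hypothetical names invented here, and none of this is Spark's actual {{ImperativeAggregate}} interface.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an aggregate that keeps its state in the
// instance itself. This is NOT Spark's ImperativeAggregate interface.
trait NonPartialAggregate[In, Out] {
  def update(input: In): Unit // called once per input row of a group
  def evaluate(): Out         // called once, after the last row
}

// collect_list: append each value to a growable buffer held by the instance.
final class CollectList[T] extends NonPartialAggregate[T, List[T]] {
  private val buffer = mutable.ArrayBuffer.empty[T]
  override def update(input: T): Unit = buffer += input
  override def evaluate(): List[T] = buffer.toList
}

// collect_set: same idea with a set, so duplicates are dropped.
final class CollectSet[T] extends NonPartialAggregate[T, Set[T]] {
  private val seen = mutable.LinkedHashSet.empty[T]
  override def update(input: T): Unit = seen += input
  override def evaluate(): Set[T] = seen.toSet
}

object NonPartialDemo {
  // Runs one fresh aggregate instance per key and never merges two
  // instances, mimicking an aggregation without a partial/merge phase.
  def collectListByKey[K, V](rows: Seq[(K, V)]): Map[K, List[V]] = {
    val aggs = mutable.LinkedHashMap.empty[K, CollectList[V]]
    for ((k, v) <- rows) aggs.getOrElseUpdate(k, new CollectList[V]).update(v)
    aggs.map { case (k, agg) => k -> agg.evaluate() }.toMap
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq("a" -> 1, "b" -> 2, "a" -> 3, "a" -> 1)
    println(collectListByKey(rows))
  }
}
```

Because each instance owns a mutable buffer, an update is an in-place append rather than a rewrite of the whole collection. The trade-off mentioned in the comment still applies: all values of a group live in memory at once.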
[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121968#comment-15121968 ]

Cristian commented on SPARK-9301:

Seconded. It looks like MutableAggregationBuffer is not so mutable after all: everything gets converted to Catalyst types and back every time, which makes it impossible to implement anything that collects a larger amount of data to evaluate later.
[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121476#comment-15121476 ]

Justin Uang commented on SPARK-9301:

Yeah, my workaround has been to JSON-ify the struct into a string first, then do the aggregate, then unpack it, which is obviously far from ideal. Also, using Hive makes my unit tests take 25 seconds to start up instead of 3 seconds.
[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121473#comment-15121473 ]

Maciej Bryński commented on SPARK-9301:

Moreover, the version from Hive doesn't work with struct types: https://issues.apache.org/jira/browse/SPARK-10605
[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121275#comment-15121275 ]

Justin Uang commented on SPARK-9301:

Do we have a plan for implementing these natively in Spark SQL? I imagine that this code will have terrible performance implications, since every time we call update(), we're probably doing a full copy of the array/seq.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MyUDAF extends UserDefinedAggregateFunction {
  override def inputSchema: StructType =
    StructType(List(StructField("input", StringType)))

  // The buffer holds a single array column with the values collected so far.
  override def bufferSchema: StructType =
    StructType(List(StructField("list", ArrayType(StringType))))

  override def dataType: DataType = ArrayType(StringType)

  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0, Seq.empty[String])
  }

  // Reads the whole sequence, prepends one value, and writes it back.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer.update(0, input.get(0) +: buffer.getSeq[String](0))
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
  }

  override def evaluate(buffer: Row): Any = {
    buffer.get(0)
  }
}
{code}
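To make the performance concern above concrete, here is a small plain-Scala sketch that counts element copies when a buffer converts its contents on every read and every write. The `ConvertingBuffer` class and `copiesFor` function are hypothetical and only model the behaviour described in this thread; they are not Spark's implementation, and the real per-row cost in Spark may differ.

```scala
object CopyCostDemo {
  // Hypothetical stand-in for a buffer that converts (copies) its contents
  // on every read and every write. Not Spark's MutableAggregationBuffer.
  final class ConvertingBuffer {
    private var stored: Array[String] = Array.empty[String]
    var elementsCopied: Long = 0L

    def getSeq: Seq[String] = {
      elementsCopied += stored.length // copy out on read
      stored.toVector
    }

    def update(value: Seq[String]): Unit = {
      elementsCopied += value.length // copy in on write
      stored = value.toArray
    }
  }

  // Applies the read, prepend, write-back pattern n times and returns the
  // total number of element copies performed by the buffer.
  def copiesFor(n: Int): Long = {
    val buffer = new ConvertingBuffer
    for (i <- 0 until n) buffer.update(s"row$i" +: buffer.getSeq)
    buffer.elementsCopied
  }

  def main(args: Array[String]): Unit = {
    Seq(10, 100, 1000).foreach(n => println(s"n=$n copies=${copiesFor(n)}"))
  }
}
```

In this model, update number i (counting from 0) copies i elements out and i + 1 elements back in, so n updates copy 1 + 3 + ... + (2n - 1) = n^2 elements in total. That quadratic growth is why collecting a large group through this pattern is expected to be slow.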
[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994262#comment-14994262 ]

Apache Spark commented on SPARK-9301:

User 'nburoojy' has created a pull request for this issue: https://github.com/apache/spark/pull/9526
[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900823#comment-14900823 ]

Nick Buroojy commented on SPARK-9301:

I sent a pull request to add these aggregates on the new API; however, I now see that this may be blocked by SPARK-9830 (https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14728451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14728451). Let me know if the next step is to wait for the blocking change.
[jira] [Commented] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729852#comment-14729852 ]

Apache Spark commented on SPARK-9301:

User 'nburoojy' has created a pull request for this issue: https://github.com/apache/spark/pull/8592