[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353142#comment-15353142 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 6/28/16 3:03 PM:

Thank you [~timhunter] for sharing this information with us. It is a nice idea, and I think it could be seen as an extension of gapply's current implementation. In general, whether the keys are useful depends on the use case. Most probably the user would naturally like to see the matching key for each group's output, so it would make sense to attach/append the keys by default. If the user doesn't need the keys, he or she can easily drop those columns.

> Implement gapply() on DataFrame in SparkR
> -----------------------------------------
>
>                 Key: SPARK-12922
>                 URL: https://issues.apache.org/jira/browse/SPARK-12922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>    Affects Versions: 1.6.0
>            Reporter: Sun Rui
>            Assignee: Narine Kokhlikyan
>             Fix For: 2.0.0
>
>
> gapply() applies an R function to groups formed by one or more columns of a
> DataFrame, and returns a DataFrame. It is like GroupedDataset.flatMapGroups()
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema)
> {code}
> R function input: the grouping key's value and a local data.frame of the grouped data.
> R function output: a local data.frame.
> The schema specifies the Row format of the R function's output and must match it.
> Note that map-side combination (partial aggregation) is not supported; the user
> can do map-side combination via dapply().

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
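The split-apply-combine semantics described above, including attaching the grouping key to each group's output by default, can be sketched in plain Python. This is only an illustration of the behavior, not the SparkR implementation; the name `gapply_local` and its signature are hypothetical.

```python
from collections import defaultdict

def gapply_local(rows, key_fn, func):
    """Group rows by key, apply func(key, group) to each group,
    and concatenate the per-group outputs with the key attached."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    out = []
    for key, group in groups.items():
        # func returns a list of output rows for this group
        for result in func(key, group):
            # attach the grouping key by default, as discussed above
            out.append((key,) + tuple(result))
    return out

# Example: per-group mean of the second field, grouped by the first
rows = [("a", 1.0), ("a", 3.0), ("b", 5.0)]
result = gapply_local(
    rows,
    key_fn=lambda r: r[0],
    func=lambda key, group: [(sum(v for _, v in group) / len(group),)],
)
print(result)  # [('a', 2.0), ('b', 5.0)]
```

If the keys were not needed, the caller would simply drop the first element of each output tuple, mirroring dropping the key columns in SparkR.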
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333125#comment-15333125 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 6/16/16 5:25 AM:

FYI, [~olarayej], [~aloknsingh], [~vijayrb] :)
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264786#comment-15264786 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 4/29/16 10:01 PM:

I think it is better to use TypedColumns, something similar to:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L264

I don't think there is support for typed columns in SparkR, is there? In that case we could create an encoder similar to:
ExpressionEncoder.tuple(ExpressionEncoder[String], ExpressionEncoder[Int], ExpressionEncoder[Double])

Is there a way to access the mapping between Spark and Scala types? For example: IntegerType (Spark) -> Int (Scala)

Thank you!
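The Spark-SQL-to-Scala type correspondence asked about above can be sketched as a simple lookup table. This is only an illustration of the mapping for a few common primitive types, not a Spark API; internally Spark derives it through Catalyst encoders, and the helper name `scala_type_for` is hypothetical.

```python
# Illustrative mapping from Spark SQL types to Scala types
# (covers a few common primitives; not an actual Spark API).
SPARK_TO_SCALA = {
    "IntegerType": "Int",
    "LongType": "Long",
    "DoubleType": "Double",
    "BooleanType": "Boolean",
    "StringType": "String",
}

def scala_type_for(spark_type):
    """Return the Scala type name for a Spark SQL type name."""
    return SPARK_TO_SCALA[spark_type]

print(scala_type_for("IntegerType"))  # Int
```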
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233886#comment-15233886 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 4/10/16 7:23 AM:

Hi [~sunrui],

I have a question regarding your suggestion about adding a new "GroupedData.flatMapRGroups" function according to the following document:
https://docs.google.com/presentation/d/1oj17N5JaE8JDjT2as_DUI6LKutLcEHNZB29HsRGL_dM/edit#slide=id.p9

It seems that some changes have happened in SparkSQL. In 1.6.1 there was a Scala class:
https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
This doesn't seem to exist in 2.0.0.

I was thinking of adding the flatMapRGroups helper function to org.apache.spark.sql.KeyValueGroupedDataset or org.apache.spark.sql.RelationalGroupedDataset. What do you think?

Thank you,
Narine
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163598#comment-15163598 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 2/24/16 7:48 PM:

Hi [~sunrui],

I looked at the implementation proposal and it looks good to me. But I think it would be good to add some details about the aggregation of the data frames which we receive from the workers.

I've tried to draw a diagram for the group-apply example in order to understand the bigger picture:
https://docs.google.com/document/d/1z-sghU8wYKW-oNOajzFH02X0CP9Vd67cuJ085e93vZ8/edit

Please let me know if I've misunderstood something.

Thanks,
Narine
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158109#comment-15158109 ]

Sun Rui edited comment on SPARK-12922 at 2/23/16 1:44 AM:

[~Narine], yes, this depends on https://issues.apache.org/jira/browse/SPARK-12792. I will do dapply(), and you can feel free to work on this one by creating a working branch based on the PR for SPARK-12792. Could you review the implementation design doc before starting?

[~shivaram] Could you help review the PR for SPARK-12792 and merge it ASAP? We want to get SparkR UDF done in Spark 2.0.
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157373#comment-15157373 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 2/22/16 5:47 PM:

Thanks for creating this JIRA, [~sunrui]. Have you already started working on this? It most probably depends on [https://issues.apache.org/jira/browse/SPARK-12792]. We need this as soon as possible, and I might start working on it. Do you have an estimate of how long it will take to get [https://issues.apache.org/jira/browse/SPARK-12792] reviewed?

cc: [~shivaram]

Thanks,
Narine