[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching
[ https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990050#comment-14990050 ]

Yin Huai commented on SPARK-4849:
---------------------------------

Oh, actually SPARK-5354. SPARK-11410 adds APIs to specify how to shuffle data.

> Pass partitioning information (distribute by) to In-memory caching
> ------------------------------------------------------------------
>
>                 Key: SPARK-4849
>                 URL: https://issues.apache.org/jira/browse/SPARK-4849
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Nitin Goyal
>            Assignee: Nong Li
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> HQL "distribute by <column_name>" partitions data based on the specified column
> values. We can pass this information to in-memory caching for further
> performance improvements, e.g. in joins, an extra partition step can be
> saved based on this information.
> Refer -
> http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching
[ https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990040#comment-14990040 ]

Reynold Xin commented on SPARK-4849:
------------------------------------

This has been fixed in SPARK-11410.
[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching
[ https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14634316#comment-14634316 ]

Cristian commented on SPARK-4849:
---------------------------------

I would argue that the priority for this is not Minor, since resolving it would enable many use cases where data can be stored in memory and queried repeatedly at low latency.

For example, in Spark Streaming applications it's common to join incoming data with a memory-resident dataset for enrichment. If that join could be performed without a shuffle, it would enable important use cases which are currently too high-latency to implement with Spark.

It appears this is also a fairly straightforward fix, so is there any chance it can get some priority?

> Pass partitioning information (distribute by) to In-memory caching
> ------------------------------------------------------------------
>
>                 Key: SPARK-4849
>                 URL: https://issues.apache.org/jira/browse/SPARK-4849
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Nitin Goyal
>            Priority: Minor
>
> HQL "distribute by <column_name>" partitions data based on the specified column
> values. We can pass this information to in-memory caching for further
> performance improvements, e.g. in joins, an extra partition step can be
> saved based on this information.
> Refer -
> http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html
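To illustrate why a join against a memory-resident dataset needs no shuffle once both sides share the same key-based layout, here is a minimal sketch in plain Scala. It does not use Spark at all; the object `CoPartitionedJoin` and its helpers `partitionOf`, `partitionByKey` and `joinCoPartitioned` are hypothetical names invented for this illustration, and the hash-modulo scheme is only an assumption about how a "distribute by" layout might be modelled.

```scala
object CoPartitionedJoin {
  // Map a key to one of numPartitions buckets. Both sides of the join
  // must use the same function for their layouts to line up.
  def partitionOf(key: String, numPartitions: Int): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h
  }

  // Split rows into buckets by key. In this toy model this is the
  // "shuffle" step that a cached, already-distributed dataset pays once.
  def partitionByKey[V](rows: Seq[(String, V)], numPartitions: Int): Vector[Seq[(String, V)]] = {
    val grouped = rows.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty))
  }

  // Join two datasets that are partitioned the same way. Partition i of the
  // left side only needs partition i of the right side, so no row has to
  // move between partitions.
  def joinCoPartitioned[A, B](
      left: Vector[Seq[(String, A)]],
      right: Vector[Seq[(String, B)]]): Seq[(String, A, B)] = {
    require(left.length == right.length, "both sides must have the same number of partitions")
    left.zip(right).flatMap { case (l, r) =>
      val lookup = r.groupBy(_._1)
      l.flatMap { case (k, a) =>
        lookup.getOrElse(k, Seq.empty).map { case (_, b) => (k, a, b) }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val n = 4
    // Memory-resident reference data, partitioned once and reused.
    val reference = partitionByKey(Seq("u1" -> "Alice", "u2" -> "Bob", "u3" -> "Carol"), n)
    // An incoming batch, partitioned with the same function.
    val batch = partitionByKey(Seq("u2" -> 10, "u3" -> 7, "u9" -> 1), n)
    joinCoPartitioned(batch, reference).foreach(println)
  }
}
```

The point of the sketch is that the reference side is partitioned only once. If the cache forgets that layout, every join has to repartition the reference data again, which is the extra step this issue asks to avoid.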
[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching
[ https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594193#comment-14594193 ]

Eric Pederson commented on SPARK-4849:
--------------------------------------

Does this also apply to in-memory tables created as the result of caching partitioned Hive tables? For example, say {{hivetable}} is partitioned by {{(s string, c string)}}.

{code}
import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext
val sql = new HiveContext(sc)
val t = sql.table("hivetable")
// cache the table in memory
val c1 = t.cache()
// filter on the two partition columns
val f1 = c1.filter("s = 'FNM30' and c = '3.0'")
val s1 = f1.groupBy("g").sum("a", "b", "c")
{code}

Should it be able to prune parts of {{c1}} because of the original partitioning?
[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching
[ https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247184#comment-14247184 ]

Michael Armbrust commented on SPARK-4849:
-----------------------------------------

The trick here will be to make sure that the outputPartitioning is correctly output from the InMemoryColumnarTableScan.
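The idea in this comment can be sketched with a toy planner model in plain Scala. This is not Spark's code: `Plan`, `Scan`, `DistributeBy`, `CachedScan`, `HashPartitioning` and `needsShuffle` are hypothetical stand-ins invented for illustration, and the `propagate` flag is an assumption used only to contrast a cached scan that forgets its child's layout with one that reports it.

```scala
object PartitioningSketch {
  // How the rows produced by a plan node are laid out across partitions.
  sealed trait Partitioning
  case object UnknownPartitioning extends Partitioning
  final case class HashPartitioning(columns: Seq[String], numPartitions: Int) extends Partitioning

  sealed trait Plan { def outputPartitioning: Partitioning }

  // A plain table scan: nothing is known about its layout.
  final case class Scan(table: String) extends Plan {
    val outputPartitioning: Partitioning = UnknownPartitioning
  }

  // Models "distribute by": the output is hash-partitioned on the given columns.
  final case class DistributeBy(columns: Seq[String], numPartitions: Int, child: Plan) extends Plan {
    val outputPartitioning: Partitioning = HashPartitioning(columns, numPartitions)
  }

  // Models a scan over cached data. With propagate = false the layout of the
  // cached plan is forgotten; with propagate = true it is reported to the planner.
  final case class CachedScan(child: Plan, propagate: Boolean) extends Plan {
    def outputPartitioning: Partitioning =
      if (propagate) child.outputPartitioning else UnknownPartitioning
  }

  // A join on joinKeys needs an extra shuffle unless its input is already
  // hash-partitioned on exactly those keys with the same partition count.
  def needsShuffle(input: Plan, joinKeys: Seq[String], numPartitions: Int): Boolean =
    input.outputPartitioning != HashPartitioning(joinKeys, numPartitions)
}
```

In this model, `needsShuffle(CachedScan(DistributeBy(Seq("user_id"), 8, Scan("events")), propagate = true), Seq("user_id"), 8)` is false, while the same call with `propagate = false` is true. That difference is the extra partition step the issue description wants to save.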