[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2015-11-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990050#comment-14990050
 ] 

Yin Huai commented on SPARK-4849:
-

Oh, actually it was fixed by SPARK-5354. SPARK-11410 adds APIs to specify how to shuffle data.
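
As a rough illustration of the kind of API mentioned above, the sketch below repartitions a DataFrame by a column (the DataFrame counterpart of HiveQL DISTRIBUTE BY), caches it, and joins against it. It assumes Spark 2.x or later with `SparkSession`; the data, the column names, the app name, and the local master are hypothetical. Broadcast joins are disabled only so that the toy data does not hide the partitioning behaviour. Whether the planner reuses the cached partitioning depends on the Spark version and configuration, so the plan is printed for inspection rather than asserted.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("distribute-by-cache-sketch")
  // Disable broadcast joins so the tiny example data does not mask shuffles.
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()
import spark.implicits._

// Hypothetical reference data keyed by "id".
val reference = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

// Hash-partition the rows on "id", then cache and materialize the result.
val cached = reference.repartition($"id").cache()
cached.count()

// Join another (hypothetical) dataset on the same key and print the physical plan.
val incoming = Seq((1, 10.0), (3, 30.0)).toDF("id", "value")
val joined = incoming.join(cached, "id")
joined.explain()
```

Reading the printed plan shows whether an exchange is still planned on the cached side of the join.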

> Pass partitioning information (distribute by) to In-memory caching
> --
>
> Key: SPARK-4849
> URL: https://issues.apache.org/jira/browse/SPARK-4849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Nitin Goyal
>Assignee: Nong Li
>Priority: Critical
> Fix For: 1.6.0
>
>
> HQL "distribute by column_name" partitions data based on the specified column 
> values. We can pass this information to in-memory caching for further 
> performance improvements, e.g. in joins, an extra partition step can be 
> saved based on this information.
> Refer - 
> http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2015-11-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990040#comment-14990040
 ] 

Reynold Xin commented on SPARK-4849:


This has been fixed in SPARK-11410.





[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2015-07-20 Thread Cristian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634316#comment-14634316
 ] 

Cristian commented on SPARK-4849:
-

I would argue that the priority for this is not Minor since if resolved it will 
enable many use cases where data can be stored in memory and queried repeatedly 
at low latency. 

For example with Spark Streaming applications, it's common to join incoming 
data with a memory resident dataset for enrichment. If that join can be 
performed without a shuffle it would enable important use cases which are 
currently too high-latency to implement with Spark.

It also appears to be a fairly straightforward fix, so is there any chance it 
can get some priority?
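
The shuffle-free join described above can be sketched at the RDD level, where co-partitioning is explicit. This is only an analogue of the idea, not the Spark SQL caching path this issue is about, and the data, names, and partition count are hypothetical. When both sides share the same partitioner, the join keeps that partitioner and needs no further shuffle of the persisted side.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical local context, for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("copartitioned-join-sketch")
val sc = SparkContext.getOrCreate(conf)

val partitioner = new HashPartitioner(4)

// Hypothetical memory-resident enrichment data, partitioned once and persisted.
val reference = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
  .partitionBy(partitioner)
  .persist()
reference.count() // materialize the persisted partitions

// Hypothetical incoming batch, partitioned the same way.
val incoming = sc.parallelize(Seq(1 -> 10.0, 3 -> 30.0)).partitionBy(partitioner)

// Both inputs share the partitioner, so the join reuses it instead of reshuffling.
val joined = incoming.join(reference)
println(joined.partitioner)
println(joined.collect().toSeq)
```

Only the incoming batch is moved by `partitionBy`; the persisted reference data stays where it is.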

 Pass partitioning information (distribute by) to In-memory caching
 --

 Key: SPARK-4849
 URL: https://issues.apache.org/jira/browse/SPARK-4849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Nitin Goyal
Priority: Minor

 HQL "distribute by column_name" partitions data based on the specified column 
 values. We can pass this information to in-memory caching for further 
 performance improvements, e.g. in joins, an extra partition step can be 
 saved based on this information.
 Refer - 
 http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html






[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2015-06-19 Thread Eric Pederson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594193#comment-14594193
 ] 

Eric Pederson commented on SPARK-4849:
--

Does this also apply to in-memory tables created by caching partitioned 
Hive tables?

For example, say {{hivetable}} is partitioned by {{(s string, c string)}}.

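{code}
import org.apache.spark.sql.hive.HiveContext

val sql = new HiveContext(sc)
val t = sql.table("hivetable")
val c1 = t.cache()

// Filter on the partition columns, then aggregate.
val f1 = c1.filter("s = 'FNM30' and c = '3.0'")
// c is left out of the sum: it is a string partition column, and sum needs numeric columns.
val s1 = f1.groupBy("g").sum("a", "b")
{code}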

Should it be able to prune parts of {{c1}} because of the original partitioning?




[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2014-12-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247184#comment-14247184
 ] 

Michael Armbrust commented on SPARK-4849:
-

The trick here will be to make sure that the outputPartitioning is correctly 
output from the InMemoryColumnarTableScan.
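
One way to check what a cached scan reports is to read the physical plan's `outputPartitioning` directly. The sketch below assumes Spark 2.x or later with `SparkSession`; `queryExecution` and `executedPlan` are developer-facing APIs whose output varies across versions, and the data, column names, and partition count are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("output-partitioning-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data, hash-partitioned on "id" into 4 partitions and cached.
val cached = Seq((1, "a"), (2, "b")).toDF("id", "label")
  .repartition(4, $"id")
  .cache()
cached.count() // materialize the cache

// Partitioning reported by the physical plan that scans the cached data.
val partitioning = cached.queryExecution.executedPlan.outputPartitioning
println(partitioning)
```

If the scan reports a hash partitioning on `id`, the planner has the information it needs to skip a redundant exchange before a join or aggregation on that column.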
