[ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634316#comment-14634316
 ] 

Cristian commented on SPARK-4849:
---------------------------------

I would argue that the priority for this should be higher than Minor: if 
resolved, it would enable many use cases where data is stored in memory and 
queried repeatedly at low latency. 

For example, in Spark Streaming applications it is common to join incoming 
data with a memory-resident dataset for enrichment. If that join could be 
performed without a shuffle, it would enable important use cases that are 
currently too high-latency to implement with Spark.
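
To illustrate the idea, here is a minimal sketch in plain Python. It is not 
Spark code and uses no Spark API; the partition count, function names, and 
sample records are all hypothetical. It assumes the cached dataset is 
hash-partitioned once on the join key, so each incoming batch only needs to be 
routed to the matching partition and joined locally, while the cached side is 
never repartitioned.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Deterministic hash partitioner shared by both sides of the join."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def partition_by_key(records, num_partitions=NUM_PARTITIONS):
    """Split (key, value) pairs into per-partition maps of key -> values."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_of(key, num_partitions)][key].append(value)
    return partitions


def local_join(cached_partitions, batch_partitions):
    """Join partition i of the batch only against partition i of the cache."""
    joined = []
    for cached, batch in zip(cached_partitions, batch_partitions):
        for key, batch_values in batch.items():
            # .get avoids creating empty entries in the cached partition
            for cached_value in cached.get(key, []):
                for batch_value in batch_values:
                    joined.append((key, batch_value, cached_value))
    return joined


# The reference data is partitioned once and then kept in memory.
reference = [("u1", "Alice"), ("u2", "Bob"), ("u3", "Carol")]
cached_partitions = partition_by_key(reference)

# Each incoming batch is routed with the same partitioner; the cached
# side is reused as-is and is not partitioned again.
batch = [("u2", "click"), ("u1", "view"), ("u9", "click")]
result = sorted(local_join(cached_partitions, partition_by_key(batch)))
print(result)  # [('u1', 'view', 'Alice'), ('u2', 'click', 'Bob')]
```

The point of the sketch is that the partitioning of the cached side has to be 
known at join time; if it is lost when the data is cached, both sides must be 
partitioned again for every batch.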

This also appears to be a fairly straightforward fix, so is there any chance 
it can get some priority?

> Pass partitioning information (distribute by) to In-memory caching
> ------------------------------------------------------------------
>
>                 Key: SPARK-4849
>                 URL: https://issues.apache.org/jira/browse/SPARK-4849
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Nitin Goyal
>            Priority: Minor
>
> HQL "distribute by <column_name>" partitions data based on the specified 
> column values. We can pass this information to in-memory caching for further 
> performance improvements. For example, in joins an extra partitioning step 
> can be avoided based on this information.
> See: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
