[ 
https://issues.apache.org/jira/browse/SPARK-8133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evan Chan reopened SPARK-8133:
------------------------------

I think this is worth looking into again - for streaming.  My team is creating 
spark streaming pipelines that do aggregations.  For correctness and 
efficiency, if we can maintain a cache of current aggregation values across 
micro batches, then we can lower the load on datastores and improve performance 
- without making the batch size too big (which leads to other problems).    
Using Kafka to partition does not solve this problem because we need to do 
groupBys, sorts etc on the incoming stream, so in particular we want the sorted 
output to be "sticky" to a particular node.

Maintaining a cache or in-memory state requires "stickiness" of partitions to 
nodes.  We are exploring two avenues to do this and can contribute it back.

1) By modifying the TaskSchedulerImpl, we can avoid shuffles of tasks when 
allocating executors/workers.  This solves stickiness for clusters where the 
number of executors will not change.

2) Using a custom ShuffledRDD (or derived class) which can place the shuffled 
data partition on the same node given the same range of keys (assume 
HashPartitioner with constant number of partitions).

> sticky partitions
> -----------------
>
>                 Key: SPARK-8133
>                 URL: https://issues.apache.org/jira/browse/SPARK-8133
>             Project: Spark
>          Issue Type: New Feature
>          Components: DStreams
>    Affects Versions: 1.3.1
>            Reporter: sid
>
> We are trying to replace Apache Storm with Apache Spark streaming.
> In storm; we partitioned stream based on "Customer ID" so that msgs with a 
> range of "customer IDs" will be routed to same bolt (worker).
> We do this because each worker will cache customer details (from DB).
> So we split into 4 partitions and each bolt (worker) will have 1/4 of the 
> entire range.
> I am hoping we have a solution to this in Spark Streaming



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to