Did using your own partitioner not work? For example: YourRDD.partitionBy(new HashPartitioner(<your number of partitions>))
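A minimal sketch of how that suggestion might be applied to the sensor-range layout asked about in the quoted message. The names `SensorRangePartitioner`, `StatePartitioning`, `keepLatest` and `trackLatest` are hypothetical, and the key and value types (`Int` sensor id, `Double` measure) are assumptions, since the original code elides them. It relies on the `updateStateByKey` overload that accepts a `Partitioner`. A plain `new HashPartitioner(10)` could be passed the same way. Note that a partitioner fixes which partition a key lands in, not which node runs that partition, so this is not a guarantee of node placement.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.streaming.dstream.DStream
// On Spark versions before 1.3, the pair-DStream implicits also need:
//   import org.apache.spark.streaming.StreamingContext._

// Hypothetical partitioner: sensor ids 0-999 map to partition 0,
// 1000-1999 to partition 1, and so on. Out-of-range ids are clamped
// into the first or last partition.
class SensorRangePartitioner(val numParts: Int, val sensorsPerPartition: Int)
    extends Partitioner {
  require(numParts > 0 && sensorsPerPartition > 0)

  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = key match {
    case id: Int =>
      math.min(math.max(id, 0) / sensorsPerPartition, numParts - 1)
    case other =>
      throw new IllegalArgumentException(s"Unexpected key: $other")
  }

  // Equal partitioners let Spark recognise RDDs as already co-partitioned.
  override def equals(other: Any): Boolean = other match {
    case p: SensorRangePartitioner =>
      p.numParts == numParts && p.sensorsPerPartition == sensorsPerPartition
    case _ => false
  }

  override def hashCode: Int = 31 * numParts + sensorsPerPartition
}

object StatePartitioning {
  // Assumed state: the most recent measure seen for each sensor.
  val keepLatest: (Seq[Double], Option[Double]) => Option[Double] =
    (values, state) => values.lastOption.orElse(state)

  // Uses the same partitioner for the state stream in every batch.
  def trackLatest(keyed: DStream[(Int, Double)]): DStream[(Int, Double)] =
    keyed.updateStateByKey[Double](keepLatest, new SensorRangePartitioner(10, 1000))
}
```

With the same partitioner reused every batch, the state RDDs keep the same key-to-partition mapping, which is what lets Spark avoid reshuffling the existing state. The new batch's records still have to be moved to the matching partitions.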
xj @ Tokyo

On Wed, Sep 10, 2014 at 12:03 PM, qihong <qc...@pivotal.io> wrote:

> I'm working on a DStream application. The input is sensor measurements;
> the data format is <sensor id><timestamp><measure>.
>
> There are 10 thousand sensors, and updateStateByKey is used to maintain
> the states of the sensors. The code looks like the following:
>
> val inputDStream = ...
> val keyedDStream = inputDStream.map(...) // use sensorId as key
> val stateDStream = keyedDStream.updateStateByKey[...](updateFunction)
>
> Here's the question:
> In a cluster with 10 worker nodes, is it possible to partition the input
> DStream so that node 1 handles sensors 0-999, node 2 handles 1000-1999,
> and so on?
>
> Also, is it possible to keep the state stream for sensors 0-999 on node 1,
> 1000-1999 on node 2, and so on? Right now, I see the sensor state stream is
> shuffled for every batch, which uses a lot of network bandwidth and is
> unnecessary.
>
> Any suggestions?
>
> Thanks!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-setup-steady-state-stream-partitions-tp13850.html