Another option that avoids a shuffle is to use assign together with coalesce, running two separate streams.
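For reference, the Kafka source's assign option takes a JSON string mapping each topic to the list of partition numbers that query should consume. A minimal Scala sketch building that string (the assignJson helper and the t0/t1 topic names are illustrative, not part of any Spark API):

```scala
// Hypothetical helper: build the JSON value for the Kafka source's
// "assign" option, mapping each topic to the partitions this stream owns.
def assignJson(topicPartitions: Map[String, Seq[Int]]): String =
  topicPartitions
    .map { case (topic, parts) => s""""$topic": [${parts.mkString(", ")}]""" }
    .mkString("{", ", ", "}")

// Stream 1 owns partition 0 of both topics; stream 2 owns partition 1.
val stream1 = assignJson(Map("t0" -> Seq(0), "t1" -> Seq(0)))
val stream2 = assignJson(Map("t0" -> Seq(1), "t1" -> Seq(1)))
```

Each resulting string can then be passed to .option("assign", ...) on its own readStream, so the two queries never share partitions.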
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("assign", """{"t0": [0], "t1": [0]}""")
  .load()
  .coalesce(1)
  .writeStream
  .foreach(... code to write to cassandra ...)
  .start()

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("assign", """{"t0": [1], "t1": [1]}""")
  .load()
  .coalesce(1)
  .writeStream
  .foreach(... code to write to cassandra ...)
  .start()

On Fri, Mar 17, 2017 at 7:35 AM, OUASSAIDI, Sami <sami.ouassa...@mind7.fr> wrote:
> @Cody : Duly noted.
> @Michael Armbrust : A repartition is out of the question for our project as
> it would be a fairly expensive operation. We tried looking into targeting a
> specific executor so as to avoid this extra cost and directly have
> well-partitioned data after consuming the Kafka topics. We are also using
> Spark Streaming to save to the Cassandra DB and try to keep shuffle
> operations to a strict minimum (at best none). As of now we are not entirely
> pleased with our current performance, which is why I'm doing a Kafka topic
> sharding POC, and getting the executor to handle the specified partitions is
> central.
>
> 2017-03-17 9:14 GMT+01:00 Michael Armbrust <mich...@databricks.com>:
>
>> Sorry, typo. Should be a repartition, not a groupBy.
>>
>>> spark.readStream
>>>   .format("kafka")
>>>   .option("kafka.bootstrap.servers", "...")
>>>   .option("subscribe", "t0,t1")
>>>   .load()
>>>   .repartition($"partition")
>>>   .writeStream
>>>   .foreach(... code to write to cassandra ...)
>>>
>>
>
> --
> *Mind7 Consulting*
>
> Sami Ouassaid | Consultant Big Data | sami.ouassa...@mind7.com
> __
>
> 64 Rue Taitbout, 75009 Paris