@Cody : Duly noted.
@Michael Ambrust : A repartition is out of the question for our project as
it would be a fairly expensive operation. We tried looking into targeting a
specific executor so as to avoid this extra cost and directly have well
partitioned data after consuming the kafka topics. Also we are using Spark
streaming to save to the cassandra DB and try to keep shuffle operations to
a strict minimum (at best none). As of now we are not entirely pleased with
our current performances, that's why I'm doing a kafka topic sharding POC
and getting the executor to handle the specificied partitions is central.
ᐧ

2017-03-17 9:14 GMT+01:00 Michael Armbrust <[email protected]>:

> Sorry, typo.  Should be a repartition not a groupBy.
>
>
>> spark.readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", "...")
>>   .option("subscribe", "t0,t1")
>>   .load()
>>   .repartition($"partition")
>>   .writeStream
>>   .foreach(... code to write to cassandra ...)
>>
>


-- 
*Mind7 Consulting*

Sami Ouassaid | Consultant Big Data | [email protected]
__

64 Rue Taitbout, 75009 Paris

Reply via email to