Hi,

   - Do you have enough CPU cores allocated to handle the increased number
   of partitions? Generally, having Kafka partitions >= (greater than or
   equal to) total CPU cores (number of executor instances * cores per
   executor) gives increased task parallelism in the reader phase.
   - However, if you have too many partitions but not enough cores, it will
   eventually slow down the reader (e.g. 100 partitions and only 20 total
   cores).
   - Additionally, the next set of transformations will have their own
   partitions. If they involve a shuffle, spark.sql.shuffle.partitions then
   defines the next level of parallelism. If you do not have any data skew,
   you should get good performance.
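
To make the partitions-versus-cores point concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark needed). It assumes one task per Kafka partition and one core per task, and it ignores scheduling overhead and data skew. The function names and the 5 executors x 4 cores sizing are hypothetical, chosen only to match the 20-core example above.

```python
import math

def total_cores(executor_instances: int, cores_per_executor: int) -> int:
    """Total cores available to the job (assumes one core per task)."""
    return executor_instances * cores_per_executor

def task_waves(partitions: int, cores: int) -> int:
    """Number of sequential 'waves' needed to run one task per partition."""
    return math.ceil(partitions / cores)

# Hypothetical sizing: 5 executors with 4 cores each gives 20 total cores.
cores = total_cores(executor_instances=5, cores_per_executor=4)

for partitions in (8, 20, 100):
    busy = min(partitions, cores)          # cores doing work in the first wave
    waves = task_waves(partitions, cores)  # how many rounds of tasks are needed
    print(f"{partitions} partitions on {cores} cores: "
          f"{busy} cores busy, {waves} wave(s)")
```

Under these assumptions, 8 partitions leave 12 of the 20 cores idle during the read, while 100 partitions need 5 waves of tasks per batch, so each batch's read phase takes several task rounds rather than one.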


Regards,
Shahbaz

On Wed, Nov 7, 2018 at 12:58 PM JF Chen <darou...@gmail.com> wrote:

> I have a Spark Streaming application which reads data from Kafka and saves
> the transformation result to HDFS.
> The original partition count of my Kafka topic is 8, and I repartition the
> data to 100 to increase the parallelism of the Spark job.
> Now I am wondering: if I increase the Kafka partition count to 100 instead
> of repartitioning to 100, will the performance improve? (I know the
> repartition action costs a lot of CPU resources.)
> If I set the Kafka partition count to 100, does it have any negative effect
> on efficiency?
> I have only one production environment, so it's not convenient for me to
> test this.
>
> Thanks!
>
> Regards,
> Junfeng Chen
>
