Without your code this is hard to determine, but a few notes. The number of partitions is usually dictated by the input source; check whether it has any configuration that lets you increase the number of input splits.
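For example, with the Kafka 0.10 direct stream each batch RDD gets one partition per Kafka topic partition, so the knob is on the source side (the topic's partition count), not in Spark. A minimal sketch, purely for illustration since your actual source isn't shown -- the broker address, topic name, and group id are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object PartitionCheck {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("partition-check")
        val ssc = new StreamingContext(conf, Seconds(5))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "broker:9092",   // placeholder
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "partition-check"          // placeholder
        )

        // With the direct stream, each batch RDD has one partition per
        // Kafka topic partition, so parallelism is controlled by the
        // source (topic partition count), not inside Spark.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent,
          Subscribe[String, String](Seq("events"), kafkaParams))

        stream.foreachRDD { rdd =>
          println(s"partitions in this batch: ${rdd.getNumPartitions}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }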
I'm not sure why you think some of the code is running on the driver. Operations on DataFrames and RDDs execute on the executors; foreachPartition in particular does not run locally on the driver. The difference in partition counts is probably the shuffle you added with repartition(), and I would actually not be surprised if your job ran faster without the repartitioning. But again, without the real code it is very hard to say. (A small sketch of both points follows the quoted message below.)

On Mon, Jul 20, 2020, 6:33 AM forece85 <forec...@gmail.com> wrote:

> I am new to Spark Streaming and trying to understand the Spark UI and do
> optimizations.
>
> 1. Processing at the executors took less time than at the driver. How do I
> optimize to make the driver tasks fast?
> 2. We are using dstream.repartition(defaultParallelism*3) to increase
> parallelism, which is causing heavy shuffles. Is there any option to avoid
> repartitioning manually, to reduce data shuffles?
> 3. Also trying to understand how 6 tasks in stage 1 and 199 tasks in
> stage 2 got created.
>
> *Hardware configuration:* executor-cores: 3; driver-cores: 3;
> dynamicAllocation is true; initial, min, maxExecutors: 25
>
> StackOverflow link for screenshots:
> https://stackoverflow.com/questions/62993030/spark-dstream-help-needed-to-understand-ui-and-how-to-set-parallelism-or-defau
>
> Thanks in advance
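To see both points for yourself: log where foreachPartition actually runs, and comment the repartition() in and out to compare batch times. This is a minimal, hypothetical pipeline, not your job -- the socket source, host, and port are placeholders standing in for your real input:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("repartition-sketch")
        val ssc = new StreamingContext(conf, Seconds(5))
        val parallelism = ssc.sparkContext.defaultParallelism

        val lines = ssc.socketTextStream("localhost", 9999) // placeholder source

        lines
          // .repartition(parallelism * 3)  // forces a full shuffle; measure before keeping it
          .foreachRDD { rdd =>
            // This outer closure runs on the driver once per batch...
            rdd.foreachPartition { records =>
              // ...but this inner closure is serialized and shipped to the
              // executors, running once per partition wherever it lives.
              val host = java.net.InetAddress.getLocalHost.getHostName
              println(s"processing a partition on $host: ${records.size} records")
            }
          }

        ssc.start()
        ssc.awaitTermination()
      }
    }

If the executor logs show the partition messages on the worker hosts rather than the driver, the work is not local; and if batch times drop with the repartition line commented out, the shuffle was costing more than the extra parallelism bought you.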