Without seeing the rest (and you can confirm this by looking at the DAG visualization in the Spark UI), I would say your first stage with 6 partitions is:
Stage 1: Read data from Kinesis (or read blocks from the receiver; not sure which method you are using) and write shuffle files for the repartition.
Stage 2: Read the shuffle files and do everything else.

In general, I think the biggest issue here is probably not the distribution of tasks, which based on your UI reading were quite small, but the parallelization of the write operation, since it is done synchronously. Instead of trying to increase parallelism by repartitioning, I would suggest having "doJob" run asynchronously, allowing more parallelism within an executor rather than relying on multiple executor threads/JVMs. That said, given the speed of the second stage, you would probably run faster if you simply skipped the repartition.

On Mon, Jul 20, 2020 at 8:23 AM forece85 <forec...@gmail.com> wrote:

> Thanks for the reply. Please find the pseudocode below. It's DStreams
> reading every 10 seconds from a Kinesis stream and, after transformations,
> pushing into HBase. Once we have the DStream, we use the code below to
> repartition and do the processing:
>
> dStream = dStream.repartition(javaSparkContext.defaultMinPartitions() * 3);
> dStream.foreachRDD(javaRDD -> javaRDD.foreachPartition(partitionOfRecords -> {
>     Connection hbaseConnection = ConnectionUtil.getHbaseConnection();
>     List<byte[]> listOfRecords = new ArrayList<>();
>     while (partitionOfRecords.hasNext()) {
>         listOfRecords.add(partitionOfRecords.next());
>
>         if (listOfRecords.size() < 10 && partitionOfRecords.hasNext())
>             continue;
>
>         List<byte[]> finalListOfRecords = listOfRecords;
>         doJob(finalListOfRecords, hbaseConnection);
>         listOfRecords = new ArrayList<>();
>     }
> }));
>
> We batch every 10 records and pass them to the doJob method, where we
> batch process and bulk insert into HBase.
>
> With the above code, is it possible to tell what is happening at job 1 -> 6
> tasks? And how can the repartition method be replaced efficiently?
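The asynchronous "doJob" idea suggested above could be sketched roughly as follows. This is a minimal, standalone illustration, not the poster's actual code: `doJob` is a stand-in for the HBase bulk insert (the connection handling is omitted), and the thread-pool size and batch size of 10 are assumptions carried over from the quoted snippet. The key point is that each batch is submitted to a pool instead of being written synchronously, and the task waits for all in-flight writes before finishing the partition.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncPartitionWriter {
    // Counts records "written" so the sketch is checkable; the real code
    // would bulk insert into HBase instead.
    static final AtomicInteger written = new AtomicInteger();

    // Hypothetical stand-in for the poster's doJob(batch, hbaseConnection).
    static void doJob(List<byte[]> batch) {
        written.addAndGet(batch.size());
    }

    // Body of what would run inside foreachPartition: batch 10 records at a
    // time, but hand each batch to the pool so writes overlap instead of
    // running one after another.
    static void processPartition(Iterator<byte[]> partitionOfRecords,
                                 ExecutorService pool) {
        List<CompletableFuture<Void>> inFlight = new ArrayList<>();
        List<byte[]> batch = new ArrayList<>();
        while (partitionOfRecords.hasNext()) {
            batch.add(partitionOfRecords.next());
            if (batch.size() < 10 && partitionOfRecords.hasNext())
                continue;
            final List<byte[]> finalBatch = batch;
            inFlight.add(CompletableFuture.runAsync(() -> doJob(finalBatch), pool));
            batch = new ArrayList<>();
        }
        // Block until every async write for this partition has completed,
        // so the Spark task does not report success with writes still pending.
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < 25; i++)
            records.add(new byte[]{(byte) i});
        processPartition(records.iterator(), pool);
        pool.shutdown();
        System.out.println(written.get()); // 25: three batches of 10, 10, 5
    }
}
```

With 25 records the loop submits batches of 10, 10, and 5, and the final `join()` guarantees all three complete before the partition handler returns. In the real job you would size the pool to what the HBase cluster can absorb; unbounded async submission can overwhelm the sink just as easily as repartitioning can overwhelm the shuffle.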
> Thanks in advance
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/