Without seeing the rest (you can confirm this by looking at the DAG
visualization in the Spark UI), I would guess your first stage with 6
partitions is:

Stage 1: Read data from Kinesis (or read blocks from the receiver; I'm not
sure which method you are using) and write shuffle files for the repartition
Stage 2: Read the shuffle files and do everything else

In general, I think the biggest issue here is probably not the distribution
of tasks, which based on your UI reading were quite small, but the
parallelization of the write operation, since it is done synchronously.
Rather than trying to increase your parallelism by repartitioning, I would
suggest having "doJob" run asynchronously, allowing more parallelism within
an executor rather than relying on multiple executor threads/JVMs.

That said, given the speed of the second stage, you would probably run
faster if you simply skipped the repartition.
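To illustrate the async idea: a minimal sketch of what "doJob run
asynchronously" could look like inside foreachPartition, using a small
thread pool and CompletableFuture so batches of 10 are written in parallel
while the iterator keeps being drained. The doJob method here is a
hypothetical stand-in for your HBase bulk insert, and the driver/partition
wiring is simulated with a plain list; pool size and batch size are
assumptions to tune.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncBatchSketch {

    // Hypothetical stand-in for your doJob(batch, hbaseConnection):
    // in your job this would be the bulk insert into HBase.
    static void doJob(List<byte[]> batch) {
        // write batch ...
    }

    public static void main(String[] args) {
        // Simulated partition contents: 25 records -> batches of up to 10.
        List<byte[]> partitionOfRecords = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            partitionOfRecords.add(new byte[] { (byte) i });
        }

        // Small per-task pool; size is a tuning assumption.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<CompletableFuture<Void>> inFlight = new ArrayList<>();

        List<byte[]> batch = new ArrayList<>();
        for (byte[] record : partitionOfRecords) {
            batch.add(record);
            if (batch.size() == 10) {
                final List<byte[]> toWrite = batch;
                // Submit the full batch without blocking the iterator.
                inFlight.add(CompletableFuture.runAsync(() -> doJob(toWrite), pool));
                batch = new ArrayList<>();
            }
        }
        if (!batch.isEmpty()) {
            final List<byte[]> toWrite = batch;
            inFlight.add(CompletableFuture.runAsync(() -> doJob(toWrite), pool));
        }

        // Block until all batches land before the partition task finishes,
        // so a write failure still fails the Spark task.
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
        pool.shutdown();
        System.out.println("batches=" + inFlight.size());
    }
}
```

The key point is the final allOf(...).join(): you still wait for every
write before the partition closure returns, you just overlap the writes
with each other instead of doing them one batch at a time.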

On Mon, Jul 20, 2020 at 8:23 AM forece85 <forec...@gmail.com> wrote:

> Thanks for the reply. Please find pseudocode below. It's DStreams reading
> every 10 seconds from a Kinesis stream and, after transformations, pushing
> into HBase. Once we get the DStream, we use the code below to repartition
> and do the processing:
>
> dStream = dStream.repartition(javaSparkContext.defaultMinPartitions() * 3);
> dStream.foreachRDD(javaRDD -> javaRDD.foreachPartition(partitionOfRecords -> {
>     Connection hbaseConnection = ConnectionUtil.getHbaseConnection();
>     List<byte[]> listOfRecords = new ArrayList<>();
>     while (partitionOfRecords.hasNext()) {
>         listOfRecords.add(partitionOfRecords.next());
>
>         // keep accumulating until we have a batch of 10 (or the partition ends)
>         if (listOfRecords.size() < 10 && partitionOfRecords.hasNext())
>             continue;
>
>         doJob(listOfRecords, hbaseConnection);
>         listOfRecords = new ArrayList<>();
>     }
> }));
>
>
> We batch every 10 records and pass them to the doJob method, where we
> batch-process and bulk insert into HBase.
>
> With the above code, can you tell what is happening at job 1 -> 6
> tasks? And how can we replace the repartition call more efficiently?
>
> Thanks in Advance
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
