Re: Spark Streaming - Set Parallelism and Optimize driver

forece85 Mon, 20 Jul 2020 06:21:04 -0700

Thanks for the reply. Please find the pseudocode below. We fetch DStreams from
a Kinesis stream every 10 seconds, perform transformations, and finally
persist the results to HBase tables using batch insertions.


dStream = dStream.repartition(jssc.defaultMinPartitions() * 3);
dStream.foreachRDD(javaRDD -> javaRDD.foreachPartition(partitionOfRecords -> {
    // Obtained once per partition.
    Connection hbaseConnection = ConnectionUtil.getHbaseConnection();
    List<byte[]> listOfRecords = new ArrayList<>();
    while (partitionOfRecords.hasNext()) {
        try {
            listOfRecords.add(partitionOfRecords.next());

            // Keep accumulating until there are 10 records or the partition is exhausted.
            if (listOfRecords.size() < 10 && partitionOfRecords.hasNext())
                continue;

            List<byte[]> finalListOfRecords = listOfRecords;
            // primaryConnection and lookupsConnection are not defined in this snippet.
            doJob(finalListOfRecords, primaryConnection, lookupsConnection);
            listOfRecords = new ArrayList<>();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}));

We batch the records in groups of 10 and pass each group to the doJob method,
where the actual transformations happen. Each group is then batch-inserted
into the HBase table.
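
For reference, the batching step on its own can be sketched without Spark or
HBase, roughly as follows. This is only a minimal sketch: BatchDrainer,
drainInBatches, and the handler parameter are hypothetical names standing in
for the loop and the doJob call above. One assumption differs from the snippet
above: here a new list is started even when the handler throws, so a failed
batch is dropped rather than carried into the next call. Whether to drop or
retry is a design choice this sketch does not settle.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper; not part of Spark or HBase.
final class BatchDrainer {
    private BatchDrainer() {}

    /**
     * Drains the iterator and passes the records to the handler in lists of
     * at most batchSize elements. Returns the number of batches handed over.
     */
    static <T> int drainInBatches(Iterator<T> records, int batchSize,
                                  Consumer<List<T>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (records.hasNext()) {
            batch.add(records.next());

            // Keep filling until the batch is full or the input is exhausted.
            if (batch.size() < batchSize && records.hasNext()) {
                continue;
            }

            try {
                handler.accept(batch); // stands in for doJob(...)
            } catch (RuntimeException e) {
                // Assumption: log and drop the failed batch.
                e.printStackTrace();
            }
            batches++;
            // Always start a fresh list, even after a failure.
            batch = new ArrayList<>(batchSize);
        }
        return batches;
    }
}
```

With 25 input records and a batch size of 10, the handler would be called
three times, with 10, 10, and 5 records.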

With the above code, can we guess what is happening in Job 1 (6 tasks) and how
to reduce its run time?
Mainly, how can we set parallelism effectively without using the repartition()
method?

Thanks in advance.


