I am not sure about Redshift, but I know the target table is not partitioned. 
Still, we should be able to insert into a non-partitioned remote table from 12 
clients concurrently, right?


Even if Redshift doesn't allow concurrent writes, would the Spark driver detect 
this and coordinate all the tasks and executors, as I observed?


Yong

________________________________
From: Jörn Franke <jornfra...@gmail.com>
Sent: Friday, May 25, 2018 10:50 AM
To: Yong Zhang
Cc: user@spark.apache.org
Subject: Re: Why Spark JDBC Writing in a sequential order

Can your database receive the writes concurrently? I.e., do you make sure that 
each executor writes into a different partition on the database side?

On 25. May 2018, at 16:42, Yong Zhang 
<java8...@hotmail.com> wrote:


Spark version 2.2.0


We are trying to write a DataFrame to a remote relational database (AWS 
Redshift). Based on the Spark JDBC documentation, we repartition our DataFrame 
into 12 partitions and set the JDBC "numPartitions" option to 12 so that the 
writes go out over 12 concurrent connections.


We run the command as follows:

dataframe.repartition(12)
  .write
  .mode("overwrite")
  .option("batchsize", 5000)
  .option("numPartitions", 12)
  .jdbc(url=jdbcurl, table="tableName", connectionProperties=connectionProps)


Here is the Spark UI:

<Screen Shot 2018-05-25 at 10.21.50 AM.png>


We found that the 12 tasks are obviously running in sequential order. They are 
all in "RUNNING" status from the beginning at the same time, but if we check 
their "Duration" and "Shuffle Read Size/Records", it is clear that they run one 
by one.

For example, task 8 finished first in about 2 hours and wrote 34732 records to 
the remote DB (I know the speed looks terrible, but that's not the question of 
this post), and task 0 started after task 8 and took 4 hours (the first 2 hours 
were spent waiting for task 8).

In this picture, only tasks 2 and 4 are in the running stage, but task 4 is 
obviously waiting for task 2 to finish before it starts writing.


My question is: in the above Spark command, my understanding is that the 12 
executors should open their JDBC connections to the remote DB concurrently, all 
12 tasks should also write concurrently, and the whole job should finish in 
around 2 hours overall.
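For reference, my mental model of what each write task does on its executor is 
roughly the following (a simplified sketch modelled on Spark's JDBC writer, not 
its actual source; jdbcurl and connectionProps are the same names as in the 
command above). Each task opens its own connection and writes only its own 
partition in batches, so nothing in Spark itself should force the tasks to take 
turns:

  import java.sql.{Connection, DriverManager, PreparedStatement}
  import org.apache.spark.sql.Row

  // Called once per partition, on the executor that owns that partition.
  def savePartition(rows: Iterator[Row], insertSql: String, batchSize: Int): Unit = {
    val conn: Connection = DriverManager.getConnection(jdbcurl, connectionProps)
    conn.setAutoCommit(false)
    val stmt: PreparedStatement = conn.prepareStatement(insertSql)
    try {
      var rowCount = 0
      rows.foreach { row =>
        for (i <- 0 until row.length) stmt.setObject(i + 1, row.get(i))
        stmt.addBatch()
        rowCount += 1
        if (rowCount % batchSize == 0) stmt.executeBatch()  // flush every "batchsize" rows
      }
      if (rowCount % batchSize != 0) stmt.executeBatch()    // flush the remainder
      conn.commit()
    } finally {
      stmt.close()
      conn.close()
    }
  }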


Why are the 12 tasks all in the "RUNNING" stage, yet apparently waiting for 
something and only able to write to the remote DB sequentially? The 12 
executors are in different JVMs on different physical nodes. Why is this 
happening? What stops Spark from pushing the data truly concurrently?
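To isolate the problem, I could also run a standalone test outside Spark to see 
whether the database itself serializes concurrent INSERTs into one table (for 
example through a table-level lock). This is only a sketch: the column names 
id/payload and the row counts are made up, and jdbcurl/connectionProps stand 
for the same connection settings as above:

  import java.sql.DriverManager
  import java.util.concurrent.{Executors, TimeUnit}

  // 12 plain JDBC clients, each inserting its own rows into the same table.
  val pool = Executors.newFixedThreadPool(12)
  (0 until 12).foreach { i =>
    pool.submit(new Runnable {
      override def run(): Unit = {
        val conn = DriverManager.getConnection(jdbcurl, connectionProps)
        val stmt = conn.prepareStatement("INSERT INTO tableName (id, payload) VALUES (?, ?)")
        val start = System.currentTimeMillis()
        for (n <- 0 until 10000) {
          stmt.setInt(1, i * 10000 + n)
          stmt.setString(2, s"thread-$i")
          stmt.addBatch()
          if (n % 5000 == 4999) stmt.executeBatch()
        }
        stmt.executeBatch()
        conn.close()
        println(s"thread $i finished in ${System.currentTimeMillis() - start} ms")
      }
    })
  }
  pool.shutdown()
  pool.awaitTermination(1, TimeUnit.HOURS)

If each thread takes about as long as a single client would on its own, the 
database accepts the writes concurrently; if the total time grows roughly 
linearly with the number of threads, the writes are being serialized on the 
database side rather than by Spark.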


Thanks


Yong
