How many partitions are in your data set?
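
If you're not sure, a quick way to check (a rough Scala sketch, assuming your DataFrame is called df):

    // Each partition becomes one write task, and for a JDBC write one connection.
    val numPartitions = df.rdd.partitions.length
    println("Writing from " + numPartitions + " partitions")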

Per the Spark DataFrameWriter Java Doc:
"Saves the content of the DataFrame
(https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html)
to an external database table via JDBC. In the case the table already exists in
the external database, behavior of this function depends on the save mode,
specified by the mode function (default to throwing an exception).

Don't create too many partitions in parallel on a large cluster; otherwise
Spark might crash your external database systems."

This implies one connection per partition, all writing in parallel, so you could
be swamping your database.
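
If the partition count is high, one option is to coalesce before the write so fewer
connections hit the database at once. A rough Scala sketch (the URL, table name,
credentials and the number 16 below are placeholders, not taken from your setup):

    import java.util.Properties

    val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb"   // placeholder URL
    val props = new Properties()
    props.setProperty("user", "dbUser")                  // placeholder
    props.setProperty("password", "dbPassword")          // placeholder

    df.coalesce(16)        // fewer partitions means fewer concurrent JDBC connections
      .write
      .mode("append")      // or whichever save mode you need
      .jdbc(jdbcUrl, "target_table", props)

Coalescing also reduces write parallelism, so it's a trade-off to tune against what
your database can actually absorb.
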
Which database are you using? 

Also, how many network hops are there?
Network latency could be impacting performance as well…

> On Apr 19, 2016, at 3:14 PM, Jonathan Gray <jonny.g...@gmail.com> wrote:
> 
> Hi,
> 
> I'm trying to write ~60 million rows from a DataFrame to a database via JDBC 
> on Spark 1.6.1, something similar to df.write().jdbc(...)
> 
> The write seems to not be performing well.  Profiling the application with a 
> master of local[*] it appears there is not much socket write activity and 
> also not much CPU.
> 
> I would expect there to be an almost continuous block of socket write 
> activity showing up somewhere in the profile.
> 
> I can see that the top hot method involves 
> apache.spark.unsafe.platform.CopyMemory all from calls within 
> JdbcUtils.savePartition(...).  However, the CPU doesn't seem particularly 
> stressed so I'm guessing this isn't the cause of the problem.
> 
> Are there any best practices, or has anyone come across a case like this before 
> where a write to a database seems to perform poorly?
> 
> Thanks,
> Jon
