I think I now understand what the problem is, and it is, in some ways, to
do with partitions and, in other ways, to do with memory.

I now think that the database write was not the source of the problem (the
problem being end-to-end performance).

The application reads rows from a database, does some simple derivations,
then a group-by and sum, joins the results back onto the original
DataFrame, and finally writes back to a database (at the same granularity
as the read).

I would run for 100,000 rows locally and end-to-end performance was
reasonable; if I went up to 1,000,000 rows, performance was far worse than
the increase in volume would suggest.  Profiling seemed to indicate that
the read and write were still happening efficiently in both cases but that
nothing much was happening in between.  Looking at the profiler during this
period indicated that most of the time was being spent in
Platform.copyMemory(...).  As a result I increased the heap and driver
memory and the performance improved dramatically.
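
Roughly speaking, the knobs involved look like this (the sizes below are
illustrative, not the exact values I ended up with):

// Illustrative only.  Note that for a local[*] run the driver and executors
// share one JVM, so spark.driver.memory has to be supplied before that JVM
// starts, e.g. via
//   spark-submit --driver-memory 8g --class MyApp my-app.jar
// Setting it on SparkConf after the fact has no effect for the driver.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("jdbc-write-example")
  .set("spark.executor.memory", "8g")   // only meaningful on a real cluster
val sc = new SparkContext(conf)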

It turns out that the groupBy key results in a bad distribution where 90%
of the data ends up in one partition, and at the point of write-out I'm
guessing that is when Spark decides it has to copy all of that data onto
one partition/worker.  Increasing the memory available appears to give it
the ability to do that.  The app has several cache() calls in it to get
around a codegen issue, which probably puts more pressure on memory usage
and compounds the problem.
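
To illustrate (column names and the salt width below are made up, not the
real schema), one way to see the skew, and a common way to work around a hot
key by salting it, looks roughly like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{floor, rand, sum}

// "derived" is assumed to have the grouping column "key" and a numeric
// "derived" column, as in the sketch further up.
def showSkewAndSaltedSum(derived: DataFrame): DataFrame = {
  // How evenly are rows spread across partitions after the groupBy shuffle?
  derived.groupBy("key").sum("derived").rdd
    .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
    .collect()
    .foreach { case (idx, n) => println(s"partition $idx -> $n rows") }

  // Mitigation for a hot key: add a random salt so the first aggregation is
  // spread over several partitions, then re-aggregate without the salt.
  val salted = derived.withColumn("salt", floor(rand() * 8))
  val partial = salted.groupBy("key", "salt").agg(sum("derived").as("partial_sum"))
  partial.groupBy("key").agg(sum("partial_sum").as("total"))
}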

Still, I have learnt that Spark is excellent in its ability to be lazy and
optimize across the whole end-to-end process from read to write.

On 21 April 2016 at 13:46, Michael Segel <msegel_had...@hotmail.com> wrote:

> How many partitions are in your data set?
>
> Per the Spark DataFrameWriter Java Doc:
> “
>
> Saves the content of the DataFrame
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html>
> to an external database table via JDBC. In the case the table already exists
> in the external database, behavior of this function depends on the save
> mode, specified by the mode function (default to throwing an exception).
>
> Don't create too many partitions in parallel on a large cluster; otherwise
> Spark might crash your external database systems.
>
> “
>
> This implies one connection per partition writing in parallel. So you
> could be swamping your database.
> Which database are you using?
>
> Also, how many hops?
> Network latency could impact performance too…
>
> On Apr 19, 2016, at 3:14 PM, Jonathan Gray <jonny.g...@gmail.com> wrote:
>
> Hi,
>
> I'm trying to write ~60 million rows from a DataFrame to a database using
> JDBC using Spark 1.6.1, something similar to df.write().jdbc(...)
>
> The write does not seem to be performing well.  Profiling the application with
> a master of local[*], it appears there is not much socket write activity and
> also not much CPU usage.
>
> I would expect there to be an almost continuous block of socket write
> activity showing up somewhere in the profile.
>
> I can see that the top hot method involves
> org.apache.spark.unsafe.Platform.copyMemory, all from calls within
> JdbcUtils.savePartition(...).  However, the CPU doesn't seem particularly
> stressed, so I'm guessing this isn't the cause of the problem.
>
> Are there any best practices, or has anyone come across a case like this
> before, where a write to a database seems to perform poorly?
>
> Thanks,
> Jon
>
>
>
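
P.S. Regarding the point above about one JDBC connection per partition: if
the write-side parallelism ever does become the issue, a coalesce() before
the write is one way to cap the number of concurrent connections.  The
partition count and table name here are just placeholders:

import java.util.Properties
import org.apache.spark.sql.DataFrame

// coalesce() reduces the number of output partitions (and hence concurrent
// JDBC connections) without forcing a full shuffle.
def writeWithCappedConnections(df: DataFrame, url: String, props: Properties): Unit = {
  df.coalesce(16).write.jdbc(url, "target_table", props)
}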
