I would be surprised if Oracle cannot handle million row calculations, unless you are also using other data in Spark.
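If the heavy lifting really is just the group by and sum, you may be able to push that straight down to Oracle by handing a subquery to the JDBC reader, so Spark only ever sees the already-aggregated rows. A rough sketch in Scala, assuming an existing sqlContext and with made-up table, column and connection details:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "app_user")          // placeholder credentials
    props.setProperty("password", "app_password")

    // Oracle runs the GROUP BY; Spark reads back only the aggregated rows.
    val pushdownQuery =
      "(SELECT account_id, SUM(amount) AS total_amount FROM transactions GROUP BY account_id) agg"

    val aggregated = sqlContext.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1521/ORCL", pushdownQuery, props)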
HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com


On 21 April 2016 at 22:22, Jonathan Gray <jonny.g...@gmail.com> wrote:

> I think I now understand what the problem is and it is, in some ways, to
> do with partitions and, in other ways, to do with memory.
>
> I now think that the database write was not the source of the problem
> (the problem being end-to-end performance).
>
> The application reads rows from a database, does some simple derivations,
> then a group by and sum, joins the results back onto the original
> dataframe and finally writes back to a database (at the same granularity
> as the read).
>
> I would run for 100,000 rows locally and end-to-end performance was
> reasonable; if I went up to 1,000,000 rows, performance was far worse
> than the increase in volume would suggest. Profiling seemed to indicate
> that the read and write were still happening efficiently in both cases
> but nothing much was happening in between. Looking at the profiler during
> this period indicated that most of the time was being spent in
> Platform.copyMemory(...). As a result I increased the heap and driver
> memory and the performance improved dramatically.
>
> It turns out that the groupBy key results in a bad distribution where 90%
> of the data ends up in one partition, and at the point of write out I'm
> guessing that is when Spark decides it has to copy all of that data onto
> one partition/worker. Increasing the memory available appears to give it
> the ability to do that. The app has several cache() calls in it to get
> around a codegen issue, which probably puts more pressure on memory usage
> and compounds the problem.
>
> Still, I have learnt that Spark is excellent in its ability to be lazy
> and optimize across the whole end-to-end process from read to write.
>
> On 21 April 2016 at 13:46, Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
>> How many partitions are in your data set?
>>
>> Per the Spark DataFrameWriter Java doc:
>> “
>> Saves the content of the DataFrame
>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html>
>> to an external database table via JDBC. In the case the table already
>> exists in the external database, behavior of this function depends on
>> the save mode, specified by the mode function (default to throwing an
>> exception).
>>
>> Don't create too many partitions in parallel on a large cluster;
>> otherwise Spark might crash your external database systems.
>> “
>>
>> This implies one connection per partition writing in parallel, so you
>> could be swamping your database.
>> Which database are you using?
>>
>> Also, how many hops?
>> Network latency could impact performance too…
>>
>> On Apr 19, 2016, at 3:14 PM, Jonathan Gray <jonny.g...@gmail.com> wrote:
>>
>> Hi,
>>
>> I'm trying to write ~60 million rows from a DataFrame to a database
>> using JDBC with Spark 1.6.1, something similar to df.write().jdbc(...)
>>
>> The write does not seem to be performing well. Profiling the application
>> with a master of local[*], it appears there is not much socket write
>> activity and also not much CPU.
>>
>> I would expect there to be an almost continuous block of socket write
>> activity showing up somewhere in the profile.
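On Michael's point further up about one connection per partition: it usually helps to coalesce just before the write so the number of concurrent JDBC connections is something the database can absorb, rather than whatever the upstream shuffle happened to produce. A rough sketch, where df stands for the DataFrame being written and the partition count, URL and table name are only placeholders:

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val props = new Properties()
    props.setProperty("user", "app_user")          // placeholder credentials
    props.setProperty("password", "app_password")

    // Each partition opens its own JDBC connection, so cap the parallelism explicitly.
    df.coalesce(16)                                // example value; tune for what the database can take
      .write
      .mode(SaveMode.Append)
      .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "target_table", props)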
>>
>> I can see that the top hot method involves
>> apache.spark.unsafe.platform.CopyMemory, all from calls within
>> JdbcUtils.savePartition(...). However, the CPU doesn't seem particularly
>> stressed, so I'm guessing this isn't the cause of the problem.
>>
>> Are there any best practices, or has anyone come across a case like this
>> before where a write to a database seems to perform poorly?
>>
>> Thanks,
>> Jon
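Regarding the skewed groupBy key mentioned higher up the thread, one quick check before reaching for more memory is to count rows per partition and confirm that one partition really does hold most of the data. A rough sketch, where grouped stands in for the DataFrame coming out of the groupBy/sum and join:

    // Count rows in each partition; this consumes each partition's iterator once.
    val partitionSizes = grouped.rdd
      .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
      .collect()

    partitionSizes.sortBy(p => -p._2).take(10).foreach { case (idx, n) =>
      println(s"partition $idx: $n rows")
    }

If one partition dominates, repartitioning on a better-distributed key (or salting the groupBy key) before the join tends to help more than simply giving the driver and executors more memory.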