I would be surprised if Oracle cannot handle million-row calculations,
unless you are also using other data in Spark.
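
If the derivation and aggregation can be expressed in SQL, one option is to
push the whole calculation down to Oracle by passing a sub-query as the
dbtable, so Spark only reads back the aggregated result. A rough sketch
(table, column and connection details are all placeholders):

  import java.util.Properties

  val props = new Properties()
  props.setProperty("user", "app_user")        // placeholder credentials
  props.setProperty("password", "********")

  // Oracle runs the GROUP BY; Spark only pulls the summarised rows
  val totals = sqlContext.read.jdbc(
    "jdbc:oracle:thin:@//dbhost:1521/ORCL",
    "(SELECT account_id, SUM(amount) AS total FROM transactions GROUP BY account_id) t",
    props)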


HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 21 April 2016 at 22:22, Jonathan Gray <jonny.g...@gmail.com> wrote:

> I think I now understand what the problem is and it is, in some ways, to
> do with partitions and, in other ways, to do with memory.
>
> I now think that the database write was not the source of the problem (the
> problem being end-to-end performance).
>
> The application reads rows from a database, does some simple derivations
> then a group by and sum and joins the results back onto the original
> dataframe and finally writes back to a database (at the same granularity as
> the read).
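>
> In rough outline the job looks something like this (the column names are
> made up, and url/props stand for the JDBC connection details):
>
>   import org.apache.spark.sql.functions.{col, sum}
>
>   val df = sqlContext.read.jdbc(url, "source_table", props)
>
>   // simple derived columns
>   val derived = df.withColumn("value_gbp", col("value") * col("fx_rate"))
>
>   // group by and sum, then join the totals back onto the original rows
>   val totals = derived.groupBy("account_id")
>     .agg(sum("value_gbp").as("total_gbp"))
>   val enriched = derived.join(totals, "account_id")
>
>   // write back at the same granularity as the read
>   enriched.write.mode("append").jdbc(url, "target_table", props)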
>
> I would run for 100,000 rows locally and end-to-end performance was
> reasonable; if I went up to 1,000,000 rows, performance was far worse than
> the increase in volume would suggest.  Profiling seemed to indicate that
> the read and write were still happening efficiently in both cases but
> nothing much was happening in between.  Looking at the profiler during this
> period indicated that most of the time was being spent in a
> Platform.copyMemory(...).  As a result I increased the heap and driver
> memory and the performance improved dramatically.
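>
> (Since the master is local[*] the driver JVM is doing all of the work, so in
> practice that just meant launching with a larger driver heap, e.g. via
> spark-submit --driver-memory. A quick way to confirm the new setting took
> effect, assuming sc is the SparkContext:)
>
>   // sanity-check the heap the driver actually got
>   val maxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
>   println(s"driver max heap: ${maxHeapMb} MB, spark.driver.memory = " +
>     sc.getConf.get("spark.driver.memory", "(default)"))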
>
> It turns out that the groupBy key results in a bad distribution where 90%
> of the data ends up in one partition; at the point of write-out, I'm
> guessing that is when Spark decides it has to copy all of that data onto
> one partition/worker.  Increasing the memory available appears to give it
> the ability to do that.  The app has several cache() calls in it to get
> around a codegen issue which probably puts more pressure on memory usage
> and compounds the problem.
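>
> (For anyone hitting the same thing, the skew is easy to see by counting rows
> per partition, and spreading the rows back out before the JDBC write should
> also help. A rough sketch, reusing the names from the outline above and an
> arbitrary partition count:)
>
>   // how badly is the data skewed across partitions?
>   val sizes = enriched.rdd
>     .mapPartitionsWithIndex((i, rows) => Iterator((i, rows.size)))
>     .collect()
>   sizes.sortBy(p => -p._2).take(5).foreach(println)
>
>   // spread rows back out so one partition/worker isn't holding 90% of them
>   enriched.repartition(200)
>     .write.mode("append").jdbc(url, "target_table", props)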
>
> Still, I have learnt that Spark is excellent in its ability to be lazy
> and optimize across the whole end-to-end process from read-to-write.
>
> On 21 April 2016 at 13:46, Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
>> How many partitions are in your data set?
>>
>> Per the Spark DataFrameWriter Java Doc:
>> “
>>
>> Saves the content of the DataFrame
>> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html>
>> to a external database table via JDBC. In the case the table already exists
>> in the external database, behavior of this function depends on the save
>> mode, specified by the mode function (default to throwing an exception).
>>
>> Don't create too many partitions in parallel on a large cluster;
>> otherwise Spark might crash your external database systems.
>>
>> “
>>
>> This implies one connection per partition writing in parallel. So you
>> could be swamping your database.
>> Which database are you using?
>>
>> Also, how many hops?
>> Network latency could also impact performance too…
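>>
>> (To check how many partitions you currently have, and to cap the number of
>> parallel JDBC connections, something along these lines; the coalesce figure
>> is arbitrary, and df/url/connectionProperties stand for your DataFrame and
>> JDBC details:)
>>
>>   // one JDBC connection is opened per partition of the DataFrame
>>   println(s"partitions: ${df.rdd.partitions.length}")
>>
>>   // cap the write at, say, 16 concurrent connections
>>   df.coalesce(16)
>>     .write.mode("append")
>>     .jdbc(url, "target_table", connectionProperties)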
>>
>> On Apr 19, 2016, at 3:14 PM, Jonathan Gray <jonny.g...@gmail.com> wrote:
>>
>> Hi,
>>
>> I'm trying to write ~60 million rows from a DataFrame to a database over
>> JDBC with Spark 1.6.1, something similar to df.write().jdbc(...)
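>>
>> i.e. roughly the following (the connection URL is omitted and the table
>> name is just a placeholder):
>>
>>   import java.util.Properties
>>
>>   val props = new Properties()
>>   props.setProperty("user", "app_user")     // placeholder credentials
>>   props.setProperty("password", "********")
>>
>>   val url = "jdbc:..."                      // connection URL left out here
>>   df.write.mode("append").jdbc(url, "target_table", props)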
>>
>> The write seems to not be performing well.  Profiling the application
>> with a master of local[*] it appears there is not much socket write
>> activity and also not much CPU.
>>
>> I would expect there to be an almost continuous block of socket write
>> activity showing up somewhere in the profile.
>>
>> I can see that the top hot method involves
>> org.apache.spark.unsafe.Platform.copyMemory, all from calls within
>> JdbcUtils.savePartition(...).  However, the CPU doesn't seem particularly
>> stressed so I'm guessing this isn't the cause of the problem.
>>
>> Are there any best practices, or has anyone come across a case like this
>> before, where a write to a database seems to perform poorly?
>>
>> Thanks,
>> Jon
>>
>>
>>
>
