Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Mich Talebzadeh
I would be surprised if Oracle could not handle million-row calculations,
unless you are also using other data in Spark.


HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com





Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Jonathan Gray
I think I now understand what the problem is, and it is, in some ways, to
do with partitions and, in other ways, to do with memory.

I now think that the database write was not the source of the problem (the
problem being end-to-end performance).

The application reads rows from a database, does some simple derivations
then a group by and sum and joins the results back onto the original
dataframe and finally writes back to a database (at the same granularity as
the read).

I would run for 100,000 rows locally and end-to-end performance was
reasonable; if I went up to 1,000,000 rows, performance was far worse than
the increase in volume would suggest.  Profiling seemed to indicate that
the read and write were still happening efficiently in both cases but
nothing much was happening in between.  Looking at the profiler during this
period indicated that most of the time was being spent in a
Platform.copyMemory(...).  As a result I increased the heap and driver
memory and the performance improved dramatically.

It turns out that the groupBy key results in a bad distribution where 90%
of the data ends up in one partition, and I'm guessing that the point of
write-out is when Spark decides it has to copy all of that data onto one
partition/worker.  Increasing the available memory appears to give it the
ability to do that.  The app has several cache() calls in it, to work
around a codegen issue, which probably puts more pressure on memory usage
and compounds the problem.

Still, I have learnt that Spark is excellent in its ability to be lazy and
to optimize across the whole end-to-end process, from read to write.
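Jon's 90%-in-one-partition observation can be illustrated without Spark. The sketch below is a hypothetical, Spark-free analogue (the function and key names are mine, not Spark's): hash partitioning, which is also what Spark's HashPartitioner does, sends every row with the hot groupBy key to the same partition, and "salting" the hot key spreads those rows out:

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    # Hash partitioning: each key lands in partition hash(key) % num_partitions,
    # the same scheme Spark's HashPartitioner uses.
    return Counter(hash(k) % num_partitions for k in keys)

# Skewed input: 90% of the rows share a single groupBy key.
keys = ["hot_key"] * 9_000 + [f"key_{i}" for i in range(1_000)]

sizes = partition_sizes(keys, num_partitions=8)
print(f"largest partition: {max(sizes.values())} of {len(keys)} rows")
# Every "hot_key" row hashes to the same partition, so one partition holds
# at least 90% of the data no matter how many partitions exist.

# A common mitigation is to salt the hot key so its rows spread across
# several sub-keys before the groupBy (partial results are combined later):
salted = [f"hot_key#{i % 8}" for i in range(9_000)] + [f"key_{i}" for i in range(1_000)]
print(f"largest partition after salting: {max(partition_sizes(salted, 8).values())}")
```

Salting is not free: it needs a second aggregation pass to merge the per-salt partial results, but it avoids funnelling 90% of the data through one worker.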



Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Jonathan Gray
I tried increasing the batch size (1,000 to 10,000 to 100,000) but it didn't
appear to make any appreciable difference in my test case.

In addition, I had read in the Oracle JDBC documentation that batch sizes
should be set between 10 and 100, and that anything outside that range was
not advisable. However, I don't have any evidence to prove or disprove that.
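The effect Jon measured can be mimicked outside Spark. Below is a rough, hypothetical analogue of what `batchsize` controls (JdbcUtils buffers that many rows per addBatch/executeBatch round trip), using Python's sqlite3 for illustration; the table name and row counts are made up, and an in-memory database understates the per-round-trip cost a real networked database pays, which may be why the batch size made little difference here:

```python
import sqlite3
import time

def write_rows(conn, rows, batch_size):
    # Insert rows in chunks of batch_size, one statement + commit per chunk --
    # roughly what one addBatch()/executeBatch() round trip does over JDBC.
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        cur.executemany("INSERT INTO t VALUES (?, ?)", rows[start:start + batch_size])
        conn.commit()

rows = [(i, f"row {i}") for i in range(100_000)]
for batch_size in (100, 10_000):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, payload TEXT)")
    t0 = time.perf_counter()
    write_rows(conn, rows, batch_size)
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    print(f"batch_size={batch_size}: wrote {count} rows "
          f"in {time.perf_counter() - t0:.2f}s")
    conn.close()
```

Against a remote database, each batch costs at least one network round trip, so the gap between the two batch sizes would be far larger than it is locally.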


Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Michael Segel
How many partitions are in your data set?

Per the Spark DataFrameWriter Javadoc:

“Saves the content of the DataFrame to an external database table via JDBC.
In the case the table already exists in the external database, behavior of
this function depends on the save mode, specified by the mode function
(default to throwing an exception).

Don't create too many partitions in parallel on a large cluster; otherwise
Spark might crash your external database systems.”

This implies one connection per partition writing in parallel, so you could be
swamping your database.
Which database are you using?

Also, how many hops?
Network latency could impact performance too…
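The "one connection per partition" point is why people often coalesce before a JDBC write: `df.coalesce(n).write().jdbc(...)` caps the number of simultaneous connections at n. A hypothetical, Spark-free sketch of that relationship (the function and variable names are mine, not Spark's):

```python
def coalesce(rows, num_partitions):
    # Split rows round-robin into num_partitions chunks, standing in for
    # df.coalesce(n) before df.write().jdbc(...).
    chunks = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        chunks[i % num_partitions].append(row)
    return chunks

def write_partitions(rows, num_partitions):
    # Spark opens one JDBC connection per partition; here we just count them
    # instead of actually inserting.
    connections_opened = 0
    written = 0
    for chunk in coalesce(rows, num_partitions):
        connections_opened += 1  # one connection per partition
        written += len(chunk)    # ... INSERT the chunk over that connection ...
    return connections_opened, written

conns, written = write_partitions(list(range(1_000)), num_partitions=200)
print(f"{conns} connections for {written} rows")  # 200 connections
conns, written = write_partitions(list(range(1_000)), num_partitions=8)
print(f"{conns} connections for {written} rows")  # 8 connections after coalescing
```

Too many partitions means a connection storm at the database; too few means little write parallelism, so n is a tuning knob against the target database's capacity.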



Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Mich Talebzadeh
What is the end database? Have you checked the performance of your query at
the target?

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com





Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-20 Thread Jörn Franke
Well, it could also depend on the receiving database. You should also check the
executors. Updating to the latest version of the JDBC driver, and to JDK 8 if
the JDBC driver supports it, could help.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-20 Thread Takeshi Yamamuro
Sorry, I accidentally sent my previous message before finishing it.
How about increasing the `batchsize` JDBC option to improve
performance?

// maropu




-- 
---
Takeshi Yamamuro


Spark 1.6.1 DataFrame write to JDBC

2016-04-19 Thread Jonathan Gray
Hi,

I'm trying to write ~60 million rows from a DataFrame to a database over
JDBC using Spark 1.6.1, with something similar to df.write().jdbc(...)

The write does not seem to be performing well.  Profiling the application
with a master of local[*], it appears there is not much socket write
activity and not much CPU usage either.

I would expect there to be an almost continuous block of socket write
activity showing up somewhere in the profile.

I can see that the top hot method involves
org.apache.spark.unsafe.Platform.copyMemory, all from calls within
JdbcUtils.savePartition(...).  However, the CPU doesn't seem particularly
stressed, so I'm guessing this isn't the cause of the problem.

Are there any best practices, or has anyone come across a case like this
before, where a write to a database seems to perform poorly?

Thanks,
Jon