Re: spark session jdbc performance

2017-10-25 Thread Gourav Sengupta
Hi Naveen,

Can you please copy and paste the lines from your original email again?
Perhaps then Lucas can go through them completely and kindly stop thinking
that others respond by assuming things.

On the other hand, please do let me know how things are going; there was
another post on this a few weeks back, and several other users find this
issue equally interesting to resolve.

I might just have the solution for this.

Regards,
Gourav Sengupta



Re: spark session jdbc performance

2017-10-25 Thread lucas.g...@gmail.com
Are we sure the UI is showing only one partition running the query?  The
original poster hasn't replied yet.

My assumption is that there's only one executor configured / deployed.  But
we only know what the OP stated, which wasn't enough to be sure of anything.

Why are you suggesting that partitioning on the PK isn't prudent? And did
you mean to say that 30 partitions were far too many for any system to
handle?  (I'm assuming you misread the original code)

Gary



Re: spark session jdbc performance

2017-10-25 Thread Gourav Sengupta
Hi Lucas,

So if I am assuming things, can you please explain why the UI is showing
only one partition running the query?


Regards,
Gourav Sengupta



Re: spark session jdbc performance

2017-10-25 Thread lucas.g...@gmail.com
Gourav, I'm assuming you misread the code.  It's 30 partitions, which isn't
a ridiculous value.  Maybe you misread the upperBound as the number of
partitions?  (That would be ridiculous)

Why not use the PK as the partition column?  Obviously it depends on the
downstream queries.  If you're going to be performing joins (which I assume
is the case), then partitioning on the join column would be advisable, but
what about the case where the join column is heavily skewed?
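
For what it's worth, one common workaround for a heavily skewed join key is
to salt it; a rough sketch (the DataFrames leftDf/rightDf and the "id"
column here are made up, not from this thread):

import org.apache.spark.sql.functions._

// Rough sketch of salting a skewed join key.
val numSalts = 16

// Large, skewed side: tag each row with a random salt in [0, numSalts).
val saltedLeft = leftDf.withColumn("salt", (rand() * numSalts).cast("int"))

// Smaller side: replicate every row once per salt value.
val saltedRight = rightDf.withColumn("salt",
  explode(array((0 until numSalts).map(lit): _*)))

// Joining on (id, salt) spreads the hot key across numSalts partitions.
val joined = saltedLeft.join(saltedRight, Seq("id", "salt")).drop("salt")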

Thanks!

Gary



Re: spark session jdbc performance

2017-10-24 Thread Gourav Sengupta
Hi Naveen,

I do not think that it is prudent to use the PK as the partitionColumn.
That is too many partitions for any system to handle. The numPartitions
setting behaves quite differently in the JDBC case.
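
For context, as far as I know Spark uses lowerBound and upperBound only to
compute the partition stride; they do not filter rows. Roughly (a
simplified sketch of the documented behaviour, not Spark's actual source):

// Simplified sketch of how the JDBC source derives per-partition
// predicates from the options in the original post.
val lower = 1L
val upper = 50L
val numPartitions = 30
val stride = (upper - lower) / numPartitions  // = 1 here

val predicates = (0 until numPartitions).map { i =>
  val lo = lower + i * stride
  val hi = lo + stride
  if (i == 0) s"id < $hi OR id IS NULL"            // open at the bottom
  else if (i == numPartitions - 1) s"id >= $lo"    // open at the top
  else s"id >= $lo AND id < $hi"
}
// If the real ids run far beyond 50, the open-ended last predicate
// fetches almost the whole table, so one task does nearly all the work.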

Please keep me updated on how things go.


Regards,
Gourav Sengupta



Re: spark session jdbc performance

2017-10-24 Thread Srinivasa Reddy Tatiredidgari
Hi, is the subquery a user-defined SQL statement or a table name in the DB?
If it is user-defined SQL, make sure your partition column is in the main
SELECT clause.
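
For example (a sketch; the inner SQL and table are made up), a subquery
passed as dbtable needs to be wrapped in parentheses and aliased, with the
partition column in the outer SELECT list:

// Sketch: using a subquery as "dbtable". The inner SQL is made up;
// note the partition column ("id") is selected explicitly.
val subquery = "(SELECT id, name, amount FROM orders WHERE status = 'OPEN') tmp"

val df = spark_session.read.format("jdbc")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("url", jdbc_url)
  .option("user", user)
  .option("password", pwd)
  .option("dbtable", subquery)
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "50")
  .option("numPartitions", 30)
  .load()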



Re: spark session jdbc performance

2017-10-24 Thread lucas.g...@gmail.com
Sorry, I meant to say: "That code looks SANE to me"

Assuming that you're seeing the query running partitioned as expected, then
you're likely configured with one executor.  That's very easy to check in the UI.
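
If it does turn out to be a single executor, the executor count is a
launch-time setting; a sketch (the values are illustrative, and
spark.executor.instances applies to YARN-style resource managers):

import org.apache.spark.sql.SparkSession

// Illustrative settings only; sizing depends entirely on the cluster.
val spark_session = SparkSession.builder()
  .appName("jdbc-read")
  .config("spark.executor.instances", "10")  // number of executors
  .config("spark.executor.cores", "4")       // cores per executor
  .config("spark.executor.memory", "4g")
  .getOrCreate()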

Gary Lucas



Re: spark session jdbc performance

2017-10-24 Thread lucas.g...@gmail.com
Did you check the query plan / check the UI?

That code looks same to me.  Maybe you've only configured for one executor?
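
Concretely, two quick checks on the DataFrame from the original post (a sketch):

// Quick diagnostics for the df built in the original post.
df.explain()                       // prints the physical plan, incl. the JDBC scan
println(df.rdd.getNumPartitions)   // partition count of the underlying scan

// Running an action while watching the stage in the UI shows how many
// tasks, and on how many executors, actually execute.
println(df.count())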

Gary



spark session jdbc performance

2017-10-24 Thread Naveen Madhire
Hi,



I am trying to fetch data from Oracle DB using a subquery and am
experiencing a lot of performance issues.



Below is the code I am using:



Using Spark 2.0.2


val df = spark_session.read.format("jdbc")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("url", jdbc_url)
  .option("user", user)
  .option("password", pwd)
  .option("dbtable", "subquery")
  .option("partitionColumn", "id")  // primary key column, uniformly distributed
  .option("lowerBound", "1")
  .option("upperBound", "50")
  .option("numPartitions", 30)
  .load()
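
For reference, the same partitioned read can also be written with the
explicit jdbc() overload; a sketch reusing the variables above:

import java.util.Properties

// Same read via the explicit overload: (url, table, partition column,
// lowerBound, upperBound, numPartitions, connection properties).
val props = new Properties()
props.setProperty("user", user)
props.setProperty("password", pwd)
props.setProperty("driver", "oracle.jdbc.OracleDriver")

val df2 = spark_session.read.jdbc(jdbc_url, "subquery", "id", 1L, 50L, 30, props)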



The above read is configured with 30 partitions, but when I look at the UI
it is only using 1 partition to run the query.



Can anyone tell me if I am missing anything, or whether I need to do
anything else to tune the performance of the query?

Thanks

