Re: spark session jdbc performance
Hi Naveen,

Can you please copy and paste the lines from your original email again? Perhaps then Lucas can go through them completely and stop thinking that others are responding by assuming things. On the other hand, please do let me know how things are going: there was another post on this a few weeks back, and several other users are equally interested in seeing this issue resolved. I might just have the solution for this.

Regards,
Gourav Sengupta
Re: spark session jdbc performance
Are we seeing that the UI is showing only one partition running the query? The original poster hasn't replied yet.

My assumption is that there's only one executor configured / deployed. But we only know what the OP stated, which wasn't enough to be sure of anything.

Why are you suggesting that partitioning on the PK isn't prudent? And did you mean to say that 30 partitions were far too many for any system to handle? (I'm assuming you misread the original code.)

Gary
Re: spark session jdbc performance
Hi Lucas,

So if I am assuming things, can you please explain why the UI is showing only one partition running the query?

Regards,
Gourav Sengupta
Re: spark session jdbc performance
Gourav, I'm assuming you misread the code. It's 30 partitions, which isn't a ridiculous value. Maybe you misread the upperBound as the number of partitions? (That would be ridiculous.)

Why not use the PK as the partition column? Obviously it depends on the downstream queries. If you're going to be performing joins (which I assume is the case), then partitioning on the join column would be advisable; but what about the case where the join column is heavily skewed?

Thanks!

Gary
Re: spark session jdbc performance
Hi Naveen,

I do not think that it is prudent to use the PK as the partitionColumn; that would be too many partitions for any system to handle. numPartitions behaves quite differently in the case of JDBC.

Please keep me updated on how things go.

Regards,
Gourav Sengupta
Re: spark session jdbc performance
Hi,

Is "subquery" a user-defined SQL statement or a table name in the DB? If it is user-defined SQL, make sure your partition column is in the main SELECT clause.
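To illustrate the point above: the Spark JDBC source substitutes the dbtable option into a query of roughly the form SELECT ... FROM <dbtable> WHERE <partition predicate>, so a subquery passed as dbtable must be a valid FROM-clause item: parenthesized, given an alias, and exposing the partition column in its select list. A minimal sketch of that substitution (the table and column names here are made up for illustration, not from the original post):

```python
# The JDBC source builds roughly "SELECT ... FROM <dbtable> WHERE <predicate>",
# so a subquery passed as dbtable must be parenthesized, aliased, and must
# select the partition column ("id" here) in its outermost SELECT.
subquery = "(SELECT id, name, amount FROM orders WHERE status = 'OPEN') t"

def rendered_query(dbtable, predicate):
    # Approximation of the per-partition query Spark issues.
    return f"SELECT * FROM {dbtable} WHERE {predicate}"

print(rendered_query(subquery, "id >= 2 AND id < 3"))
```

If the partition column is buried inside the subquery and not selected at the top level, the generated WHERE clause cannot reference it and the read fails or degenerates.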
Re: spark session jdbc performance
Sorry, I meant to say: "That code looks SANE to me."

Assuming that you're seeing the query run partitioned as expected, then you're likely configured with one executor. Very easy to check in the UI.

Gary Lucas
Re: spark session jdbc performance
Did you check the query plan / check the UI?

That code looks same to me. Maybe you've only configured for one executor?

Gary
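If the read really is split into 30 partitions but they run one after another, the single-executor hypothesis above is easy to rule out: the Executors tab of the UI shows how many executors (and cores) the application actually got, and outside of dynamic allocation the count is fixed at submit time. A config fragment along these lines (the sizing numbers are placeholders, not recommendations):

```shell
# Submit with an explicit executor count so the 30 JDBC partitions
# can actually run in parallel (placeholder sizing, YARN assumed).
spark-submit \
  --master yarn \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 4g \
  my_job.jar
```

With the defaults on some cluster managers you can end up with a single executor, in which case 30 partitions still execute mostly serially.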
spark session jdbc performance
Hi,

I am trying to fetch data from an Oracle DB using a subquery and am experiencing a lot of performance issues.

Below is the code I am using (Spark 2.0.2):

val df = spark_session.read.format("jdbc")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("url", jdbc_url)
  .option("user", user)
  .option("password", pwd)
  .option("dbtable", "subquery")
  .option("partitionColumn", "id") // primary key column, uniformly distributed
  .option("lowerBound", "1")
  .option("upperBound", "50")
  .option("numPartitions", 30)
  .load()

The read is configured with 30 partitions, but when I look at the UI, only 1 partition is being used to run the query.

Can anyone tell me if I am missing anything, or whether I need to do anything else to tune the performance of the query?

Thanks
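The symptom in this post (30 configured partitions, but one task doing the work) can follow directly from the bounds themselves. Spark derives the per-partition WHERE predicates purely from lowerBound, upperBound, and numPartitions; rows outside the bounds are not dropped but folded into the first or last partition. A rough Python sketch of that stride logic (modeled on Spark 2.x's JDBC column partitioning; the function name and shapes here are illustrative, not Spark's API):

```python
# Sketch of how Spark 2.x derives JDBC partition predicates from
# (lowerBound, upperBound, numPartitions). Illustrative only.
def column_partitions(column, lower, upper, num_partitions):
    stride = upper // num_partitions - lower // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        l_bound = f"{column} >= {current}" if i != 0 else None
        current += stride
        u_bound = f"{column} < {current}" if i != num_partitions - 1 else None
        if l_bound and u_bound:
            predicates.append(f"{l_bound} AND {u_bound}")
        else:
            predicates.append(l_bound or u_bound)
    return predicates

parts = column_partitions("id", lower=1, upper=50, num_partitions=30)
print(parts[0])    # id < 2
print(parts[-1])   # id >= 30
```

With lowerBound=1 and upperBound=50 the stride is 1, so each of the first 29 partitions covers a single id value and the last partition is unbounded above: every row with id >= 30 lands in it. If the table's real ids run into the thousands or millions, one task reads almost the whole table, which looks exactly like "only 1 partition" in the UI. Setting lowerBound/upperBound from the actual MIN(id)/MAX(id) should spread the rows evenly.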