Taking this off-list. Start here: https://github.com/apache/spark/blob/70ec696bce7012b25ed6d8acec5e2f3b3e127f11/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L144 Look at subclasses of JdbcDialect too, like TeradataDialect. Note that you are using an old, unsupported version, too; that's a link to master.
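To illustrate the pointer above: a minimal, self-contained sketch of how JdbcDialect-style subclasses customize the table-existence probe. This is not the actual Spark source; the class names here (DialectSketch, TeradataLikeDialect) are illustrative assumptions, not real Spark classes.

```java
// Minimal sketch, NOT the actual Spark source: each database dialect
// can override the query used to probe whether a table exists.
class DialectSketch {
    // Generic default: WHERE 1=0 returns zero rows, so the probe is cheap.
    String getTableExistsQuery(String table) {
        return "SELECT * FROM " + table + " WHERE 1=0";
    }
}

// Some dialects probe with "SELECT 1 ... LIMIT 1" instead. Teradata has
// no LIMIT clause (it uses TOP), so a hypothetical override that drops
// the row cap would emit "SELECT 1 FROM t" -- returning one row per
// table row, matching the behavior reported in this thread.
class TeradataLikeDialect extends DialectSketch {
    @Override
    String getTableExistsQuery(String table) {
        return "SELECT 1 FROM " + table; // no LIMIT/TOP: scans every row
    }
}
```

The point of checking the dialect subclasses is exactly this: the same logical probe can be cheap or a full scan depending on which SQL the dialect emits.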
On Fri, Nov 18, 2022 at 5:50 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:

> Hi Sean,
>
> Can you please let me know what query Spark internally fires for
> getting the count of a dataframe:
>
> long count = dataframe.count();
>
> Is it
>
> SELECT 1 FROM (QUERY) SUB_TABL
>
> summing up all the 1s in the response, or directly
>
> SELECT COUNT(*) FROM (QUERY) SUB_TABL
>
> Can you please clarify which approach Spark follows?
>
> Thanks,
> Ramakrishna Rayudu
>
> On Fri, Nov 18, 2022, 8:13 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:
>
>> Sure, I will test with the latest Spark and let you know the result.
>>
>> Thanks,
>> Rama
>>
>> On Thu, Nov 17, 2022, 11:16 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> Weird, does Teradata not support LIMIT n? Looking at the Spark source
>>> code suggests it won't; the syntax is "SELECT TOP". I wonder if that's
>>> why the generic query that seems to test existence loses the LIMIT.
>>> But that "SELECT 1" test seems to be used for MySQL and Postgres, so
>>> I'm still not sure where it's coming from, or if it's coming from
>>> Spark. You're using the Teradata dialect, I assume. Can you use the
>>> latest Spark to test?
>>>
>>> On Thu, Nov 17, 2022 at 11:31 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:
>>>
>>>> Yes, I am sure that we are not generating this kind of query. Okay,
>>>> then the problem is that LIMIT is not coming up in the query. Can
>>>> you please suggest a direction?
>>>>
>>>> Thanks,
>>>> Rama
>>>>
>>>> On Thu, Nov 17, 2022, 10:56 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> Hm, the existence queries even in 2.4.x had LIMIT 1. Are you sure
>>>>> nothing else is generating or changing those queries?
>>>>>
>>>>> On Thu, Nov 17, 2022 at 11:20 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:
>>>>>
>>>>>> We are using Spark version 2.4.4.
>>>>>> I can see two types of queries in the DB logs.
>>>>>>
>>>>>> SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0
>>>>>>
>>>>>> SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0
>>>>>>
>>>>>> The `SELECT *` query ends with `WHERE 1=0`, but the query that
>>>>>> starts with `SELECT 1` has no WHERE condition.
>>>>>>
>>>>>> Thanks,
>>>>>> Rama
>>>>>>
>>>>>> On Thu, Nov 17, 2022, 10:39 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> Hm, actually that doesn't look like the queries that Spark uses
>>>>>>> to test existence, which will be "SELECT 1 ... LIMIT 1" or
>>>>>>> "SELECT * ... WHERE 1=0" depending on the dialect. What version,
>>>>>>> and are you sure something else is not sending those queries?
>>>>>>>
>>>>>>> On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Sean,
>>>>>>>>
>>>>>>>> Thanks for your response. I think it has a performance impact,
>>>>>>>> because if the query returns one million rows then the response
>>>>>>>> itself will contain one million rows unnecessarily, like below:
>>>>>>>>
>>>>>>>> 1
>>>>>>>> 1
>>>>>>>> 1
>>>>>>>> 1
>>>>>>>> .
>>>>>>>> .
>>>>>>>> 1
>>>>>>>>
>>>>>>>> It impacts performance. Is there any alternative solution for
>>>>>>>> this?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rama
>>>>>>>>
>>>>>>>> On Thu, Nov 17, 2022, 10:17 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> This is a query to check the existence of the table upfront.
>>>>>>>>> It is nearly a no-op query; can it have a perf impact?
>>>>>>>>>
>>>>>>>>> On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Team,
>>>>>>>>>>
>>>>>>>>>> I am facing one issue. Can you please help me with this?
>>>>>>>>>>
>>>>>>>>>> We are connecting to Teradata from Spark SQL with the API
>>>>>>>>>> below:
>>>>>>>>>>
>>>>>>>>>> Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl,
>>>>>>>>>> tableQuery, connectionProperties);
>>>>>>>>>>
>>>>>>>>>> When we execute the above logic on a large table with a
>>>>>>>>>> million rows, we see the extra query below executing every
>>>>>>>>>> time, resulting in a performance hit on the DB.
>>>>>>>>>>
>>>>>>>>>> This information came from our DBA; we don't have any logs
>>>>>>>>>> on the Spark SQL side.
>>>>>>>>>>
>>>>>>>>>> SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
>>>>>>>>>>
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>>
>>>>>>>>>> Can you please clarify why this query is executing, or
>>>>>>>>>> whether there is any chance this type of query is executing
>>>>>>>>>> from our code itself while checking the row count of a
>>>>>>>>>> dataframe?
>>>>>>>>>>
>>>>>>>>>> Please provide your inputs on this.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Rama
>>>>>>>>>
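On the count question raised up-thread: to my understanding, Spark 2.4 does not push COUNT(*) down to JDBC sources; it prunes the required column list to nothing, substitutes a literal placeholder in the select list, and counts the returned rows on the executors. A minimal sketch of that select-list behavior, loosely modeled on Spark's JDBC scan (the method name buildSelect is an assumption, not a Spark API):

```java
// Sketch, loosely modeled on Spark's JDBC select-list handling (not
// copied from it): when no columns are required -- e.g. for
// dataframe.count() -- the select list collapses to the literal "1",
// producing exactly the "SELECT 1 FROM (query) SPARK_GEN_SUB_0"
// pattern seen in the DB logs, with one "1" per row.
class JdbcSelectSketch {
    static String buildSelect(String[] requiredColumns, String tableQuery) {
        String columnList = requiredColumns.length == 0
                ? "1"                               // placeholder column
                : String.join(", ", requiredColumns);
        return "SELECT " + columnList
                + " FROM (" + tableQuery + ") SPARK_GEN_SUB_0";
    }
}
```

Under that reading, the per-row 1s are not an existence probe at all but the count scan itself: the rows still cross the wire, which is why newer Spark versions that can push some aggregates down to the JDBC source help with exactly this pattern.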