Taking this off-list. Start here: https://github.com/apache/spark/blob/70ec696bce7012b25ed6d8acec5e2f3b3e127f11/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L144 Look at subclasses of JdbcDialect too, like TeradataDialect. Note that you are using an old, unsupported version, too; that's a link to master.
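To illustrate the pointer above: a minimal, self-contained sketch of how JdbcDialect-style subclasses customize the table-existence probe. This is not the actual Spark source; the class names here (DialectSketch, TeradataLikeDialect) are illustrative assumptions, not real Spark classes.

```java
// Minimal sketch, NOT the actual Spark source: each database dialect
// can override the query used to probe whether a table exists.
class DialectSketch {
    // Generic default: WHERE 1=0 returns zero rows, so the probe is cheap.
    String getTableExistsQuery(String table) {
        return "SELECT * FROM " + table + " WHERE 1=0";
    }
}

// Some dialects probe with "SELECT 1 ... LIMIT 1" instead. Teradata has
// no LIMIT clause (it uses TOP), so a hypothetical override that drops
// the row cap would emit "SELECT 1 FROM t" -- returning one row per
// table row, matching the behavior reported in this thread.
class TeradataLikeDialect extends DialectSketch {
    @Override
    String getTableExistsQuery(String table) {
        return "SELECT 1 FROM " + table; // no LIMIT/TOP: scans every row
    }
}
```

The point of checking the dialect subclasses is exactly this: the same logical probe can be cheap or a full scan depending on which SQL the dialect emits.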
On Fri, Nov 18, 2022 at 5:50 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:

> Hi Sean,
>
> Can you please let me know what query Spark internally fires for
> getting the count of a dataframe:
>
> long count = dataframe.count();
>
> Is it
>
> SELECT 1 FROM (QUERY) SUB_TABL
>
> summing up all the 1s in the response, or directly
>
> SELECT COUNT(*) FROM (QUERY) SUB_TABL
>
> Can you please clarify which approach Spark follows?
>
> Thanks,
> Ramakrishna Rayudu
>
> On Fri, Nov 18, 2022, 8:13 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:
>
>> Sure, I will test with the latest Spark and let you know the result.
>>
>> Thanks,
>> Rama
>>
>> On Thu, Nov 17, 2022, 11:16 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> Weird, does Teradata not support LIMIT n? Looking at the Spark source
>>> code suggests it won't; the syntax is "SELECT TOP". I wonder if that's
>>> why the generic query that seems to test existence loses the LIMIT.
>>> But that "SELECT 1" test seems to be used for MySQL and Postgres, so
>>> I'm still not sure where it's coming from, or if it's coming from
>>> Spark. You're using the Teradata dialect, I assume. Can you use the
>>> latest Spark to test?
>>>
>>> On Thu, Nov 17, 2022 at 11:31 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:
>>>
>>>> Yes, I am sure that we are not generating this kind of query. Okay,
>>>> then the problem is that LIMIT is not coming up in the query. Can
>>>> you please suggest a direction?
>>>>
>>>> Thanks,
>>>> Rama
>>>>
>>>> On Thu, Nov 17, 2022, 10:56 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> Hm, the existence queries even in 2.4.x had LIMIT 1. Are you sure
>>>>> nothing else is generating or changing those queries?
>>>>>
>>>>> On Thu, Nov 17, 2022 at 11:20 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:
>>>>>
>>>>>> We are using Spark version 2.4.4.
>>>>>> I can see two types of queries in the DB logs.
>>>>>>
>>>>>> SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0
>>>>>>
>>>>>> SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0
>>>>>>
>>>>>> The `SELECT *` query ends with `WHERE 1=0`, but the query that
>>>>>> starts with `SELECT 1` has no WHERE condition.
>>>>>>
>>>>>> Thanks,
>>>>>> Rama
>>>>>>
>>>>>> On Thu, Nov 17, 2022, 10:39 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> Hm, actually that doesn't look like the queries that Spark uses
>>>>>>> to test existence, which will be "SELECT 1 ... LIMIT 1" or
>>>>>>> "SELECT * ... WHERE 1=0" depending on the dialect. What version,
>>>>>>> and are you sure something else is not sending those queries?
>>>>>>>
>>>>>>> On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Sean,
>>>>>>>>
>>>>>>>> Thanks for your response. I think it has a performance impact,
>>>>>>>> because if the query returns one million rows then the response
>>>>>>>> itself will contain one million rows unnecessarily, like below:
>>>>>>>>
>>>>>>>> 1
>>>>>>>> 1
>>>>>>>> 1
>>>>>>>> 1
>>>>>>>> .
>>>>>>>> .
>>>>>>>> 1
>>>>>>>>
>>>>>>>> It impacts performance. Is there any alternative solution for
>>>>>>>> this?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rama
>>>>>>>>
>>>>>>>> On Thu, Nov 17, 2022, 10:17 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> This is a query to check the existence of the table upfront.
>>>>>>>>> It is nearly a no-op query; can it have a perf impact?
>>>>>>>>>
>>>>>>>>> On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu <ramakrishna560.ray...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Team,
>>>>>>>>>>
>>>>>>>>>> I am facing one issue. Can you please help me with this?
>>>>>>>>>>
>>>>>>>>>> We are connecting to Teradata from Spark SQL with the API
>>>>>>>>>> below:
>>>>>>>>>>
>>>>>>>>>> Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl,
>>>>>>>>>> tableQuery, connectionProperties);
>>>>>>>>>>
>>>>>>>>>> When we execute the above logic on a large table with a
>>>>>>>>>> million rows, we see the extra query below executing every
>>>>>>>>>> time, resulting in a performance hit on the DB.
>>>>>>>>>>
>>>>>>>>>> This information came from our DBA; we don't have any logs
>>>>>>>>>> on the Spark SQL side.
>>>>>>>>>>
>>>>>>>>>> SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
>>>>>>>>>>
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>> 1
>>>>>>>>>>
>>>>>>>>>> Can you please clarify why this query is executing, or
>>>>>>>>>> whether there is any chance this type of query is executing
>>>>>>>>>> from our code itself while checking the row count of a
>>>>>>>>>> dataframe?
>>>>>>>>>>
>>>>>>>>>> Please provide your inputs on this.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Rama
>>>>>>>>>
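On the count question raised up-thread: to my understanding, Spark 2.4 does not push COUNT(*) down to JDBC sources; it prunes the required column list to nothing, substitutes a literal placeholder in the select list, and counts the returned rows on the executors. A minimal sketch of that select-list behavior, loosely modeled on Spark's JDBC scan (the method name buildSelect is an assumption, not a Spark API):

```java
// Sketch, loosely modeled on Spark's JDBC select-list handling (not
// copied from it): when no columns are required -- e.g. for
// dataframe.count() -- the select list collapses to the literal "1",
// producing exactly the "SELECT 1 FROM (query) SPARK_GEN_SUB_0"
// pattern seen in the DB logs, with one "1" per row.
class JdbcSelectSketch {
    static String buildSelect(String[] requiredColumns, String tableQuery) {
        String columnList = requiredColumns.length == 0
                ? "1"                               // placeholder column
                : String.join(", ", requiredColumns);
        return "SELECT " + columnList
                + " FROM (" + tableQuery + ") SPARK_GEN_SUB_0";
    }
}
```

Under that reading, the per-row 1s are not an existence probe at all but the count scan itself: the rows still cross the wire, which is why newer Spark versions that can push some aggregates down to the JDBC source help with exactly this pattern.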