[jira] [Commented] (SPARK-43999) Data is still fetched even though result was returned

Jia Fan (Jira) Mon, 26 Jun 2023 00:34:05 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-43999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737026#comment-17737026
 ]


Jia Fan commented on SPARK-43999:
---------------------------------

The reason are when AQE on, the small empty table alway return faster when join 
on two table. When get the left result, the AQE optimizer will use stats to 
reOptimize plan, so the right table result will be unnecessary. The result 
return, but the right table query stage will not be cancel or forch finish at 
now. 

> Data is still fetched even though result was returned
> -----------------------------------------------------
>
>                 Key: SPARK-43999
>                 URL: https://issues.apache.org/jira/browse/SPARK-43999
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.0
>         Environment: Production
>            Reporter: Kamil Kliczbor
>            Priority: Major
>         Attachments: Profiler.PNG
>
>
> h2. Short problem description:
> I have two tables:
>  * tab1 is empty
>  * tab6 has milions of records
>  * when Spark returns results due to empty database table tab1, it still asks 
> for the tab6 data
> When I create the query that uses LEFT JOIN, the results are returned 
> immediately, however under the hood the huge table is requested to return the 
> results anyway.
> h2. Repro:
> h3. Prepare the MSSQL server database 
> 1. Install SQLExpress (in my case MSSQL2012, but can be any version) as a 
> named instance SQL2012.
> 2. Download and install 
> [SSMS|https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16]
>  (or any other tool) and run the following Query
> {code:sql}
> USE [master]
> GO
> CREATE DATABASE QueueSlots
> GO
> CREATE LOGIN [spark] WITH PASSWORD=N'spark', DEFAULT_DATABASE=[master], 
> CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
> GO
> USE [QueueSlots]
> GO
> CREATE USER [spark] FOR LOGIN [spark] WITH DEFAULT_SCHEMA=[dbo]
> GO
> {code}
> 3. Then create the tables and fill the tab6 with the data:
> {code:sql}
> CREATE TABLE tab1 (Id INT, Name NVARCHAR(50))
> CREATE TABLE tab6 (Id INT, Name NVARCHAR(50))
> insert into tab6
> select o1.object_id as Id , o1.name as Name
> from sys.objects as o1 
> cross join sys.objects as o2
> cross join sys.objects as o3
> cross join sys.objects as o4
> -- it might be required to increase the numer of the cross joins to increase 
> the number of the rows, approximately 1 mln is enough - select should take 
> several seconds
> {code}
> h3. Prepare Spark
>  # Download mssql jdbc driver in version 12.2.0
>  # Run spark-shell2.cmd with the settings -cp 
> "<your_path>/lib/sqljdbc/12.2/mssql-jdbc-12.2.0.jre8.jar"
> h3. Create temporary views on Spark
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab1
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab1',
>   user 'spark',
>   password 'spark'
> )
> """)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab6
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab6',
>   user 'spark',
>   password 'spark'
> )
> """)
> {code}
> h3. Enable SQL Server Profiler tracing
>  # Go to SSMS and open Sql Server Profiler (Tools -> Sql Server Profiler). 
> Create new trace to the "QueueSlots" database. Use filtering options to see 
> only queries issued for that database (Events Selection tab -> check Show all 
> events and Show all columns, then click Column Filters -> DatabaseName like 
> QueueSlots).
>  # Run the trace
> h3. Run the query in Spark console
>  # Run the following query
> {code:java}
> sqlContext.sql("""
> SELECT t1.Id, t1.Name, t6.Name
>   FROM tab1 as t1
>   LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
> """).show
> {code}
> The results are returned immediately as:
> {code:java}
> +---+----+----+
> | Id|Name|Name|
> +---+----+----+
> +---+----+----+
> [Stage 63:>                                                         (0 + 1) / 
> 1]
> {code}
> h3. {color:#00875a}Expected{color}
> As the results are returned immediately for empty table, another sources are 
> not queried.
> h3. {color:#de350b}Given:{color}
> The table6 is requested to return the data even though it is not being used 
> and it is CPU and IO consuming operation.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-43999) Data is still fetched even though result was returned

Reply via email to