[ https://issues.apache.org/jira/browse/SPARK-43999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kamil Kliczbor updated SPARK-43999:
-----------------------------------
Description:
h2. Short problem description:

I have two tables:
* tab1 is empty
* tab6 has millions of records
* even though Spark returns the result immediately because the database table tab1 is empty, it still fetches the tab6 data

When I run a query that uses LEFT JOIN, the results are returned immediately; under the hood, however, the huge table is queried anyway.

h2. Repro:

h3. Prepare the MSSQL server database

1. Install SQL Server Express (MSSQL 2012 in my case, but any version will do) as a named instance SQL2012.
2. Download and install [SSMS|https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16] (or any other tool) and run the following query:
{code:sql}
USE [master]
GO
CREATE DATABASE QueueSlots
GO
CREATE LOGIN [spark] WITH PASSWORD=N'spark', DEFAULT_DATABASE=[master], CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
GO
USE [QueueSlots]
GO
CREATE USER [spark] FOR LOGIN [spark] WITH DEFAULT_SCHEMA=[dbo]
GO
{code}
3. Then create the tables and fill tab6 with data:
{code:sql}
CREATE TABLE tab1 (Id INT, Name NVARCHAR(50))
CREATE TABLE tab6 (Id INT, Name NVARCHAR(50))

INSERT INTO tab6
SELECT o1.object_id AS Id, o1.name AS Name
FROM sys.objects AS o1
CROSS JOIN sys.objects AS o2
CROSS JOIN sys.objects AS o3
CROSS JOIN sys.objects AS o4
-- it might be necessary to add more cross joins to increase the number of rows;
-- approximately 1 million rows is enough - the select should take several seconds
{code}

h3. Prepare Spark

# Download the MSSQL JDBC driver, version 12.2.0.
# Run spark-shell2.cmd with -cp "<your_path>/lib/sqljdbc/12.2/mssql-jdbc-12.2.0.jre8.jar"

h3. Create temporary views in Spark

{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

sqlContext.sql("""
CREATE TEMPORARY VIEW tab1
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
  url 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
  dbtable 'dbo.Tab1',
  user 'spark',
  password 'spark'
)
""")

sqlContext.sql("""
CREATE TEMPORARY VIEW tab6
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
  url 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
  dbtable 'dbo.Tab6',
  user 'spark',
  password 'spark'
)
""")
{code}

h3. Enable SQL Server Profiler tracing

# In SSMS, open SQL Server Profiler (Tools -> SQL Server Profiler). Create a new trace against the "QueueSlots" database. Use the filtering options to see only queries issued for that database (Events Selection tab -> check Show all events and Show all columns, then click Column Filters -> DatabaseName like QueueSlots).
# Run the trace.

h3. Run the query in the Spark console

# Run the following query:
{code:java}
sqlContext.sql("""
  SELECT t1.Id, t1.Name, t6.Name
  FROM tab1 AS t1
  LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
""").show
{code}
The results are returned immediately as:
{code:java}
+---+----+----+
| Id|Name|Name|
+---+----+----+
+---+----+----+
[Stage 63:>                                                          (0 + 1) / 1]
{code}

h3. {color:#00875a}Expected{color}

Since the result is returned immediately for the empty table, the other sources are not queried.

h3. {color:#de350b}Actual:{color}

tab6 is asked to return its data even though that data is never used, and this is a CPU- and IO-consuming operation.

!image-2023-06-07-16-25-05-800.png!
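As a diagnostic aid, the plan Spark chooses for this query can be inspected without executing it. This is a minimal spark-shell sketch against the tab1/tab6 temporary views created in the repro above; it uses only the standard `Dataset.explain` API, and `explain(true)` prints the parsed, analyzed, optimized, and physical plans:

{code:java}
// Minimal sketch, run in the same spark-shell session as the repro above.
// explain(true) compiles and prints the query plans without running the query,
// so one can check whether a scan of the JDBC relation for dbo.Tab6
// survives into the physical plan.
sqlContext.sql("""
  SELECT t1.Id, t1.Name, t6.Name
  FROM tab1 AS t1
  LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
""").explain(true)
{code}

If the physical plan contains a join with a JDBC relation scan on the tab6 side, the large table will be read at execution time regardless of how many rows tab1 actually returns.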
> Data is still fetched even though result was returned
> -----------------------------------------------------
>
>                 Key: SPARK-43999
>                 URL: https://issues.apache.org/jira/browse/SPARK-43999
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.2
>         Environment: Production
>            Reporter: Kamil Kliczbor
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org