[ 
https://issues.apache.org/jira/browse/SPARK-43999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamil Kliczbor updated SPARK-43999:
-----------------------------------
    Description: 
h2. Short problem description:

I have two tables:
 * tab1 is empty
 * tab6 has millions of records
 * even though Spark returns the result immediately because the database table tab1 is empty, it still asks for the tab6 data

When I run a query that uses a LEFT JOIN, the results are returned immediately; under the hood, however, the huge table is still requested to return its data.
h2. Repro:
h3. Prepare the MSSQL server database 

1. Install SQL Server Express (in my case MSSQL 2012, but it can be any version) as a named instance SQL2012.
2. Download and install [SSMS|https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16] (or any other tool) and run the following query:
{code:sql}
USE [master]
GO

CREATE DATABASE QueueSlots
GO

CREATE LOGIN [spark] WITH PASSWORD=N'spark', DEFAULT_DATABASE=[master], 
CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
GO

USE [QueueSlots]
GO

CREATE USER [spark] FOR LOGIN [spark] WITH DEFAULT_SCHEMA=[dbo]
GO
{code}
3. Then create the tables and fill tab6 with data:
{code:sql}
CREATE TABLE tab1 (Id INT, Name NVARCHAR(50))
CREATE TABLE tab6 (Id INT, Name NVARCHAR(50))

INSERT INTO tab6
SELECT o1.object_id AS Id, o1.name AS Name
FROM sys.objects AS o1
CROSS JOIN sys.objects AS o2
CROSS JOIN sys.objects AS o3
CROSS JOIN sys.objects AS o4
-- it might be required to add more CROSS JOINs to increase the number of rows;
-- approximately 1 million is enough - the SELECT should take several seconds
{code}
h3. Prepare Spark
 # Download the MSSQL JDBC driver, version 12.2.0
 # Run spark-shell2.cmd with the classpath option -cp "<your_path>/lib/sqljdbc/12.2/mssql-jdbc-12.2.0.jre8.jar"

h3. Create temporary views on Spark
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

sqlContext.sql("""
CREATE TEMPORARY VIEW tab1
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
  url 
'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
  
  dbtable 'dbo.Tab1',
  user 'spark',
  password 'spark'
)
""")

sqlContext.sql("""
CREATE TEMPORARY VIEW tab6
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
  url 
'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
  
  dbtable 'dbo.Tab6',
  user 'spark',
  password 'spark'
)
""")
{code}
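As an optional sanity check that the views are wired up correctly, counting the empty tab1 is cheap; counting tab6 this way may pull the whole table over JDBC, so it is better avoided here:
{code:java}
// Optional sanity check: the view over the empty table should report 0 rows.
// (Avoid COUNT(*) on tab6 - the plain JDBC source may fetch all rows for it.)
sqlContext.sql("SELECT COUNT(*) FROM tab1").show
{code}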
h3. Enable SQL Server Profiler tracing
 # In SSMS, open SQL Server Profiler (Tools -> SQL Server Profiler). Create a new trace against the "QueueSlots" database. Use the filtering options so that only queries issued against that database are shown (Events Selection tab -> check Show all events and Show all columns, then click Column Filters -> DatabaseName like QueueSlots).
 # Run the trace

h3. Run the query in Spark console
 # Run the following query
{code:java}
sqlContext.sql("""
SELECT t1.Id, t1.Name, t6.Name
  FROM tab1 as t1
  LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
""").show
{code}

The results are returned immediately as:
{code:java}
+---+----+----+
| Id|Name|Name|
+---+----+----+
+---+----+----+

[Stage 63:>                                                         (0 + 1) / 1]
{code}
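In addition to the Profiler trace, the physical plan of the same query can be inspected from the shell; if a JDBCRelation scan of dbo.Tab6 shows up there, Spark plans to read the big table even though tab1 is empty (the exact plan text varies between Spark versions):
{code:java}
// Print the parsed/analyzed/optimized/physical plans for the same query.
// Look for a JDBCRelation(dbo.Tab6) scan in the physical plan.
sqlContext.sql("""
SELECT t1.Id, t1.Name, t6.Name
  FROM tab1 as t1
  LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
""").explain(true)
{code}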
h3. {color:#00875a}Expected{color}

Since the results are returned immediately for the empty table, the other sources should not be queried.
h3. {color:#de350b}Actual:{color}

tab6 is still requested to return its data even though that data is never used, and fetching it is a CPU- and IO-intensive operation.

 

!image-2023-06-07-16-25-05-800.png!
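A possible workaround sketch until this is addressed (hypothetical, and only worthwhile when the small side is frequently empty): materialize the small side first and skip the join entirely when it has no rows. Dataset.isEmpty has been available since Spark 2.4; the fallback column name "Name6" below is chosen for illustration only:
{code:java}
// Hypothetical workaround, not a fix for the underlying issue:
// check the small side first and run the join only when it has rows.
val left = sqlContext.sql("SELECT Id, Name FROM tab1")
if (left.isEmpty) {
  // Produce an empty result with a matching extra column, never touching tab6.
  left.withColumn("Name6", org.apache.spark.sql.functions.lit(null).cast("string")).show
} else {
  sqlContext.sql("""
    SELECT t1.Id, t1.Name, t6.Name
      FROM tab1 AS t1
      LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
  """).show
}
{code}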



> Data is still fetched even though result was returned
> -----------------------------------------------------
>
>                 Key: SPARK-43999
>                 URL: https://issues.apache.org/jira/browse/SPARK-43999
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.2
>         Environment: Production
>            Reporter: Kamil Kliczbor
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
