Riccardo Delegà created SPARK-31370:
---------------------------------------

             Summary: AnalysisException when too many columns in join
                 Key: SPARK-31370
                 URL: https://issues.apache.org/jira/browse/SPARK-31370
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.4.4
            Reporter: Riccardo Delegà


Python version: '3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 21:48:41) [MSC v.1916 64 bit (AMD64)]'

I'm running into a strange problem when joining tables that have a large number of columns.

{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

number_of_columns = 50

spark = SparkSession.builder.getOrCreate()

base_df = spark.range(1,100)

secondary_id_df = base_df\
    .withColumn('id1', F.col('id'))\
    .withColumn('id2', F.col('id'))\
    .withColumn('id3', F.col('id'))\
    .withColumn('id4', F.col('id'))

sales_df = base_df
for i in range(1, number_of_columns):
    sales_df = sales_df.withColumn(f'kpi{i}', (F.rand()*100000).cast("int"))

sales_df.registerTempTable('sales')
secondary_id_df.registerTempTable('secondary_id')
{code}
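
(As an aside: registerTempTable has been deprecated since Spark 2.0; createOrReplaceTempView is the equivalent call and should behave the same here.)

{code:python}
# Non-deprecated equivalent of registerTempTable (Spark 2.0+)
sales_df.createOrReplaceTempView('sales')
secondary_id_df.createOrReplaceTempView('secondary_id')
{code}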

If I run the following query:
{code:python}
spark.sql("""
SELECT
    *
FROM
    secondary_id
    LEFT OUTER JOIN sales s0 ON secondary_id.id = s0.id
    LEFT OUTER JOIN sales s1 ON secondary_id.id1 = s1.id
    LEFT OUTER JOIN sales s2 ON secondary_id.id2 = s2.id
""")
{code}

I get the following exception:
{code}
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling o22.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve '`secondary_id.id2`' 
given input columns: [s0.kpi41, s0.kpi5, s1.kpi25, s1.kpi13, s2.kpi14, 
s2.kpi48, s2.kpi20, s1.id, s1.kpi12, s1.kpi43, s1.kpi32, s2.kpi8, s0.kpi1, 
s1.kpi28, s0.kpi28, s0.kpi18, s0.kpi16, secondary_id.id1, s2.kpi7, s0.kpi2, 
s0.kpi7, s2.kpi12, s1.kpi36, s0.kpi4, s2.kpi41, s1.kpi34, s0.kpi34, s2.kpi34, 
s2.kpi32, s1.kpi40, s1.kpi39, s0.kpi15, s1.kpi2, s0.kpi43, s0.kpi26, s1.kpi1, 
s1.kpi37, s2.kpi37, s1.kpi46, s1.kpi47, s1.kpi41, s1.kpi18, s1.kpi35, s2.kpi10, 
s2.kpi1, s2.kpi49, s1.kpi17, s2.kpi39, s0.kpi47, s0.kpi37, s0.kpi13, s2.kpi30, 
s1.kpi6, s1.kpi45, s2.kpi25, s1.kpi30, s2.kpi33, s2.kpi26, s2.kpi3, s2.kpi23, 
s0.kpi27, s0.kpi32, s0.kpi24, s1.kpi21, s1.kpi8, s1.kpi33, s2.kpi31, s2.kpi2, 
s2.kpi45, s0.kpi46, s0.kpi30, s2.kpi15, s0.kpi23, s0.kpi25, s2.kpi42, s0.kpi19, 
s2.kpi22, s0.kpi29, s1.kpi9, s2.kpi46, s1.kpi10, s1.kpi14, s1.kpi7, s0.kpi3, 
s0.kpi45, s2.kpi21, s2.kpi47, s2.kpi40, s0.kpi14, s2.kpi28, s1.kpi15, s1.kpi24, 
s2.kpi19, s2.kpi38, s1.kpi42, s2.kpi29, s0.kpi39, s0.kpi6, s0.kpi9, s1.kpi20, 
s2.kpi6, s1.kpi11, s0.kpi12, secondary_id.id4, s0.id, s0.kpi44, s0.kpi11, 
s0.kpi20, s2.kpi27, s0.kpi42, s0.kpi33, s2.kpi16, s1.kpi26, s2.kpi43, s0.kpi48, 
s0.kpi8, secondary_id.id2, s1.kpi49, s1.kpi4, s1.kpi48, s1.kpi31, s2.kpi35, 
s2.kpi17, s0.kpi10, s1.kpi16, s1.kpi22, s1.kpi29, s2.kpi11, s0.kpi49, s0.kpi40, 
s2.kpi5, s2.kpi4, s1.kpi3, s2.kpi13, secondary_id.id, s1.kpi27, s2.kpi9, 
s2.kpi36, s0.kpi21, s2.kpi44, s1.kpi44, s0.kpi35, s1.kpi19, s2.kpi18, s0.kpi36, 
s0.kpi17, s0.kpi38, secondary_id.id3, s0.kpi31, s1.kpi38, s0.kpi22, s2.kpi24, 
s1.kpi5, s1.kpi23]; line 8 pos 32;
{code}

even though the column "secondary_id.id2" is present in the list of input columns.
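
As a sanity check (not part of the repro above), the column does resolve fine when referenced on its own, so the schema itself looks correct:

{code:python}
# id2 is present in the view's schema...
print(secondary_id_df.columns)   # ['id', 'id1', 'id2', 'id3', 'id4']

# ...and resolves without error outside of the multi-join query
spark.sql("SELECT secondary_id.id2 FROM secondary_id").show(5)
{code}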

If I remove the last join, the query runs fine:
{code:python}
spark.sql("""
SELECT
    *
FROM
    secondary_id
    LEFT OUTER JOIN sales s0 ON secondary_id.id = s0.id
    LEFT OUTER JOIN sales s1 ON secondary_id.id1 = s1.id
""")
{code}
At the same time, if I reimplement the initial query using the DataFrame API, the code doesn't raise an exception:
{code:python}
secondary_id_df.alias('secondary_id')\
    .join(sales_df.alias('s0'), on=F.col('secondary_id.id') == F.col('s0.id'), how='left')\
    .join(sales_df.alias('s1'), on=F.col('secondary_id.id1') == F.col('s1.id'), how='left')\
    .join(sales_df.alias('s2'), on=F.col('secondary_id.id2') == F.col('s2.id'), how='left')
{code}
Raising the "number_of_columns" parameter to 150 makes this piece of code raise the same exception as well, so the bug appears to be tied to the number of columns involved in the join.
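
To narrow down where it starts failing, a small helper along these lines can rebuild the sales view at increasing widths and catch the failure (fails_with is a hypothetical name; it reuses base_df, F, spark, and the secondary_id view from the snippets above):

{code:python}
from pyspark.sql.utils import AnalysisException

def fails_with(n_columns):
    """Rebuild the 'sales' view with n_columns KPI columns and re-run the 3-join query."""
    df = base_df
    for i in range(1, n_columns):
        df = df.withColumn(f'kpi{i}', (F.rand() * 100000).cast('int'))
    df.createOrReplaceTempView('sales')
    try:
        # Analysis happens eagerly inside spark.sql(), so the error (if any)
        # surfaces here even without an action.
        spark.sql("""
            SELECT * FROM secondary_id
            LEFT OUTER JOIN sales s0 ON secondary_id.id = s0.id
            LEFT OUTER JOIN sales s1 ON secondary_id.id1 = s1.id
            LEFT OUTER JOIN sales s2 ON secondary_id.id2 = s2.id
        """)
        return False
    except AnalysisException:
        return True

# Scan a few widths to locate the smallest failing column count
for n in (10, 25, 50, 100):
    print(n, fails_with(n))
{code}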


