[ 
https://issues.apache.org/jira/browse/SPARK-47034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shurik mermelshtein updated SPARK-47034:
----------------------------------------
    Description: 
We create several temp tables (views) by loading several Delta tables and 
joining between them. 
Those views are used to calculate different metrics; each metric requires a 
different set of views. Some of the more popular views are cached for better 
performance. 

We have noticed that after upgrading from Spark 3.4.2 to Spark 3.5.0, some of 
the joins started to fail.

We can reproduce the problem with 2 data frames (views). These are not the 
real names / values we use; they are just for the example (a minimal sketch of 
this setup is shown right after the list).
 # users, with the columns user_id, campaign_id, user_name.
We make sure it has a single row:
'111111', '22222', 'Jhon Doe'
 # actions, with the columns user_id, campaign_id, action_id, action_count.
We make sure it has a single row:
'111111', '22222', 'clicks', 5
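Below is a minimal PySpark sketch of this setup. It is illustrative only: in 
our real pipeline the views come from Delta tables that are joined together, 
not from createDataFrame.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# users: single illustrative row
users = spark.createDataFrame(
    [("111111", "22222", "Jhon Doe")],
    ["user_id", "campaign_id", "user_name"],
)

# actions: single illustrative row
actions = spark.createDataFrame(
    [("111111", "22222", "clicks", 5)],
    ["user_id", "campaign_id", "action_id", "action_count"],
)

# register as temp views and cache them, as our pipeline does for popular views
users.createOrReplaceTempView("users")
actions.createOrReplaceTempView("actions")
spark.catalog.cacheTable("users")
spark.catalog.cacheTable("actions")
{code}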

 
 # The users view can be filtered by user_id = '111111' and/or campaign_id = 
'22222' and it finds the existing single row.
 # The actions view can be filtered by user_id = '111111' and/or campaign_id = 
'22222' and it finds the existing single row.
 # users and actions can be inner joined on user_id *OR* campaign_id and the 
join succeeds. 
 # users and actions can *not* be inner joined on user_id *AND* campaign_id. 
The join returns no rows (see the sketch after this list).
 # If we write both views to S3 and read them back into new data frames, the 
join suddenly works.
 # If we disable AQE, the join works.
 # Running checkpoint on the views does not make join #4 work.
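Below is a hedged PySpark sketch of the joins and workarounds above, 
continuing from the setup sketch. The S3 paths and the Parquet format are 
placeholders; in our real job the queries run as Spark SQL over the cached 
views.

{code:python}
# 3. inner join on a single key: returns the expected single row
users.join(actions, ["user_id"]).show()

# 4. inner join on both keys: on Spark 3.5.0 with AQE enabled this returns
#    no rows, even though both views contain the matching entry
users.join(actions, ["user_id", "campaign_id"]).show()

# 5. round-tripping the views through storage makes join #4 work again
#    (bucket/path and format are placeholders)
users.write.mode("overwrite").parquet("s3://some-bucket/tmp/users")
actions.write.mode("overwrite").parquet("s3://some-bucket/tmp/actions")
users2 = spark.read.parquet("s3://some-bucket/tmp/users")
actions2 = spark.read.parquet("s3://some-bucket/tmp/actions")
users2.join(actions2, ["user_id", "campaign_id"]).show()

# 6. disabling AQE (set before the query runs) also makes join #4 work
spark.conf.set("spark.sql.adaptive.enabled", "false")

# 7. checkpointing the views does not help; the join on the checkpointed
#    data frames still returns no rows
spark.sparkContext.setCheckpointDir("s3://some-bucket/tmp/checkpoints")
users_cp = users.checkpoint()
actions_cp = actions.checkpoint()
users_cp.join(actions_cp, ["user_id", "campaign_id"]).show()
{code}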


> join between cached temp tables result in missing entries
> ---------------------------------------------------------
>
>                 Key: SPARK-47034
>                 URL: https://issues.apache.org/jira/browse/SPARK-47034
>             Project: Spark
>          Issue Type: Bug
>          Components: Examples
>    Affects Versions: 3.5.0
>            Reporter: shurik mermelshtein
>            Priority: Major
>


