[ https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463616#comment-16463616 ]

Marco Gaido commented on SPARK-24177:
-------------------------------------

[~ajay_monga] could you please try with a newer Spark version? 1.6 is pretty old 
and not maintained anymore...

> Spark returning inconsistent rows and data in a join query when run using 
> Spark SQL (using SQLContext.sql(...))
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24177
>                 URL: https://issues.apache.org/jira/browse/SPARK-24177
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: Production
>            Reporter: Ajay Monga
>            Priority: Major
>
> Spark SQL is returning inconsistent results for a JOIN query. Across runs it 
> returns different rows, and the column computed by a simple multiplication 
> returns different values.
> The query is like:
> SELECT
>   second_table.date_value, SUM(XXX * second_table.shift_value)
> FROM
> (
>   SELECT
>     date_value, SUM(value) AS XXX
>   FROM first_table
>   WHERE date IN ( '2018-01-01', '2018-01-02' )
>   GROUP BY date_value
> ) intermediate
> LEFT OUTER JOIN second_table
>   ON second_table.date_value = (<logic to shift the 'date_value' from the 
>      first table, say if it's a Saturday or Sunday then use Monday, else the 
>      next valid working date>)
>   AND second_table.date_value IN ( '2018-01-02', '2018-01-03' )
> GROUP BY second_table.date_value
> 
> The suspicion is that the execution of the above query is split into two 
> queries, one for first_table and one for second_table, before the join. The 
> intermediate results are then split across partitions, seemingly 
> grouped/distributed by the join column, which is 'date_value'. The join 
> contains date-shift logic that fails to match in some cases when it should, 
> primarily for the date_values at the edge of the partitions distributed 
> across the executors. So the result depends on how the data (or the RDD) of 
> the individual queries is partitioned in the first place, which is not ideal: 
> a normal-looking ANSI-standard SQL query should behave consistently.
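
For anyone trying to reproduce this, below is a minimal, hypothetical sketch of
how the reported query might be driven through SQLContext.sql on 1.6. The table
and column names are taken from the description above, and the CASE expression
is only a stand-in for the elided weekend/working-day shift logic:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Spark24177Repro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-24177 repro"))
    val sqlContext = new SQLContext(sc)

    // first_table and second_table are assumed to already be registered as
    // temporary tables; the CASE below is a guessed stand-in for the
    // reporter's date-shift logic (Sat/Sun -> following Monday).
    val result = sqlContext.sql(
      """
        |SELECT second_table.date_value, SUM(XXX * second_table.shift_value)
        |FROM (
        |  SELECT date_value, SUM(value) AS XXX
        |  FROM first_table
        |  WHERE date IN ('2018-01-01', '2018-01-02')
        |  GROUP BY date_value
        |) intermediate
        |LEFT OUTER JOIN second_table
        |  ON second_table.date_value =
        |     CASE date_format(intermediate.date_value, 'E')
        |       WHEN 'Sat' THEN date_add(intermediate.date_value, 2)
        |       WHEN 'Sun' THEN date_add(intermediate.date_value, 1)
        |       ELSE intermediate.date_value
        |     END
        |  AND second_table.date_value IN ('2018-01-02', '2018-01-03')
        |GROUP BY second_table.date_value
      """.stripMargin)

    // Repeated runs of the same query should return identical rows; the
    // report says they do not on 1.6.0.
    result.collect().foreach(println)
  }
}
{code}

If the inconsistency is tied to the join condition, expressing the shift as a
deterministic SQL expression like the CASE above (rather than opaque driver-side
logic) also makes it easier to compare plans and results across versions.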



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
