[jira] [Commented] (SPARK-24177) Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))

Ajay Monga (JIRA) Mon, 07 May 2018 03:12:29 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465729#comment-16465729
 ]


Ajay Monga commented on SPARK-24177:
------------------------------------

Thanks Marco. We have a few systems running on the latest version of Spark but 
the system that behaved erratic is still on 1.6. We are planning to move it to 
a later version, possibly to 2.2 but I would appreciate if someone can confirm 
my understanding.

> Spark returning inconsistent rows and data in a join query when run using 
> Spark SQL (using SQLContext.sql(...))
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24177
>                 URL: https://issues.apache.org/jira/browse/SPARK-24177
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: Production
>            Reporter: Ajay Monga
>            Priority: Major
>
> Spark SQL is returning inconsistent result for a JOIN query. It returns 
> different rows and the value of the column on which a simple multiplication 
> takes place returns different values:
> The query is like:
> SELECT
>  second_table.date_value, SUM(XXX * second_table.shift_value)
>  FROM
>  (
>  SELECT
>  date_value, SUM(value) as XXX
>  FROM first_table
>  WHERE
>  AND date IN ( '2018-01-01', '2018-01-02' )
>  GROUP BY date_value
>  )
>  intermediate LEFT OUTER
>  JOIN second_table ON second_table.date_value = (<Logic to change the 
> 'date_value' from first table, say if it's a Saturday or Sunday then use 
> Monday, else next valid working date>)
>  AND second_table.date_value IN (
>  '2018-01-02',
>  '2018-01-03'
>  )
>  GROUP BY second_table.date_value
>  
> Suspicion is that, the execution of above query is split into two queries - 
> one for first_table and other for second_table before joining. Then the 
> results get split across partitions, seemingly grouped/distributed by the 
> join column, which is 'date_value'. In the join there is a date shift logic 
> that fails to join in some cases when it should, primarily for the 
> date_values at the edge of the partitions distributed across the executors. 
> So, the execution is dependent on how the data (or the rdd) of the individual 
> queries is partitioned in the first place, which is not ideal as a normal 
> looking ANSI standard SQL query is not behaving consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24177) Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))

Reply via email to