[ https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465729#comment-16465729 ]
Ajay Monga commented on SPARK-24177: ------------------------------------ Thanks Marco. We have a few systems running on the latest version of Spark but the system that behaved erratic is still on 1.6. We are planning to move it to a later version, possibly to 2.2 but I would appreciate if someone can confirm my understanding. > Spark returning inconsistent rows and data in a join query when run using > Spark SQL (using SQLContext.sql(...)) > --------------------------------------------------------------------------------------------------------------- > > Key: SPARK-24177 > URL: https://issues.apache.org/jira/browse/SPARK-24177 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.6.0 > Environment: Production > Reporter: Ajay Monga > Priority: Major > > Spark SQL is returning inconsistent result for a JOIN query. It returns > different rows and the value of the column on which a simple multiplication > takes place returns different values: > The query is like: > SELECT > second_table.date_value, SUM(XXX * second_table.shift_value) > FROM > ( > SELECT > date_value, SUM(value) as XXX > FROM first_table > WHERE > AND date IN ( '2018-01-01', '2018-01-02' ) > GROUP BY date_value > ) > intermediate LEFT OUTER > JOIN second_table ON second_table.date_value = (<Logic to change the > 'date_value' from first table, say if it's a Saturday or Sunday then use > Monday, else next valid working date>) > AND second_table.date_value IN ( > '2018-01-02', > '2018-01-03' > ) > GROUP BY second_table.date_value > > Suspicion is that, the execution of above query is split into two queries - > one for first_table and other for second_table before joining. Then the > results get split across partitions, seemingly grouped/distributed by the > join column, which is 'date_value'. In the join there is a date shift logic > that fails to join in some cases when it should, primarily for the > date_values at the edge of the partitions distributed across the executors. > So, the execution is dependent on how the data (or the rdd) of the individual > queries is partitioned in the first place, which is not ideal as a normal > looking ANSI standard SQL query is not behaving consistently. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org