[ https://issues.apache.org/jira/browse/SPARK-35667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
yuanxm updated SPARK-35667:
---------------------------
    Description: 
The following SQL sometimes returns incorrect results when spark.speculation is true:
{code:java}
SELECT count(1)
FROM
  (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
   FROM
     (SELECT dt
      FROM test_table)tmpa1)tmpa2{code}
With spark.speculation=true, the count is lower than the correct value. Incorrect results become more likely as the number of speculative tasks increases.

`test.py`:
{code:java}
import sys
for line in sys.stdin:
    line = line.strip()
    arr = line.split()
    print "\t".join(arr){code}

spark-sql command:
{code:java}
./bin/spark-sql --master yarn \
  --conf spark.speculation=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.executorIdleTimeout=5s \
  --conf spark.dynamicAllocation.initialExecutor=1 \
  --conf spark.dynamicAllocation.maxExecutors=40
{code}

> spark.speculation causes incorrect query results with TRANSFORM
> ---------------------------------------------------------------
>
>                 Key: SPARK-35667
>                 URL: https://issues.apache.org/jira/browse/SPARK-35667
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.8
>            Reporter: yuanxm
>            Priority: Major
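
For readers reproducing this, the sketch below restates what `test.py` does as a function. It is an illustration, not part of the original report. `test.py` in the description uses the Python 2 `print` statement; this version is written for Python 3, and the name `transform` and the sample input are assumptions made for the example. It emits exactly one output line per input line, which is why `count(1)` over the TRANSFORM output would be expected to equal the number of rows read from `test_table` (assuming `dt` values contain no embedded newlines).

```python
import io


def transform(stdin, stdout):
    """Mirror test.py: write one tab-joined line for every input line.

    Returns the number of lines written (hypothetical helper, added so
    the one-in/one-out behaviour can be checked).
    """
    emitted = 0
    for line in stdin:
        # Same steps as test.py: strip, split on whitespace, join with tabs.
        fields = line.strip().split()
        stdout.write("\t".join(fields) + "\n")
        emitted += 1
    return emitted


# Demo on in-memory streams; the sample values are made up.
demo_in = io.StringIO("2021-06-01\n2021-06-02  x\n")
demo_out = io.StringIO()
rows = transform(demo_in, demo_out)
print(rows)  # 2
```

To use it as the TRANSFORM script itself, the function would be called as `transform(sys.stdin, sys.stdout)` after `import sys`.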