Michail Giannakopoulos created SPARK-31059:
----------------------------------------------

             Summary: Spark's SQL "group by" local processing operator is broken.
                 Key: SPARK-31059
                 URL: https://issues.apache.org/jira/browse/SPARK-31059
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.5, 2.4.3
         Environment: Windows 10.
            Reporter: Michail Giannakopoulos


When applying the "GROUP BY" processing operator (without an "ORDER BY" clause), I expect all the grouping columns to be grouped together into the same buckets. However, this is not the case.

Steps to reproduce:

1. Start spark-shell as follows:
   bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory

2. Load the attached csv file:
   val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")

3. Create a temp view:
   gosales.createOrReplaceTempView("gosales")

4. Execute the following sql statement:
   spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show()

Output:
+--------------------+-----------------+----------------------------+
|        Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+--------------------+-----------------+----------------------------+
|      Golf Equipment|           E-mail|                       92.25|
|   Camping Equipment|             Mail|                         0.0|
|   Camping Equipment|              Fax|                        null|
|      Golf Equipment|        Telephone|                       123.0|
|   Camping Equipment|          Special|                        null|
|  Outdoor Protection|        Telephone|                    34218.19|
|Mountaineering Eq...|             Mail|                         0.0|
|   Camping Equipment|              Web|                    32469.03|
|Personal Accessories|              Fax|                      3318.7|
|      Golf Equipment|      Sales visit|                       143.5|
|Mountaineering Eq...|        Telephone|                        null|
|Mountaineering Eq...|           E-mail|                        null|
|  Outdoor Protection|      Sales visit|                    20522.42|
|  Outdoor Protection|              Fax|                     5857.54|
|Personal Accessories|           E-mail|          26679.640000000003|
|Mountaineering Eq...|              Fax|                        null|
|  Outdoor Protection|              Web|          340836.85000000003|
|      Golf Equipment|          Special|                         0.0|
|  Outdoor Protection|           E-mail|                    28505.93|
|      Golf Equipment|              Web|                      3034.0|
+--------------------+-----------------+----------------------------+

Expected output:
+--------------------+-----------------+----------------------------+
|        Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+--------------------+-----------------+----------------------------+
|      Golf Equipment|           E-mail|                       92.25|
|      Golf Equipment|              Fax|                        null|
|      Golf Equipment|             Mail|                         0.0|
|      Golf Equipment|      Sales visit|                       143.5|
|      Golf Equipment|          Special|                         0.0|
|      Golf Equipment|        Telephone|                       123.0|
|      Golf Equipment|              Web|                      3034.0|
|   Camping Equipment|           E-mail|          1303.3999999999999|
|   Camping Equipment|              Fax|                        null|
|   Camping Equipment|      Sales visit|                     4754.87|
|   Camping Equipment|             Mail|                         0.0|
|   Camping Equipment|          Special|                        null|
|   Camping Equipment|        Telephone|                     5169.65|
|   Camping Equipment|              Web|                    32469.03|
|Mountaineering Eq...|           E-mail|                        null|
|Mountaineering Eq...|              Fax|                        null|
|Mountaineering Eq...|             Mail|                         0.0|
|Mountaineering Eq...|          Special|                        null|
|Mountaineering Eq...|      Sales visit|                        null|
|Mountaineering Eq...|        Telephone|                        null|
+--------------------+-----------------+----------------------------+

Notice how all the grouping columns should be bucketed together, without necessarily being in order.
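
For comparison, the statement from step 4 with an explicit "ORDER BY" clause can be sketched as below. This is only an illustration, not part of the original reproduction: it assumes the same spark-shell session and the `gosales` temp view from step 3, and the ORDER BY columns are my own choice. Since SQL's GROUP BY by itself does not promise any row order, sorting on the grouping columns is the usual way to get rows for the same product line next to each other, although they would then appear in sorted order rather than in the order shown in the expected output above.

```scala
// Sketch only: run inside spark-shell, where `spark` is predefined.
// Assumes the `gosales` temp view created in step 3 already exists.
val grouped = spark.sql(
  """SELECT `Product line`, `Order method type`, sum(`Revenue`)
    |FROM `gosales`
    |GROUP BY `Product line`, `Order method type`
    |ORDER BY `Product line`, `Order method type`""".stripMargin)

// show() prints the first 20 rows by default; the explicit sort keeps
// rows with the same `Product line` adjacent in the printed table.
grouped.show()
```

Comparing this output with the unordered result in step 4 should make it easier to tell whether any group is actually split or missing, as opposed to merely printed in a different position.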