Michail Giannakopoulos created SPARK-31059:
----------------------------------------------

             Summary: Spark's SQL "group by" local processing operator is broken.
                 Key: SPARK-31059
                 URL: https://issues.apache.org/jira/browse/SPARK-31059
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.5, 2.4.3
         Environment: Windows 10.
            Reporter: Michail Giannakopoulos


When applying the "GROUP BY" operator (without an "ORDER BY" clause), I 
expect rows that share the same grouping column values to appear together in 
the output, bucket by bucket. However, this is not the case.

Steps to reproduce:
1. Start spark-shell as follows:
bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory
2. Load the attached csv file:
val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
3. Create a temp view:
gosales.createOrReplaceTempView("gosales")
4. Execute the following sql statement:
spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
+--------------------+-----------------+----------------------------+
|        Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+--------------------+-----------------+----------------------------+
|      Golf Equipment|           E-mail|                       92.25|
|   Camping Equipment|             Mail|                         0.0|
|   Camping Equipment|              Fax|                        null|
|      Golf Equipment|        Telephone|                       123.0|
|   Camping Equipment|          Special|                        null|
|  Outdoor Protection|        Telephone|                    34218.19|
|Mountaineering Eq...|             Mail|                         0.0|
|   Camping Equipment|              Web|                    32469.03|
|Personal Accessories|              Fax|                      3318.7|
|      Golf Equipment|      Sales visit|                       143.5|
|Mountaineering Eq...|        Telephone|                        null|
|Mountaineering Eq...|           E-mail|                        null|
|  Outdoor Protection|      Sales visit|                    20522.42|
|  Outdoor Protection|              Fax|                     5857.54|
|Personal Accessories|           E-mail|          26679.640000000003|
|Mountaineering Eq...|              Fax|                        null|
|  Outdoor Protection|              Web|          340836.85000000003|
|      Golf Equipment|          Special|                         0.0|
|  Outdoor Protection|           E-mail|                    28505.93|
|      Golf Equipment|              Web|                      3034.0|
+--------------------+-----------------+----------------------------+

Expected output:
+--------------------+-----------------+----------------------------+
|        Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+--------------------+-----------------+----------------------------+
|      Golf Equipment|           E-mail|                       92.25|
|      Golf Equipment|              Fax|                        null|
|      Golf Equipment|             Mail|                         0.0|
|      Golf Equipment|      Sales visit|                       143.5|
|      Golf Equipment|          Special|                         0.0|
|      Golf Equipment|        Telephone|                       123.0|
|      Golf Equipment|              Web|                      3034.0|
|   Camping Equipment|           E-mail|          1303.3999999999999|
|   Camping Equipment|              Fax|                        null|
|   Camping Equipment|      Sales visit|                     4754.87|
|   Camping Equipment|             Mail|                         0.0|
|   Camping Equipment|          Special|                        null|
|   Camping Equipment|        Telephone|                     5169.65|
|   Camping Equipment|              Web|                    32469.03|
|Mountaineering Eq...|           E-mail|                        null|
|Mountaineering Eq...|              Fax|                        null|
|Mountaineering Eq...|             Mail|                         0.0|
|Mountaineering Eq...|          Special|                        null|
|Mountaineering Eq...|      Sales visit|                        null|
|Mountaineering Eq...|        Telephone|                        null|
+--------------------+-----------------+----------------------------+

Notice that rows with the same grouping column values should be bucketed 
together in the output, even though the buckets themselves need not be in any 
particular order.
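
To help separate grouping correctness from display order, here is a minimal 
sketch. It assumes a handful of hypothetical rows in place of 
SampleFile_GOSales.csv (the values are made up for illustration), and the 
object name, the "revenue" alias and the app name are likewise assumptions. 
It first checks that each (`Product line`, `Order method type`) pair comes 
back exactly once, then runs the same query with an explicit ORDER BY, since 
as far as I know SQL only guarantees row order when ORDER BY is present.

```scala
import org.apache.spark.sql.SparkSession

object GroupByOrderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("group-by-order-sketch") // hypothetical app name
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows standing in for SampleFile_GOSales.csv.
    val gosales = Seq(
      ("Golf Equipment", "Web", 10.0),
      ("Camping Equipment", "Fax", 20.0),
      ("Golf Equipment", "E-mail", 5.0),
      ("Camping Equipment", "Web", 7.5),
      ("Golf Equipment", "Web", 2.5)
    ).toDF("Product line", "Order method type", "Revenue")
    gosales.createOrReplaceTempView("gosales")

    val groupBy =
      """SELECT `Product line`, `Order method type`, sum(`Revenue`) AS revenue
        |FROM gosales
        |GROUP BY `Product line`, `Order method type`""".stripMargin

    // Grouping check: every key pair should appear exactly once,
    // regardless of the order in which the rows are returned.
    val keys = spark.sql(groupBy).collect()
      .map(r => (r.getString(0), r.getString(1)))
    assert(keys.distinct.length == keys.length)

    // Same query with an explicit ORDER BY on the grouping columns,
    // so rows of one product line are returned next to each other.
    val ordered = spark.sql(
      groupBy + "\nORDER BY `Product line`, `Order method type`")
    val orderedKeys = ordered.collect()
      .map(r => (r.getString(0), r.getString(1)))
    assert(orderedKeys.sameElements(orderedKeys.sorted))

    ordered.show()
    spark.stop()
  }
}
```

If the first assertion holds on the real file as well, the aggregation itself 
is grouping correctly and only the row order of show() differs from what I 
expected.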



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
