Chris Kipers created SPARK-20356:
------------------------------------

             Summary: Spark SQL groupBy returns incorrect results after join + distinct transformations
                 Key: SPARK-20356
                 URL: https://issues.apache.org/jira/browse/SPARK-20356
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
         Environment: Linux mint 18
Python 3.5
            Reporter: Chris Kipers


I'm experiencing a bug with the head version of Spark as of 4/17/2017. After 
joining two dataframes, renaming a column, and invoking distinct, the result of 
the aggregation is incorrect once the dataframe is cached. The following code 
snippet consistently reproduces the error:

from pyspark.sql import SparkSession
import pyspark.sql.functions as sf
import pandas as pd

spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()

mapping_sdf = spark.createDataFrame(pd.DataFrame([
    {"ITEM": "a", "GROUP": 1},
    {"ITEM": "b", "GROUP": 1},
    {"ITEM": "c", "GROUP": 2}
]))

items_sdf = spark.createDataFrame(pd.DataFrame([
    {"ITEM": "a", "ID": 1},
    {"ITEM": "b", "ID": 2},
    {"ITEM": "c", "ID": 3}
]))

mapped_sdf = items_sdf.join(mapping_sdf, on='ITEM') \
    .select("ID", sf.col("GROUP").alias('ITEM')) \
    .distinct()

print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
mapped_sdf.cache()
print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 3, incorrect
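To make the discrepancy easier to inspect, the same pipeline can be rebuilt 
under a fresh name and the grouped rows diffed in the driver before and after 
caching. This is just a diagnostic sketch on top of the reproduction above 
(mapped_sdf2 is a hypothetical name, not part of the original repro):

# Rebuild the failing pipeline and diff the grouped rows themselves,
# so the spurious group is visible rather than just the counts.
mapped_sdf2 = items_sdf.join(mapping_sdf, on='ITEM') \
    .select("ID", sf.col("GROUP").alias('ITEM')) \
    .distinct()
before = mapped_sdf2.groupBy("ITEM").count().collect()
mapped_sdf2.cache()
after = mapped_sdf2.groupBy("ITEM").count().collect()
print(set(after) - set(before))  # rows that only appear once cached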

The next code snippet is almost identical to the first, except that I don't 
call distinct on the dataframe. This snippet performs as expected:

mapped_sdf = items_sdf.join(mapping_sdf, on='ITEM') \
    .select("ID", sf.col("GROUP").alias('ITEM'))

print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
mapped_sdf.cache()
print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
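For triage, the physical plans of the two variants can be compared with 
DataFrame.explain(); after cache() the failing variant's plan should contain 
an InMemoryTableScan node, and comparing the aggregation above it between the 
distinct and non-distinct pipelines may show where the wrong cached relation 
is picked up. A minimal sketch:

# Print the physical plan of the aggregation over the cached dataframe;
# run the same call against the first (distinct) snippet to compare.
mapped_sdf.groupBy("ITEM").count().explain()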

I don't experience this bug with Spark 2.1 or even with earlier 2.2 builds.


