[ https://issues.apache.org/jira/browse/SPARK-20356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973936#comment-15973936 ]
Liang-Chi Hsieh commented on SPARK-20356:
-----------------------------------------

[~dkbiswal] Yeah, right. Thanks. We need to force the partitioning to reproduce this.

> Spark SQL group by returns incorrect results after join + distinct transformations
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-20356
>                 URL: https://issues.apache.org/jira/browse/SPARK-20356
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>       Environment: Linux Mint 18
>                    Python 3.5
>            Reporter: Chris Kipers
>
> I'm experiencing a bug with the head version of Spark as of 4/17/2017. After joining two dataframes, renaming a column, and invoking distinct, the results of the aggregation are incorrect after caching the dataframe. The following code snippet consistently reproduces the error:
>
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as sf
> import pandas as pd
>
> spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
>
> mapping_sdf = spark.createDataFrame(pd.DataFrame([
>     {"ITEM": "a", "GROUP": 1},
>     {"ITEM": "b", "GROUP": 1},
>     {"ITEM": "c", "GROUP": 2}
> ]))
>
> items_sdf = spark.createDataFrame(pd.DataFrame([
>     {"ITEM": "a", "ID": 1},
>     {"ITEM": "b", "ID": 2},
>     {"ITEM": "c", "ID": 3}
> ]))
>
> mapped_sdf = \
>     items_sdf.join(mapping_sdf, on='ITEM').select("ID", sf.col("GROUP").alias('ITEM')).distinct()
>
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> mapped_sdf.cache()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 3, incorrect
>
> The next code snippet is almost the same as the first, except I don't call distinct on the dataframe. This snippet performs as expected:
>
> mapped_sdf = \
>     items_sdf.join(mapping_sdf, on='ITEM').select("ID", sf.col("GROUP").alias('ITEM'))
>
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> mapped_sdf.cache()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
>
> I don't experience this bug with Spark 2.1 or even earlier versions of 2.2.
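For reference, a minimal sketch of what "forcing the partitioning" could look like on the reporter's repro: an explicit repartition on the aliased column before distinct(). The repartition(4, "ITEM") call and the partition count are illustrative assumptions, not the confirmed trigger:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as sf
    import pandas as pd

    spark = SparkSession.builder.master("local").appName("Repro").getOrCreate()

    mapping_sdf = spark.createDataFrame(pd.DataFrame([
        {"ITEM": "a", "GROUP": 1},
        {"ITEM": "b", "GROUP": 1},
        {"ITEM": "c", "GROUP": 2}
    ]))
    items_sdf = spark.createDataFrame(pd.DataFrame([
        {"ITEM": "a", "ID": 1},
        {"ITEM": "b", "ID": 2},
        {"ITEM": "c", "ID": 3}
    ]))

    # Assumption: an explicit repartition on the aliased column stands in for
    # "forcing the partitioning"; the exact trigger may differ.
    mapped_sdf = (items_sdf.join(mapping_sdf, on='ITEM')
                  .select("ID", sf.col("GROUP").alias('ITEM'))
                  .repartition(4, "ITEM")
                  .distinct())

    print(mapped_sdf.groupBy("ITEM").count().count())  # expected: 2
    mapped_sdf.cache()
    print(mapped_sdf.groupBy("ITEM").count().count())  # 3 here would reproduce the bug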