[jira] [Commented] (SPARK-42397) Inconsistent data produced by `FlatMapCoGroupsInPandas`
[ https://issues.apache.org/jira/browse/SPARK-42397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687794#comment-17687794 ]

Ted Chester Jenks commented on SPARK-42397:
-------------------------------------------

Is it ever expected for df.show() and df.collect() to give different results? That is what struck me as odd in this case and yes those two give different values.

> Inconsistent data produced by `FlatMapCoGroupsInPandas`
> -------------------------------------------------------
>
>                 Key: SPARK-42397
>                 URL: https://issues.apache.org/jira/browse/SPARK-42397
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark, SQL
>    Affects Versions: 3.3.0, 3.3.1
>            Reporter: Ted Chester Jenks
>            Priority: Minor
>
> We are seeing inconsistent data returned when using
> `FlatMapCoGroupsInPandas`. In the PySpark example from the comments, when we
> call `grouped_df.collect()` we get:
>
> {{[Row(left_colms="Index(['cluster', 'event', 'abc'], dtype='object')",
> right_colms="Index(['cluster', 'event', 'def'], dtype='object')")] }}
>
> When we call `grouped_df.show(5, truncate=False)` we get:
>
> {{[Row(left_colms="Index(['cluster', 'abc'], dtype='object')",
> right_colms="Index(['cluster', 'event', 'def'], dtype='object')",
> xyz='1234')] }}
>
> When we call `grouped_df_1.collect()` we get:
>
> {{[Row(left_colms="Index(['cluster', 'abc'], dtype='object')",
> right_colms="Index(['cluster', 'event', 'def'], dtype='object')",
> xyz='1234')] }}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
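The difference between the quoted outputs is easier to see when the column lists are compared directly. The following is a minimal sketch in plain Python, not part of the original report: it parses two of the `left_colms` strings quoted above and lists which columns are missing from the second. The helper names `parse_index_repr` and `missing_columns` are hypothetical and introduced only for illustration.

```python
import ast
import re

# Strings copied from the outputs quoted in the issue description.
COLLECT_LEFT = "Index(['cluster', 'event', 'abc'], dtype='object')"
WITH_XYZ_LEFT = "Index(['cluster', 'abc'], dtype='object')"


def parse_index_repr(text):
    """Extract the column names from a pandas Index repr string."""
    match = re.search(r"Index\((\[.*?\])", text)
    if match is None:
        raise ValueError(f"not an Index repr: {text!r}")
    # The bracketed part is a valid Python list literal of strings.
    return ast.literal_eval(match.group(1))


def missing_columns(expected, actual):
    """Columns present in `expected` but absent from `actual`, in order."""
    actual_set = set(actual)
    return [name for name in expected if name not in actual_set]


expected_cols = parse_index_repr(COLLECT_LEFT)
actual_cols = parse_index_repr(WITH_XYZ_LEFT)
dropped = missing_columns(expected_cols, actual_cols)
print(dropped)  # ['event']
```

On the quoted outputs this reports `event` as the only column that the left-hand group loses, which is a difference in the data handed to the Python function rather than a difference in row order.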
[jira] [Commented] (SPARK-42397) Inconsistent data produced by `FlatMapCoGroupsInPandas`
[ https://issues.apache.org/jira/browse/SPARK-42397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687759#comment-17687759 ]

Hyukjin Kwon commented on SPARK-42397:
--------------------------------------

It's probably related to the row order, which Spark doesn't guarantee. Is the actual value different?
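One way to separate an ordering difference from a difference in values is to compare two results as multisets of rows. The sketch below is only an illustration in plain Python: the tuples are hypothetical stand-ins for collected rows (PySpark `Row` objects are tuples, so the same comparison should apply to them), and `same_rows_ignoring_order` is a name made up for this example.

```python
from collections import Counter


def same_rows_ignoring_order(rows_a, rows_b):
    """Return True if both row lists hold the same rows with the same counts."""
    return Counter(map(tuple, rows_a)) == Counter(map(tuple, rows_b))


# Hypothetical stand-ins for the results of two actions on the same DataFrame.
first = [("1", "abc"), ("2", "def"), ("1", "abc")]
second = [("2", "def"), ("1", "abc"), ("1", "abc")]  # same rows, different order
third = [("2", "def"), ("1", "abc")]                 # one duplicate row missing

print(same_rows_ignoring_order(first, second))  # True
print(same_rows_ignoring_order(first, third))   # False
```

If such a comparison fails, the two results differ in content and not only in the order in which rows were returned.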
[jira] [Commented] (SPARK-42397) Inconsistent data produced by `FlatMapCoGroupsInPandas`
[ https://issues.apache.org/jira/browse/SPARK-42397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687011#comment-17687011 ]

Ted Chester Jenks commented on SPARK-42397:
-------------------------------------------

{{import pandas as pd}}
{{from pyspark.sql import functions as F}}

{{# Assumes an active SparkSession named `spark`.}}
{{test_df = spark.createDataFrame(}}
{{    [}}
{{        ["1", "23", "abc", "blah", "def", "1"],}}
{{        ["1", "23", "abc", "blah", "def", "1"],}}
{{        ["1", "23", "abc", "blah", "def", "2"],}}
{{        ["1", "23", "abc", "blah", "def", "2"],}}
{{    ],}}
{{    ["cluster", "partition", "event", "abc", "def", "one_or_two"]}}
{{)}}

{{# Left side: rows tagged "1", keeping cluster, event and abc.}}
{{df1 = test_df.filter(}}
{{    F.col("one_or_two") == "1"}}
{{).select(}}
{{    "cluster", "event", "abc"}}
{{)}}

{{# Right side: rows tagged "2", keeping cluster, event and def.}}
{{df2 = test_df.filter(}}
{{    F.col("one_or_two") == "2"}}
{{).select(}}
{{    "cluster", "event", "def"}}
{{)}}

{{# Report the columns that each pandas group actually receives.}}
{{def get_schema(l, r):}}
{{    return pd.DataFrame(}}
{{        [(str(l.columns), str(r.columns))],}}
{{        columns=["left_colms", "right_colms"]}}
{{    )}}

{{grouped_df = df1.groupBy("cluster").cogroup(df2.groupBy("cluster")).applyInPandas(}}
{{    get_schema, "left_colms string, right_colms string"}}
{{)}}

{{# Same result with one extra literal column appended.}}
{{grouped_df_1 = grouped_df.withColumn(}}
{{    "xyz", F.lit("1234")}}
{{)}}