[jira] [Commented] (SPARK-42397) Inconsistent data produced by `FlatMapCoGroupsInPandas`

2023-02-13 Thread Ted Chester Jenks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687794#comment-17687794
 ] 

Ted Chester Jenks commented on SPARK-42397:
---

Is it ever expected for df.show() and df.collect() to give different results? 
That is what struck me as odd in this case, and yes, those two give different 
values.
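
To make the difference concrete, here is a minimal sketch in plain Python that compares the two `left_colms` strings quoted in the issue description. The helper name `parse_index_repr` is hypothetical and only parses the printed `Index([...])` text; it does not call Spark or pandas.

```python
import ast


def parse_index_repr(text):
    """Extract column names from text like "Index(['a', 'b'], dtype='object')".

    Hypothetical helper: it only slices out the bracketed list and parses it.
    """
    start = text.index("[")
    end = text.index("]", start) + 1
    return ast.literal_eval(text[start:end])


# Strings copied from the issue description.
from_collect = "Index(['cluster', 'event', 'abc'], dtype='object')"
from_show = "Index(['cluster', 'abc'], dtype='object')"

# Columns the function saw under collect() but not under show().
missing = [
    name
    for name in parse_index_repr(from_collect)
    if name not in parse_index_repr(from_show)
]
print(missing)  # ['event']
```

The only column that differs is `event`, on the left side of the cogroup; the right side reports the same columns in both outputs.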

> Inconsistent data produced by `FlatMapCoGroupsInPandas`
> ---
>
> Key: SPARK-42397
> URL: https://issues.apache.org/jira/browse/SPARK-42397
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark, SQL
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Ted Chester Jenks
>Priority: Minor
>
> We are seeing inconsistent data returned when using 
> `FlatMapCoGroupsInPandas`. In the PySpark example from the comments, when we 
> call `grouped_df.collect()` we get:
>  
> {{[Row(left_colms="Index(['cluster', 'event', 'abc'], dtype='object')", 
> right_colms="Index(['cluster', 'event', 'def'], dtype='object')")] }}
>  
> When we call `grouped_df.show(5, truncate=False)` we get:
>  
> {{[Row(left_colms="Index(['cluster', 'abc'], dtype='object')", 
> right_colms="Index(['cluster', 'event', 'def'], dtype='object')", 
> xyz='1234')] }}
>  
> When we call `grouped_df_1.collect()` we get:
>  
> {{[Row(left_colms="Index(['cluster', 'abc'], dtype='object')", 
> right_colms="Index(['cluster', 'event', 'def'], dtype='object')", 
> xyz='1234')] }}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42397) Inconsistent data produced by `FlatMapCoGroupsInPandas`

2023-02-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687759#comment-17687759
 ] 

Hyukjin Kwon commented on SPARK-42397:
--

It's probably related to the order, which Spark doesn't guarantee. Is the 
actual value different?
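
One way to answer that question is to compare the two results as multisets, so that ordering is ignored and only the values count. The sketch below is an assumption-laden illustration, not something from the ticket: `compare_rows` is a hypothetical helper, it assumes the concern is row order, and plain tuples stand in for PySpark `Row` objects (which are tuple subclasses).

```python
from collections import Counter


def compare_rows(first, second):
    """Classify how two collections of hashable rows differ (hypothetical helper)."""
    first, second = list(first), list(second)
    if first == second:
        return "identical"
    # Counter treats each list as a multiset, so ordering is ignored
    # while duplicate rows are still counted.
    if Counter(first) == Counter(second):
        return "same rows, different order"
    return "different values"


# Made-up sample rows, used only to exercise the helper.
baseline = [("1", "x"), ("2", "y")]
reordered = [("2", "y"), ("1", "x")]
changed = [("1", "x"), ("2", "z")]

print(compare_rows(baseline, reordered))  # same rows, different order
print(compare_rows(baseline, changed))    # different values
```

If the results of `collect()` on two DataFrames fall into the last category, ordering alone cannot explain the discrepancy.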







[jira] [Commented] (SPARK-42397) Inconsistent data produced by `FlatMapCoGroupsInPandas`

2023-02-10 Thread Ted Chester Jenks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687011#comment-17687011
 ] 

Ted Chester Jenks commented on SPARK-42397:
---

{{    import pandas as pd}}
{{    from pyspark.sql import functions as F}}
{{    # Assumes an active SparkSession named `spark`.}}
{{    test_df = spark.createDataFrame(}}
{{        [}}
{{            ["1", "23", "abc", "blah", "def", "1"],}}
{{            ["1", "23", "abc", "blah", "def", "1"],}}
{{            ["1", "23", "abc", "blah", "def", "2"],}}
{{            ["1", "23", "abc", "blah", "def", "2"],}}
{{        ],}}
{{        ["cluster", "partition", "event", "abc", "def", "one_or_two"]}}
{{    )}}
{{    # Rows tagged "1" form the left side of the cogroup.}}
{{    df1 = test_df.filter(}}
{{        F.col("one_or_two") == "1"}}
{{    ).select(}}
{{        "cluster", "event", "abc"}}
{{    )}}
{{    # Rows tagged "2" form the right side of the cogroup.}}
{{    df2 = test_df.filter(}}
{{        F.col("one_or_two") == "2"}}
{{    ).select(}}
{{        "cluster", "event", "def"}}
{{    )}}
{{    # Report the columns each pandas DataFrame has when the function is called.}}
{{    def get_schema(l, r):}}
{{        return pd.DataFrame(}}
{{            [(str(l.columns), str(r.columns))],}}
{{            columns=["left_colms", "right_colms"]}}
{{        )}}
{{    grouped_df = df1.groupBy("cluster").cogroup(df2.groupBy("cluster")).applyInPandas(}}
{{        get_schema, "left_colms string, right_colms string"}}
{{    )}}
{{    grouped_df_1 = grouped_df.withColumn(}}
{{        "xyz", F.lit("1234")}}
{{    )}}



