[ 
https://issues.apache.org/jira/browse/SPARK-41018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629220#comment-17629220
 ] 

Nikesh commented on SPARK-41018:
--------------------------------

Attached the notebooks I used. When the data size is small (notebook 
ZScoreWithKoalas_PandasOnSpark_SmallerDataset), the Koalas and pandas outputs 
match.

But when the data size is large (notebook 
ZScoreWithKoalas_PandasOnSpark_BiggerDataset), the Koalas output differs from 
the pandas output.

> Koalas.idxmin() is not picking the minimum value from a dataframe, but 
> pandas.idxmin() gives
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41018
>                 URL: https://issues.apache.org/jira/browse/SPARK-41018
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.1
>         Environment: databricks
>            Reporter: Nikesh
>            Priority: Critical
>             Fix For: 3.3.1
>
>         Attachments: ZScoreWithKoalas_PandasOnSpark_BiggerDataset.html, 
> ZScoreWithKoalas_PandasOnSpark_SmallerDataset.html
>
>
> Hi,
> I have a Koalas dataframe with age and income columns. I calculated z-scores 
> for age and income, then calculated a norm from age_zscore and 
> income_zscore (the new column is named sq_dist). I then called idxmin on 
> the new column, but it does not return the index of the minimum value.
> I ran the same operations on a pandas dataframe, and there idxmin does 
> return the index of the minimum value.
> Please find attached the notebook with the step-by-step operations I performed.
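
For readers without access to the attachments, the steps described above can be sketched roughly as follows on the pandas side. This is only an illustration, not the code from the attached notebooks: the sample values, the helper name add_sq_dist, and the choice of sq_dist as the sum of the two squared z-scores are all assumptions, since the report does not spell out the exact formula.

```python
import pandas as pd


def add_sq_dist(df: pd.DataFrame) -> pd.DataFrame:
    """Add z-score columns for age and income plus a squared-distance column.

    Hypothetical helper: the formula for sq_dist (sum of squared z-scores)
    is an assumption, not taken from the reporter's notebooks.
    """
    out = df.copy()
    for col in ("age", "income"):
        # z-score: distance from the column mean in units of standard deviation
        out[f"{col}_zscore"] = (out[col] - out[col].mean()) / out[col].std()
    out["sq_dist"] = out["age_zscore"] ** 2 + out["income_zscore"] ** 2
    return out


# Made-up sample data with an explicit, non-default index.
pdf = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 38],
        "income": [30000, 42000, 88000, 61000, 52000],
    },
    index=[10, 11, 12, 13, 14],
)

result = add_sq_dist(pdf)
min_label = result["sq_dist"].idxmin()

# Sanity check: the row that idxmin points at must hold the column minimum.
assert result.loc[min_label, "sq_dist"] == result["sq_dist"].min()
print(min_label, result.loc[min_label, "sq_dist"])
```

The final assertion is the property the report says fails on the larger dataset with Koalas. Since pandas-on-Spark mirrors these pandas methods, the same check could presumably be run against a pyspark.pandas dataframe to narrow down where the outputs diverge, though that is untested here.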



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
