[ https://issues.apache.org/jira/browse/SPARK-39942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-39942:
------------------------------------

    Assignee:     (was: Apache Spark)

> The input parameter of nsmallest should be validated as Integer
> ---------------------------------------------------------------
>
>                 Key: SPARK-39942
>                 URL: https://issues.apache.org/jira/browse/SPARK-39942
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 3.2.2
>         Environment: PySpark: Master
>            Reporter: bo zhao
>            Priority: Minor
>
> The input parameter of nsmallest should be validated as an integer, but this
> validation appears to be missing. As a result, PySpark raises an
> AnalysisException from the query analyzer when an unexpected type is passed.
> For example, passing a boolean:
>
> PySpark:
> {code:java}
> >>> df = ps.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]}, columns=['A', 'B'])
> >>> df.groupby(['A'])['B'].nsmallest(1)
> A
> 1  0    3
> 2  1    4
> 3  2    5
> 4  3    6
> Name: B, dtype: int64
> >>> df.groupby(['A'])['B'].nsmallest(True)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/spark/spark/python/pyspark/pandas/groupby.py", line 3598, in nsmallest
>     sdf.withColumn(temp_rank_column, F.row_number().over(window))
>   File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 2129, in filter
>     jdf = self._jdf.filter(condition._jc)
>   File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
>     return_value = get_return_value(
>   File "/home/spark/spark/python/pyspark/sql/utils.py", line 196, in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: cannot resolve '(__rank__ <= true)' due to data type mismatch: differing types in '(__rank__ <= true)' (int and boolean).;
> 'Filter (__rank__#4995 <= true)
> +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995]
>    +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L, __rank__#4995, __rank__#4995]
>       +- Window [row_number() windowspecdefinition(__index_level_0__#4988L, B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __rank__#4995], [__index_level_0__#4988L], [B#4979L ASC NULLS FIRST, __natural_order__#4983L ASC NULLS FIRST]
>          +- Project [__index_level_0__#4988L, __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
>             +- Project [A#4978L AS __index_level_0__#4988L, __index_level_0__#4977L AS __index_level_1__#4989L, B#4979L, __natural_order__#4983L]
>                +- Project [__index_level_0__#4977L, A#4978L, B#4979L, monotonically_increasing_id() AS __natural_order__#4983L]
>                   +- LogicalRDD [__index_level_0__#4977L, A#4978L, B#4979L], false
> {code}
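
The traceback shows the boolean reaching the Spark filter `__rank__ <= true`, where the analyzer rejects it as an int/boolean mismatch. A minimal sketch of the kind of up-front check the report asks for follows. It is not the actual Spark patch: `check_n_is_int` and `nsmallest_sketch` are hypothetical names, and the pure-Python `heapq` call only stands in for the real groupby logic. Because `bool` is a subclass of `int` in Python, the check excludes it explicitly.

```python
import heapq
from numbers import Integral


def check_n_is_int(n, arg_name="n"):
    """Hypothetical helper: reject anything that is not a true integer.

    bool is a subclass of int, so True/False must be excluded explicitly.
    """
    if isinstance(n, bool) or not isinstance(n, Integral):
        raise TypeError(
            f"{arg_name} must be an integer, got {type(n).__name__}"
        )
    return int(n)


def nsmallest_sketch(values, n=5):
    """Stand-in for nsmallest: validate n first, then select the n smallest."""
    n = check_n_is_int(n)
    return heapq.nsmallest(n, values)


if __name__ == "__main__":
    print(nsmallest_sketch([3, 4, 5, 6], 1))
    try:
        nsmallest_sketch([3, 4, 5, 6], True)
    except TypeError as exc:
        print(f"TypeError: {exc}")
```

Failing early with a `TypeError` in Python would give the caller a message about the argument itself rather than an analyzer error about an internal `__rank__` column.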