[ https://issues.apache.org/jira/browse/SPARK-32098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151773#comment-17151773 ]
Dongjoon Hyun commented on SPARK-32098:
---------------------------------------

Hi, [~hyukjin.kwon] and [~bryanc].

I'm trying to reproduce this on 2.4.6 with the example above, but have had no luck so far. Did I miss something?

{code}
$ ./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
Python 3.7.7 (default, Mar 21 2020, 21:07:30)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
20/07/05 23:13:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Python version 3.7.7 (default, Mar 21 2020 21:07:30)
SparkSession available as 'spark'.
>>> import pandas as pd
>>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
+---+
|  a|
+---+
|  1|
|  2|
|  3|
+---+
{code}

> Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-32098
>                 URL: https://issues.apache.org/jira/browse/SPARK-32098
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Critical
>              Labels: correctness
>             Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> When a pandas DataFrame has a float index, createDataFrame with Arrow produces wrong results:
> {code}
> >>> import pandas as pd
> >>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
> +---+
> |  a|
> +---+
> |  1|
> |  1|
> |  2|
> +---+
> {code}
> This is because direct slicing treats the slice bounds as index labels rather than row positions when the index contains floats:
> {code}
> >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
>      a
> 2.0  1
> 3.0  2
> 4.0  3
> >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
>      a
> 4.0  3
> >>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
>    a
> 4  3
> {code}
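
For reference, the label-versus-position distinction described in the issue can be sketched with explicit indexers. This is only an illustration, not the actual Spark patch: `slice_positionally` is a hypothetical helper name, and the chunk size of 2 is an arbitrary assumption. It uses `.loc` and `.iloc` rather than direct slicing because the behavior of direct slicing on a float index depends on the pandas version.

```python
import pandas as pd


def slice_positionally(pdf, step):
    """Hypothetical helper: split pdf into chunks of at most `step` rows,
    counting rows by position and ignoring the index values."""
    return [pdf.iloc[start:start + step] for start in range(0, len(pdf), step)]


pdf = pd.DataFrame({'a': [1, 2, 3]}, index=[2., 3., 4.])

# Label-based: the bound 2 is matched against the index values,
# so every row whose label is >= 2.0 is selected.
by_label = pdf.loc[2:]

# Positional: the bound 2 is a row offset, whatever the index contains.
by_position = pdf.iloc[2:]

# Chunking by position yields each row exactly once.
chunks = slice_positionally(pdf, 2)
```

With `.iloc`, the chunks depend only on the row count, so a float index cannot cause rows to be duplicated or dropped.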