[ https://issues.apache.org/jira/browse/SPARK-36283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Darcy Shen updated SPARK-36283:
-------------------------------
    Description: 
h2. Case 1: Create a PySpark DataFrame from a Pandas DataFrame with Arrow disabled and without a schema
{code:python}
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf)
df.show()
{code}
h3. Incorrect result
{code}
+----------+
|     point|
+----------+
|(0.0, 0.0)|
|(0.0, 0.0)|
+----------+
{code}
h2. Case 2: Create a PySpark DataFrame from a Pandas DataFrame with Arrow disabled and with an unmatched schema
{code:python}
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
schema = StructType([StructField('point', ExamplePointUDT(), False)])
df = spark.createDataFrame(pdf, schema)
df.show()
{code}
h3. Error thrown as expected
{code}
Traceback (most recent call last):
  File "bug2.py", line 54, in <module>
    df = spark.createDataFrame(pdf, schema)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 673, in createDataFrame
    return super(SparkSession, self).createDataFrame(
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py", line 300, in createDataFrame
    return self._create_dataframe(data, schema, samplingRatio, verifySchema)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 700, in _create_dataframe
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 509, in _createFromLocal
    data = list(data)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 682, in prepare
    verify_func(obj)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1390, in verify_struct
    verifier(v)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1304, in verify_udf
    verifier(dataType.toInternal(obj))
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1354, in verify_array
    element_verifier(i)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1403, in verify_default
    verify_acceptable_types(obj)
  File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1291, in verify_acceptable_types
    raise TypeError(new_msg("%s can not accept object %r in type %s"
TypeError: element in array field point: DoubleType can not accept object 1 in type <class 'int'>
{code}
h2. Case 3: Create a PySpark DataFrame from a Pandas DataFrame with Arrow disabled and with a matched schema
{code:python}
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1.0, 1.0), ExamplePoint(2.0, 2.0)])})
schema = StructType([StructField('point', ExamplePointUDT(), False)])
df = spark.createDataFrame(pdf, schema)
df.show()
{code}
h3. Correct result
{code}
+----------+
|     point|
+----------+
|(1.0, 1.0)|
|(2.0, 2.0)|
+----------+
{code}


was:
h2. Case 1 Create PySpark Dataframe using Pandas DataFrame with Arrow disabled and without schema:
{code:python}
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf)
df.show()
{code}
h2. Case 2 Create PySpark Dataframe using Pandas DataFrame with Arrow disabled and with unmatched schema:
{code}
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
schema = StructType([StructField('point', ExamplePointUDT(), False)])
df = spark.createDataFrame(pdf, schema)
df.show()
{code}
h3. Case 3 Create PySpark Dataframe using Pandas DataFrame with Arrow disabled and with matched schema:
{code}
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1.0, 1.0), ExamplePoint(2.0, 2.0)])})
schema = StructType([StructField('point', ExamplePointUDT(), False)])
df = spark.createDataFrame(pdf, schema)
df.show()
{code}


> Bug when creating dataframe without schema and with Arrow disabled
> ------------------------------------------------------------------
>
>                 Key: SPARK-36283
>                 URL: https://issues.apache.org/jira/browse/SPARK-36283
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.1.1
>            Reporter: Darcy Shen
>            Priority: Major
>
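The Case 2 traceback bottoms out in `verify_acceptable_types` in pyspark/sql/types.py: the verifier for `DoubleType` accepts only Python `float`, and never coerces `int`, which is why `ExamplePoint(1, 1)` is rejected while `ExamplePoint(1.0, 1.0)` passes. A minimal stand-alone sketch of that strict check (the name `verify_double` is illustrative, not a real PySpark API):

```python
# Hypothetical sketch of the strict acceptance check behind the Case 2 error:
# DoubleType's verifier requires an exact `float`, so ints are rejected
# rather than silently widened.
def verify_double(obj):
    # Mirrors the "can not accept object %r in type %s" message from
    # pyspark/sql/types.py; does not cover None/nullability handling.
    if not isinstance(obj, float):
        raise TypeError(
            "DoubleType can not accept object %r in type %s" % (obj, type(obj)))
    return obj

verify_double(1.0)      # accepted
try:
    verify_double(1)    # rejected, as in the Case 2 traceback
except TypeError as e:
    print(e)
```

This matches the reported behavior: Case 2's schema triggers the check and fails loudly, while Case 1 (no schema) skips it and silently produces the wrong `(0.0, 0.0)` rows.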
--
This message was sent by Atlassian Jira
(v8.3.4#803005)