[ https://issues.apache.org/jira/browse/SPARK-28533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-28533:
---------------------------------
    Description: 
Hello,

I have faced an issue while casting the datatype of a column in PySpark 2.4.1.

Say that I have the following data frame, in which column B is a string that holds a list of arrays, and I want to convert column B to an ArrayType. I used the following code:

{code:java}
import ast
from pyspark.sql.types import *
from pyspark.sql.functions import udf, col

df = spark.createDataFrame([("row1", "[[12.46575,13.78697],[10.565,11]]"), ("row2", "[[1.2345,13.45454],[6.6868,0.234524]]")], schema=['A', 'B'])
to_array = udf(lambda x: ast.literal_eval(x.replace('\"', '')), ArrayType(ArrayType(DoubleType())))
df = df.withColumn('C', to_array(col('B')))
df.show()
{code}

The new column C is an ArrayType of ArrayType with elements of DoubleType. But with this code I was not able to convert the integer value *11*. This value is not part of the final output.

||A||B||C||
|row1|[[12.46575,13.78697],[10.565,*11*]]|[[12.46575, 13.78697], [10.565,]]|
|row2|[[1.2345,13.45454],[6.6868,0.234524]]|[[1.2345, 13.45454], [6.6868, 0.234524]]|

As you can see, column C does not contain 11. If I replace DoubleType with FloatType, I get the same result, and if I replace it with DecimalType, the output is entirely empty.

I am not sure whether there is an issue with my code or whether this is a bug. I hope someone can provide some clarification on this. Thanks!!

  was:
Hello,

I have faced an issue while casting the datatype of a column in PySpark 2.4.1.
Say that I have the following data frame, in which column B is a string that holds a list of arrays, and I want to convert column B to an ArrayType. I used the following code:

{code:java}
import ast
from pyspark.sql.types import *
from pyspark.sql.functions import udf

df = spark.createDataFrame([("row1", "[[12.46575,13.78697],[10.565,11]]"), ("row2", "[[1.2345,13.45454],[6.6868,0.234524]]")], schema=['A', 'B'])
to_array = udf(lambda x: ast.literal_eval(x.replace('\"', '')), ArrayType(ArrayType(DoubleType())))
df = df.withColumn('C', to_array(col('B')))
{code}

The new column C is an ArrayType of ArrayType with elements of DoubleType. But with this code I was not able to convert the integer value *11*. This value is not part of the final output.

||A||B||C||
|row1|[[12.46575,13.78697],[10.565,*11*]]|[[12.46575, 13.78697], [10.565,]]|
|row2|[[1.2345,13.45454],[6.6868,0.234524]]|[[1.2345, 13.45454], [6.6868, 0.234524]]|

As you can see, column C does not contain 11. If I replace DoubleType with FloatType, I get the same result, and if I replace it with DecimalType, the output is entirely empty.

I am not sure whether there is an issue with my code or whether this is a bug. I hope someone can provide some clarification on this. Thanks!!


> PySpark datatype casting error
> ------------------------------
>
>                 Key: SPARK-28533
>                 URL: https://issues.apache.org/jira/browse/SPARK-28533
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.1
>            Reporter: RoopTeja Muppalla
>            Priority: Minor
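
The report above leaves open why 11 is dropped. One likely explanation, which this message does not confirm, is that a Python UDF's return values are not cast to the declared return type: ast.literal_eval yields a Python int for 11, and an int does not match DoubleType, so the element comes back as null. The sketch below, under that assumption, converts every parsed element to float before returning it. The helper name parse_nested_doubles is hypothetical and is not part of the report or of PySpark.

```python
import ast


def parse_nested_doubles(text):
    """Parse a string such as "[[1.5,2],[3,4.25]]" into a list of lists of floats."""
    if text is None:
        return None
    rows = ast.literal_eval(text.replace('"', ''))
    # Convert each element explicitly, so a Python int such as 11 becomes 11.0.
    return [[float(value) for value in row] for row in rows]


# Hypothetical wiring into the DataFrame from the report
# (needs pyspark and an active SparkSession named `spark`):
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import ArrayType, DoubleType
#   to_array = udf(parse_nested_doubles, ArrayType(ArrayType(DoubleType())))
#   df = df.withColumn('C', to_array(col('B')))

print(parse_nested_doubles("[[12.46575,13.78697],[10.565,11]]"))
# prints [[12.46575, 13.78697], [10.565, 11.0]]
```

Keeping the parsing logic in a plain function makes it possible to check the conversion without a Spark session; whether this resolves the behavior on 2.4.1 would still need to be verified against a running cluster.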