[jira] [Updated] (SPARK-28533) PySpark datatype casting error

Hyukjin Kwon (JIRA) Fri, 26 Jul 2019 23:22:26 -0700


     [ https://issues.apache.org/jira/browse/SPARK-28533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyukjin Kwon updated SPARK-28533:
---------------------------------
    Description: 
Hello,

I have run into an issue while casting the data type of a column in PySpark 2.4.1.

Say that I have the following DataFrame, in which column B is a string that 
holds a list of arrays. I want to convert column B to an ArrayType, so I have 
used the following code:
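{code:python}
import ast
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.sql.functions import udf, col

# Assumes an active SparkSession named `spark` (e.g. the pyspark shell).
# Column B holds nested lists as strings; the last value in row1 is the integer 11.
df = spark.createDataFrame([("row1", "[[12.46575,13.78697],[10.565,11]]"),
                            ("row2", "[[1.2345,13.45454],[6.6868,0.234524]]")],
                           schema=['A', 'B'])
# Parse each string into nested Python lists, declared as array<array<double>>.
to_array = udf(lambda x: ast.literal_eval(x.replace('\"', '')),
               ArrayType(ArrayType(DoubleType())))
df = df.withColumn('C', to_array(col('B')))
df.show()
{code}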
The new column C is an ArrayType of ArrayType with elements of DoubleType. 
However, with this code the integer value *11* is not converted: it is missing 
from the final output.
||A||B||C||
|row1|[[12.46575,13.78697],[10.565,*11*]]|[[12.46575, 13.78697], [10.565,]]|
|row2|[[1.2345,13.45454],[6.6868,0.234524]]|[[1.2345, 13.45454], [6.6868, 0.234524]]|

As you can see, column C does not contain 11. If I replace DoubleType with 
FloatType, I get the same result, and if I replace it with DecimalType, the 
output is entirely empty.

I am not sure whether there is an issue with my code or whether this is a bug.

I hope someone can provide some clarification on this. Thanks!
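
A minimal sketch of one possible workaround follows. It assumes the value is dropped because ast.literal_eval returns the Python int 11, and a UDF declared with DoubleType does not coerce ints to doubles. The helper name parse_nested_doubles is hypothetical and not part of the original report; the Spark wiring is shown only as comments because it needs a running SparkSession.

```python
import ast


def parse_nested_doubles(text):
    """Parse a string such as "[[1.5,2],[3,4.25]]" into a list of lists of floats."""
    rows = ast.literal_eval(text.replace('"', ''))
    # Cast every element explicitly: literal_eval yields an int for "11",
    # which does not match the declared DoubleType element type.
    return [[float(value) for value in row] for row in rows]


# Hypothetical wiring into the reproduction above (requires a SparkSession):
# to_array = udf(parse_nested_doubles, ArrayType(ArrayType(DoubleType())))
# df = df.withColumn('C', to_array(col('B')))

parsed = parse_nested_doubles("[[12.46575,13.78697],[10.565,11]]")
print(parsed)  # [[12.46575, 13.78697], [10.565, 11.0]]
```

The DecimalType result likely has a similar cause: the UDF would need to return decimal.Decimal objects rather than Python floats, though that case is not verified here.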

 

  was:
Hello,

I have faced an issue while casting the datatype of a column in pyspark 2.4.1.

Say that i have the following data frame in which column B is a string which 
has a list or arrays, and I want to convert the column B to a Arraytype, so i 
have used the following code
{code:java}
import ast
from pyspark.sql.types import *
from pyspark.sql.functions import udf

df = spark.createDataFrame([("row1", "[[12.46575,13.78697],[10.565,*11*]]"),  
("row2", "[[1.2345,13.45454],[6.6868,0.234524]]")], schema=['A', 'B'])
to_array = udf(lambda x: ast.literal_eval(x.replace('\"', '')), 
ArrayType(ArrayType(DoubleType())))
df = df.withColumn('C', to_array(col('B')))
{code}
The new column C is an ArrayType of ArrayType with elements of DoubleType. But 
with this code I was not able to convert the integer type value *11.* This 
value is not part of the final output.
||A||B||C||
|row1|[[12.46575,13.78697],[10.565,*11*]]|[[12.46575, 13.78697], [10.565,]]|
|row2|[[1.2345,13.45454],[6.6868,0.234524]]|[[1.2345, 13.45454], [6.6868, 0.234524]]|

As you could see, the column C does not have 11. If I replace the DoubleType to 
FloatType same error and if I replace it with DecimalType the output is all 
empty.

I am not sure whether there is a issue with my code or it is a bug.

Hope, someone can provide some clarification on this. Thanks!!

 


> PySpark datatype casting error
> ------------------------------
>
>                 Key: SPARK-28533
>                 URL: https://issues.apache.org/jira/browse/SPARK-28533
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.1
>            Reporter: RoopTeja Muppalla
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
