[jira] [Updated] (SPARK-40630) Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL

xsys (Jira) Sat, 01 Oct 2022 21:11:17 -0700


     [ https://issues.apache.org/jira/browse/SPARK-40630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


xsys updated SPARK-40630:
-------------------------
    Description: 
h3. Describe the bug

When we construct a DataFrame with an invalid DATE/TIMESTAMP (e.g. 
{{1969-12-31 23:59:59 B}}) via {{spark-shell}}, or insert an invalid 
DATE/TIMESTAMP into a table via {{spark-sql}}, both interfaces unexpectedly 
evaluate the invalid value to {{NULL}} instead of throwing an exception.
h3. To Reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}:
{code:java}
$SPARK_HOME/bin/spark-sql{code}
Execute the following:
{code:java}
spark-sql> create table timestamp_vals(c1 TIMESTAMP) stored as ORC;
spark-sql> insert into timestamp_vals select cast(" 1969-12-31 23:59:59 B " as timestamp);
spark-sql> select * from timestamp_vals;
NULL{code}
 
Using {{spark-shell}}:
{code:java}
$SPARK_HOME/bin/spark-shell{code}
 
Execute the following:
{code:java}
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(Seq(" 1969-12-31 23:59:59 B ").toDF("time").select(to_timestamp(col("time")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0))))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[721] at parallelize at <console>:28
scala> val schema = new StructType().add(StructField("c1", TimestampType, true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,TimestampType,true))
scala> val df = spark.createDataFrame(rdd, schema)
df194: org.apache.spark.sql.DataFrame = [c1: timestamp]
scala> df.show(false)
+----+
|c1  |
+----+
|null|
+----+
{code}
h3. Expected behavior

We expect both the {{spark-sql}} and {{spark-shell}} interfaces to throw an 
exception for an invalid DATE/TIMESTAMP, as they do for most other data 
types (e.g. the invalid value {{"foo"}} for the {{INT}} data type).
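
The difference between the observed and the expected behavior can be illustrated outside Spark. The following is a minimal sketch in plain Scala using {{java.time}}, not Spark's own parsing code. The object name {{StrictVsLenientTimestamp}}, the helpers {{parseLenient}} and {{parseStrict}}, and the pattern {{uuuu-MM-dd HH:mm:ss}} are hypothetical and chosen only for this illustration.
{code:scala}
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object StrictVsLenientTimestamp {
  // Hypothetical pattern for this sketch; Spark's accepted formats are broader.
  private val fmt = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss")

  // Lenient: any parse failure collapses to None, analogous to the NULL
  // that the report observes.
  def parseLenient(s: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(s.trim, fmt)).toOption

  // Strict: invalid input raises java.time.format.DateTimeParseException,
  // analogous to the exception that the report expects.
  def parseStrict(s: String): LocalDateTime =
    LocalDateTime.parse(s.trim, fmt)

  def main(args: Array[String]): Unit = {
    val bad = " 1969-12-31 23:59:59 B "
    println(parseLenient(bad))               // None
    println(Try(parseStrict(bad)).isFailure) // true
  }
}
{code}
With the lenient variant the trailing {{B}} is silently lost together with the rest of the value, which is the kind of outcome this report describes. The strict variant surfaces the bad input to the caller.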


> Both SparkSQL and DataFrame insert invalid DATE/TIMESTAMP as NULL
> -----------------------------------------------------------------
>
>                 Key: SPARK-40630
>                 URL: https://issues.apache.org/jira/browse/SPARK-40630
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, SQL
>    Affects Versions: 3.2.1
>            Reporter: xsys
>            Priority: Major


