I agree that having both modes and let the user choose the one he/she wants is 
the best option (I don't see big arguments on this honestly). Once we have 
this, I don't see big differences on what is the default. What - I think - we 
still have to work on, is to go ahead with the "strict mode" work and provide a 
more convenient way for users to switch among the 2 options. I mean: currently 
we have one flag for throwing exception on overflow for operations on decimals, 
one for doing the same for operations on other data types and probably going 
ahead we will have more. I think in the end we will need to collect them all 
under an "umbrella" flag which lets the user simply switch between strict and 
non-strict mode. I also think that we will need to document this very well and 
give it particular attention in our docs, maybe with a dedicated section, in 
order to provide enough visibility on it to end users.

 

I’m +1 on adding a strict mode flag this way, but I’m undecided on whether or 
not we want a separate flag for each of the arithmetic overflow situations that 
could produce invalid results. My intuition is yes, because different users 
have different levels of tolerance for different kinds of errors. I’d expect 
these sorts of configurations to be set up at an infrastructure level, e.g. to 
maintain consistent standards throughout a whole organization.

 

From: Gengliang Wang <gengliang.w...@databricks.com>
Date: Thursday, August 1, 2019 at 3:07 AM
To: Marco Gaido <marcogaid...@gmail.com>
Cc: Wenchen Fan <cloud0...@gmail.com>, Hyukjin Kwon <gurwls...@gmail.com>, 
Russell Spitzer <russell.spit...@gmail.com>, Ryan Blue <rb...@netflix.com>, 
Reynold Xin <r...@databricks.com>, Matt Cheah <mch...@palantir.com>, Takeshi 
Yamamuro <linguin....@gmail.com>, Spark dev list <dev@spark.apache.org>
Subject: Re: [Discuss] Follow ANSI SQL on table insertion

 

Hi all,

 

Let me explain a little bit on the proposal.

By default, we follow the store assignment rules in table insertion. On invalid 
casting, the result is null. It's better than the behavior in Spark 2.x while 
keeping backward-compatibility. It is 

If users can't torrent the silently corrupting, they can enable the new mode 
which throws runtime exceptions.

The proposal itself is quite complete. It satisfies different users to some 
degree.

 

It is hard to avoid null in data processing anyway. For example, 

> select 2147483647 + 1

2147483647 is the max value of Int. And the result data type of pulsing two 
integers are supposed to be Integer type. Since the value of (2147483647 + 1) 
can't fit into Int, I think Spark return null or throw runtime exceptions in 
such case. (Someone can argue that we can always convert the result as wider 
types, but that's another topic about performance and DBMS behaviors)

 

So, give a table t with an Int column, checking data type with Up-Cast can't 
avoid possible null values in the following SQL, as the result data type of 
(int_column_a + int_column_b) is int type.

>  insert into t select int_column_a + int_column_b from tbl_a, tbl_b;

 

Furthermore, if Spark uses Up-Cast and a user's existing ETL job failed because 
of that, what should he/she do then? I think he/she will try adding "cast" to 
queries first. Maybe a project for unifying data schema over all data sources 
has to be done later on if he/she has enough resource. The upgrade can be 
painful because of the strict rules of Up-Cast, while the user scenario might 
be able to tolerate converting Double to Decimal, or Timestamp to Date. 

 

 

Gengliang

 

On Thu, Aug 1, 2019 at 4:55 PM Marco Gaido <marcogaid...@gmail.com> wrote:

Hi all, 

 

I agree that having both modes and let the user choose the one he/she wants is 
the best option (I don't see big arguments on this honestly). Once we have 
this, I don't see big differences on what is the default. What - I think - we 
still have to work on, is to go ahead with the "strict mode" work and provide a 
more convenient way for users to switch among the 2 options. I mean: currently 
we have one flag for throwing exception on overflow for operations on decimals, 
one for doing the same for operations on other data types and probably going 
ahead we will have more. I think in the end we will need to collect them all 
under an "umbrella" flag which lets the user simply switch between strict and 
non-strict mode. I also think that we will need to document this very well and 
give it particular attention in our docs, maybe with a dedicated section, in 
order to provide enough visibility on it to end users.

 

Thanks,

Marco

 

Il giorno gio 1 ago 2019 alle ore 09:42 Wenchen Fan <cloud0...@gmail.com> ha 
scritto:

Hi Hyukjin, I think no one here is against the SQL standard behavior, which is 
no corrupted data + runtime exception. IIUC the main argument here is: shall we 
still keep the existing "return null for invalid operations" behavior as 
default? 

 

Traditional RDBMS is usually used as the final destination of CLEAN data. It's 
understandable that they need high data quality and they try their best to 
avoid corrupted data at any cost.

 

However, Spark is different. AFAIK Spark is usually used as an ETL tool, which 
needs to deal with DIRTY data. It's convenient if Spark returns null for 
invalid data and then I can filter them out later. I agree that "null" doesn't 
always mean invalid data, but it usually does. The "return null" behavior is 
there for many years and I don't see many people complain about it in the past.

 

Recently there are several new projects at the storage side, which can host 
CLEAN data and can connect to Spark. This does raise a new requirement in Spark 
to improve data quality. The community is already working on it: now arithmetic 
operation fails if overflow happens(need to set a config). We are going to 
apply the same to cast as well, and any other expressions that return null for 
invalid input data.

 

Rome was not built in a day. I think we need to keep the legacy "return null" 
behavior as default for a while, until we have enough confidence about the new 
ANSI mode. The default behavior should fit the majority of the users. I believe 
currently ETL is still the main use case of Spark and the "return null" 
behavior is useful to a lot of users.

 

On Thu, Aug 1, 2019 at 8:55 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

I am sorry I am asking a question without reading whole discussion after I 
replied.

But why does Spark specifically needs to to it differently while ANCI standard, 
other DBMSes and other systems do?
If there isn't a specific issue to Spark, that basically says they are all 
wrong.

 

2019년 8월 1일 (목) 오전 9:31, Russell Spitzer <russell.spit...@gmail.com>님이 작성:

Another solution along those lines that I know we implemented for limited 
precision types is just to do a loud warning whenever you do such a cast. IE: 
Warning we are casting X to Y this may result in data loss.

 

On Wed, Jul 31, 2019 at 7:08 PM Russell Spitzer <russell.spit...@gmail.com> 
wrote:

I would argue "null" doesn't have to mean invalid. It could mean missing or 
deleted record. There is a lot of difference between missing record and invalid 
record.

I definitely have no problem with two modes, but I think setting a parameter to 
enable lossy conversions is a fine tradeoff to avoid data loss for others. The 
impact then for those who don't care about lossy casting is an analysis level 
message "Types don't match, to enable lossy casting set some parameter" while 
the impact in the other direction is possibly invisible until it hits something 
critical downstream.

 

On Wed, Jul 31, 2019 at 6:50 PM Ryan Blue <rb...@netflix.com> wrote:

> you guys seem to be arguing no those users don't know what they are doing and 
> they should not exist. 

 

I'm not arguing that they don't exist. Just that the disproportionate impact of 
awareness about this behavior is much worse for people that don't know about 
it. And there are a lot of those people as well.

 

On Wed, Jul 31, 2019 at 4:48 PM Ryan Blue <rb...@netflix.com> wrote:

> "between a runtime error and an analysis-time error" → I think one of those 
> should be the default. 

 

If you're saying that the default should be an error of some kind, then I think 
we agree. I'm also fine with having a mode that allows turning off the error 
and silently replacing values with NULL... as long as it isn't the default and 
I can set the default for my platform to an analysis-time error.

 

On Wed, Jul 31, 2019 at 4:42 PM Russell Spitzer <russell.spit...@gmail.com> 
wrote:

I definitely view it as silently corrupting. If i'm copying over a dataset 
where some elements are null and some have values, how do I differentiate 
between my expected nulls and those that were added in silently in the cast? 

 

On Wed, Jul 31, 2019 at 6:15 PM Reynold Xin <r...@databricks.com> wrote:

"between a runtime error and an analysis-time error" → I think one of those 
should be the default.

 

Maybe we are talking past each other or I wasn't explaining clearly, but I 
don't think you understand what I said and the use cases out there. I as an end 
user could very well be fully aware of the consequences of exceptional values 
but I can choose to ignore them. This is especially common for data scientists 
who are doing some quick and dirty analysis or exploration. You can't deny this 
large class of use cases out there (probably makes up half of Spark use cases 
actually).

 

Also writing out the exceptional cases as null are not silently corrupting 
them. The engine is sending an explicit signal that the value is no longer 
valid given the constraint.

 

Not sure if this is the best analogy, but think about checked exceptions in 
Java. It's great for writing low level code in which error handling is 
paramount, e.g. storage systems, network layers. But in most high level 
applications people just write boilerplate catches that are no-ops, because 
they have other priorities and they can tolerate mishandling of exceptions, 
although often maybe they shouldn't.

 

 

 

On Wed, Jul 31, 2019 at 2:55 PM, Ryan Blue <rb...@netflix.com> wrote:

Another important aspect of this problem is whether a user is conscious of the 
cast that is inserted by Spark. Most of the time, users are not aware of casts 
that are implicitly inserted, and that means replacing values with NULL would 
be a very surprising behavior. The impact of this choice affects users 
disproportionately: someone that knows about inserted casts is mildly annoyed 
when required to add an explicit cast, but a user that doesn't know an inserted 
cast is dropping values is very negatively impacted and may not discover the 
problem until it is too late.

 

That disproportionate impact is what makes me think that it is not okay for 
Spark to silently replace values with NULL, even if that's what ANSI would 
allow. Other databases also have the ability to reject null values in tables, 
providing extra insurance against the problem, but Spark doesn't have required 
columns in its DDL.

 

So while I agree with Reynold that there is a trade-off, I think that trade-off 
makes the choice between a runtime error and an analysis-time error. I'm okay 
with either a runtime error as the default or an analysis error as the default, 
as long as there is a setting that allows me to choose one for my deployment.

 

 

On Wed, Jul 31, 2019 at 10:39 AM Reynold Xin <r...@databricks.com> wrote:

OK to push back: "disagreeing with the premise that we can afford to not be 
maximal on standard 3. The correctness of the data is non-negotiable, and 
whatever solution we settle on cannot silently adjust the user’s data under any 
circumstances."

 

This blanket statement sounds great on surface, but there are a lot of 
subtleties. "Correctness" is absolutely important, but engineering/prod 
development are often about tradeoffs, and the industry has consistently traded 
correctness for performance or convenience, e.g. overflow checks, null 
pointers, consistency in databases ...

 

It all depends on the use cases and to what degree use cases can tolerate. For 
example, while I want my data engineering production pipeline to throw any 
error when the data doesn't match my expectations (e.g. type widening, 
overflow), if I'm doing some quick and dirty data science, I don't want the job 
to just fail due to outliers.

 

 

 

On Wed, Jul 31, 2019 at 10:13 AM, Matt Cheah <mch...@palantir.com> wrote:

Sorry I meant the current behavior for V2, which fails the query compilation if 
the cast is not safe.

 

Agreed that a separate discussion about overflow might be warranted. I’m 
surprised we don’t throw an error now, but it might be warranted to do so.

 

-Matt Cheah

 

From: Reynold Xin <r...@databricks.com>
Date: Wednesday, July 31, 2019 at 9:58 AM
To: Matt Cheah <mch...@palantir.com>
Cc: Russell Spitzer <russell.spit...@gmail.com>, Takeshi Yamamuro 
<linguin....@gmail.com>, Gengliang Wang <gengliang.w...@databricks.com>, Ryan 
Blue <rb...@netflix.com>, Spark dev list <dev@spark.apache.org>, Hyukjin Kwon 
<gurwls...@gmail.com>, Wenchen Fan <cloud0...@gmail.com>
Subject: Re: [Discuss] Follow ANSI SQL on table insertion

 

Error! Filename not specified.

Matt what do you mean by maximizing 3, while allowing not throwing errors when 
any operations overflow? Those two seem contradicting.

 

 

On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah <mch...@palantir.com> wrote:

I’m -1, simply from disagreeing with the premise that we can afford to not be 
maximal on standard 3. The correctness of the data is non-negotiable, and 
whatever solution we settle on cannot silently adjust the user’s data under any 
circumstances.

 

I think the existing behavior is fine, or perhaps the behavior can be flagged 
by the destination writer at write time.

 

-Matt Cheah

 

From: Hyukjin Kwon <gurwls...@gmail.com>
Date: Monday, July 29, 2019 at 11:33 PM
To: Wenchen Fan <cloud0...@gmail.com>
Cc: Russell Spitzer <russell.spit...@gmail.com>, Takeshi Yamamuro 
<linguin....@gmail.com>, Gengliang Wang <gengliang.w...@databricks.com>, Ryan 
Blue <rb...@netflix.com>, Spark dev list <dev@spark.apache.org>
Subject: Re: [Discuss] Follow ANSI SQL on table insertion

 

>From my look, +1 on the proposal, considering ASCI and other DBMSes in general.

 

2019년 7월 30일 (화) 오후 3:21, Wenchen Fan <cloud0...@gmail.com>님이 작성:

We can add a config for a certain behavior if it makes sense, but the most 
important thing we want to reach an agreement here is: what should be the 
default behavior? 

 

Let's explore the solution space of table insertion behavior first:

At compile time,

1. always add cast

2. add cast following the ASNI SQL store assignment rule (e.g. string to int is 
forbidden but long to int is allowed)

3. only add cast if it's 100% safe

At runtime,

1. return null for invalid operations

2. throw exceptions at runtime for invalid operations

 

The standards to evaluate a solution:

1. How robust the query execution is. For example, users usually don't want to 
see the query fails midway.

2. how tolerant to user queries. For example, a user would like to write long 
values to an int column as he knows all the long values won't exceed int range.

3. How clean the result is. For example, users usually don't want to see 
silently corrupted data (null values).

 

The current Spark behavior for Data Source V1 tables: always add cast and 
return null for invalid operations. This maximizes standard 1 and 2, but the 
result is least clean and users are very likely to see silently corrupted data 
(null values).

 

The current Spark behavior for Data Source V2 tables (new in Spark 3.0): only 
add cast if it's 100% safe. This maximizes standard 1 and 3, but many queries 
may fail to compile, even if these queries can run on other SQL systems. Note 
that, people can still see silently corrupted data because cast is not the only 
one that can return corrupted data. Simple operations like ADD can also return 
corrected data if overflow happens. e.g. INSERT INTO t1 (intCol) SELECT 
anotherIntCol + 100 FROM t2 

 

The proposal here: add cast following ANSI SQL store assignment rule, and 
return null for invalid operations. This maximizes standard 1, and also fits 
standard 2 well: if a query can't compile in Spark, it usually can't compile in 
other mainstream databases as well. I think that's tolerant enough. For 
standard 3, this proposal doesn't maximize it but can avoid many invalid 
operations already.

 

Technically we can't make the result 100% clean at compile-time, we have to 
handle things like overflow at runtime. I think the new proposal makes more 
sense as the default behavior.

  

 

On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer <russell.spit...@gmail.com> 
wrote:

I understand spark is making the decisions, i'm say the actual final effect of 
the null decision would be different depending on the insertion target if the 
target has different behaviors for null.

 

On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> I'm a big -1 on null values for invalid casts. 

 

This is why we want to introduce the ANSI mode, so that invalid cast fails at 
runtime. But we have to keep the null behavior for a while, to keep backward 
compatibility. Spark returns null for invalid cast since the first day of Spark 
SQL, we can't just change it without a way to restore to the old behavior.

 

I'm OK with adding a strict mode for the upcast behavior in table insertion, 
but I don't agree with making it the default. The default behavior should be 
either the ANSI SQL behavior or the legacy Spark behavior.

 

> other modes should be allowed only with strict warning the behavior will be 
> determined by the underlying sink.

 

Seems there is some misunderstanding. The table insertion behavior is fully 
controlled by Spark. Spark decides when to add cast and Spark decided whether 
invalid cast should return null or fail. The sink is only responsible for 
writing data, not the type coercion/cast stuff.

 

On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <russell.spit...@gmail.com> 
wrote:

I'm a big -1 on null values for invalid casts. This can lead to a lot of even 
more unexpected errors and runtime behavior since null is  

 

1. Not allowed in all schemas (Leading to a runtime error anyway)
2. Is the same as delete in some systems (leading to data loss)

And this would be dependent on the sink being used. Spark won't just be 
interacting with ANSI compliant sinks so I think it makes much more sense to be 
strict. I think Upcast mode is a sensible default and other modes should be 
allowed only with strict warning the behavior will be determined by the 
underlying sink.

 

On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro <linguin....@gmail.com> wrote:

Hi, all 

 

+1 for implementing this new store cast mode.

>From a viewpoint of DBMS users, this cast is pretty common for INSERTs and I 
>think this functionality could

promote migrations from existing DBMSs to Spark. 

 

The most important thing for DBMS users is that they could optionally choose 
this mode when inserting data.

Therefore, I think it might be okay that the two modes (the current upcast mode 
and the proposed store cast mode)

co-exist for INSERTs. (There is a room to discuss which mode  is enabled by 
default though...)

 

IMHO we'll provide three behaviours below for INSERTs;

 - upcast mode

 - ANSI store cast mode and runtime exceptions thrown for invalid values

 - ANSI store cast mode and null filled for invalid values

 

 

On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <gengliang.w...@databricks.com> 
wrote:

Hi Ryan, 

 

Thanks for the suggestions on the proposal and doc.

Currently, there is no data type validation in table insertion of V1. We are on 
the same page that we should improve it. But using UpCast is from one extreme 
to another. It is possible that many queries are broken after upgrading to 
Spark 3.0. 

The rules of UpCast are too strict. E.g. it doesn't allow assigning Timestamp 
type to Date Type, as there will be "precision loss". To me, the type coercion 
is reasonable and the "precision loss" is under expectation. This is very 
common in other SQL engines. 

As long as Spark is following the ANSI SQL store assignment rules, it is users' 
responsibility to take good care of the type coercion in data writing. I think 
it's the right decision.

 

> But the new behavior is only applied in DataSourceV2, so it won’t affect 
> existing jobs until sources move to v2 and break other behavior anyway.

Eventually, most sources are supposed to be migrated to DataSourceV2 V2. I 
think we can discuss and make a decision now.

 

> Fixing the silent corruption by adding a runtime exception is not a good 
> option, either. 

The new optional mode proposed in 
https://issues.apache.org/jira/browse/SPARK-28512 [issues.apache.org] is 
disabled by default. This should be fine.

 

 

 

On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:

I don't agree with handling literal values specially. Although Postgres does 
it, I can't find anything about it in the SQL standard. And it introduces 
inconsistent behaviors which may be strange to users: 

* What about something like "INSERT INTO t SELECT float_col + 1.1"?
* The same insert with a decimal column as input will fail even when a decimal 
literal would succeed
* Similar insert queries with "literal" inputs can be constructed through 
layers of indirection via views, inline views, CTEs, unions, etc. Would those 
decimals be treated as columns and fail or would we attempt to make them 
succeed as well? Would users find this behavior surprising?

 

Silently corrupt data is bad, but this is the decision we made at the beginning 
when design Spark behaviors. Whenever an error occurs, Spark attempts to return 
null instead of runtime exception. Recently we provide configs to make Spark 
fail at runtime for overflow, but that's another story. Silently corrupt data 
is bad, runtime exception is bad, and forbidding all the table insertions that 
may fail(even with very little possibility) is also bad. We have to make 
trade-offs. The trade-offs we made in this proposal are:

* forbid table insertions that are very like to fail, at compile time. (things 
like writing string values to int column)

* allow table insertions that are not that likely to fail. If the data is 
wrong, don't fail, insert null.

* provide a config to fail the insertion at runtime if the data is wrong.

 

>  But the new behavior is only applied in DataSourceV2, so it won’t affect 
> existing jobs until sources move to v2 and break other behavior anyway.

When users write SQL queries, they don't care if a table is backed by Data 
Source V1 or V2. We should make sure the table insertion behavior is consistent 
and reasonable. Furthermore, users may even not care if the SQL queries are run 
in Spark or other RDBMS, it's better to follow SQL standard instead of 
introducing a Spark-specific behavior.

 

We are not talking about a small use case like allowing writing decimal literal 
to float column, we are talking about a big goal to make Spark compliant to SQL 
standard, w.r.t. https://issues.apache.org/jira/browse/SPARK-26217 
[issues.apache.org] . This proposal is a sub-task of it, to make the table 
insertion behavior follow SQL standard.

 

On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue <rb...@netflix.com> wrote:

I don’t think this is a good idea. Following the ANSI standard is usually fine, 
but here it would silently corrupt data.

>From your proposal doc, ANSI allows implicitly casting from long to int (any 
>numeric type to any other numeric type) and inserts NULL when a value 
>overflows. That would drop data values and is not safe.

Fixing the silent corruption by adding a runtime exception is not a good 
option, either. That puts off the problem until much of the job has completed, 
instead of catching the error at analysis time. It is better to catch this 
earlier during analysis than to run most of a job and then fail.

In addition, part of the justification for using the ANSI standard is to avoid 
breaking existing jobs. But the new behavior is only applied in DataSourceV2, 
so it won’t affect existing jobs until sources move to v2 and break other 
behavior anyway.

I think that the correct solution is to go with the existing validation rules 
that require explicit casts to truncate values.

That still leaves the use case that motivated this proposal, which is that 
floating point literals are parsed as decimals and fail simple insert 
statements. We already came up with two alternatives to fix that problem in the 
DSv2 sync and I think it is a better idea to go with one of those instead of 
“fixing” Spark in a way that will corrupt data or cause runtime failures.

 

On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0...@gmail.com> wrote:

I have heard about many complaints about the old table insertion behavior. 
Blindly casting everything will leak the user mistake to a late stage of the 
data pipeline, and make it very hard to debug. When a user writes string values 
to an int column, it's probably a mistake and the columns are misordered in the 
INSERT statement. We should fail the query earlier and ask users to fix the 
mistake. 

 

In the meanwhile, I agree that the new table insertion behavior we introduced 
for Data Source V2 is too strict. It may fail valid queries unexpectedly.

 

In general, I support the direction of following the ANSI SQL standard. But I'd 
like to do it with 2 steps:

1. only add cast when the assignment rule is satisfied. This should be the 
default behavior and we should provide a legacy config to restore to the old 
behavior.

2. fail the cast operation at runtime if overflow happens. AFAIK Marco Gaido is 
working on it already. This will have a config as well and by default we still 
return null.

 

After doing this, the default behavior will be slightly different from the SQL 
standard (cast can return null), and users can turn on the ANSI mode to fully 
follow the SQL standard. This is much better than before and should prevent a 
lot of user mistakes. It's also a reasonable choice to me to not throw 
exceptions at runtime by default, as it's usually bad for long-running jobs.

 

Thanks,

Wenchen 

 

On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <gengliang.w...@databricks.com> 
wrote:

Hi everyone, 

 

I would like to discuss the table insertion behavior of Spark. In the current 
data source V2, only UpCast is allowed for table insertion. I think following 
ANSI SQL is a better idea.

For more information, please read the Discuss: Follow ANSI SQL on table 
insertion [docs.google.com]

Please let me know if you have any thoughts on this.

 

Regards,

Gengliang


 

-- 

Ryan Blue 

Software Engineer

Netflix


 

-- 

---
Takeshi Yamamuro

 


 

-- 

Ryan Blue 

Software Engineer

Netflix

 


 

-- 

Ryan Blue 

Software Engineer

Netflix


 

-- 

Ryan Blue 

Software Engineer

Netflix

Attachment: smime.p7s
Description: S/MIME cryptographic signature

Reply via email to