[GitHub] spark pull request #23042: [SPARK-26070][SQL] add rule for implicit type coe...

2018-11-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23042#discussion_r233816966
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -138,6 +138,11 @@ object TypeCoercion {
 case (DateType, TimestampType)
   => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType)
 
+// to support a popular use case of tables using Decimal(X, 0) for long IDs instead of strings
+// see SPARK-26070 for more details
+case (n: DecimalType, s: StringType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale))
--- End diff --

What if the decimal is (1, 0) and the string is something like `.`?

The string can be anything: a very big integer, a fraction with many digits 
after the dot, etc. I don't think there is a perfect solution; casting to 
double is the best we can do here.

I'd suggest that end users manually apply whichever cast fits their data best.
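
For illustration, a minimal sketch of the concern, runnable in spark-shell 
(the literal values are made up):

```scala
// With the proposed rule the string side is cast to the decimal's own
// type, so anything that does not fit Decimal(1, 0) silently becomes null:
spark.sql("SELECT CAST('.' AS DECIMAL(1, 0))").show()                      // null
spark.sql("SELECT CAST('12345678901234567890' AS DECIMAL(1, 0))").show()  // null

// Casting both sides to double (the current behavior) at least keeps the
// comparison meaningful for very large integers:
spark.sql(
  "SELECT CAST(CAST(7 AS DECIMAL(1, 0)) AS DOUBLE) = " +
  "CAST('12345678901234567890' AS DOUBLE)").show()                        // false
```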


---




[GitHub] spark pull request #23042: [SPARK-26070][SQL] add rule for implicit type coe...

2018-11-15 Thread uzadude
Github user uzadude commented on a diff in the pull request:

https://github.com/apache/spark/pull/23042#discussion_r233972629
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -138,6 +138,11 @@ object TypeCoercion {
 case (DateType, TimestampType)
   => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType)
 
+// to support a popular use case of tables using Decimal(X, 0) for long IDs instead of strings
+// see SPARK-26070 for more details
+case (n: DecimalType, s: StringType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale))
--- End diff --

Yes, I see what you mean, and I agree. However, this wrong implicit type 
coercion is a huge source of potential bugs (evidently we've hit it in a few 
places) that causes wrong results.
What do you say we add, along the lines of SPARK-21646, another flag such as 
"typeCoercion.mode" that enables a "safe mode": simply throw an 
AnalysisException when the user tries to compare unsafe types?
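
A rough sketch of what such a guard could look like (nothing here is in the 
PR; the flag and helper names `safeModeEnabled` / `findCommonTypeSafely` are 
made up for illustration):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types._

object SafeCoercionSketch {
  // Imagined "spark.sql.typeCoercion.mode = safe" switch.
  val safeModeEnabled = true

  def findCommonTypeSafely(l: DataType, r: DataType): Option[DataType] = (l, r) match {
    case _ if l == r => Some(l)
    // one "definitely safe" widening, kept as an example
    case (IntegerType, LongType) | (LongType, IntegerType) => Some(LongType)
    case _ if safeModeEnabled =>
      throw new AnalysisException(
        s"cannot safely compare $l with $r; please add an explicit cast")
    case _ => None // in lenient mode, fall through to the existing rules
  }
}
```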


---




[GitHub] spark pull request #23042: [SPARK-26070][SQL] add rule for implicit type coe...

2018-11-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23042#discussion_r234091858
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -138,6 +138,11 @@ object TypeCoercion {
 case (DateType, TimestampType)
   => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType)
 
+// to support a popular use case of tables using Decimal(X, 0) for long IDs instead of strings
+// see SPARK-26070 for more details
+case (n: DecimalType, s: StringType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale))
--- End diff --

CC @gatorsmile @mgaido91 I think it's time to look at the SQL standard and 
other mainstream databases, and see how we should update the type coercion 
rules with a safe mode. What do you think?


---




[GitHub] spark pull request #23042: [SPARK-26070][SQL] add rule for implicit type coe...

2018-11-16 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23042#discussion_r234155688
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -138,6 +138,11 @@ object TypeCoercion {
 case (DateType, TimestampType)
   => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType)
 
+// to support a popular use case of tables using Decimal(X, 0) for long IDs instead of strings
+// see SPARK-26070 for more details
+case (n: DecimalType, s: StringType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale))
--- End diff --

@cloud-fan I think we have seen many issues around this. I don't think there 
is a standard for these rules; every RDBMS behaves differently. The worst 
thing about the current rules, IMHO, is that they are not even coherent within 
Spark itself (see #19635 for instance).

The option I'd prefer is to follow the Postgres behavior, i.e. no implicit 
cast at all. When there is a type mismatch, the user has to choose how to cast 
things. It is a bit more effort on the user's side, but it is the safest 
option IMHO.
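
For a concrete sense of the difference, a small sketch in Spark SQL terms (the 
Postgres-style error wording below is invented for illustration):

```scala
// Today: Spark silently coerces both sides and the query runs.
spark.sql("SELECT CAST(1 AS DECIMAL(10, 0)) = '1'").show()

// Under Postgres-style rules the first query would fail analysis with
// something like "cannot compare decimal(10,0) with string", forcing:
spark.sql("SELECT CAST(1 AS DECIMAL(10, 0)) = CAST('1' AS DECIMAL(10, 0))").show()
```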

What do you think?


---




[GitHub] spark pull request #23042: [SPARK-26070][SQL] add rule for implicit type coe...

2018-11-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23042#discussion_r234401696
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -138,6 +138,11 @@ object TypeCoercion {
 case (DateType, TimestampType)
   => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType)
 
+// to support a popular use case of tables using Decimal(X, 0) for long IDs instead of strings
+// see SPARK-26070 for more details
+case (n: DecimalType, s: StringType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale))
--- End diff --

> no implicit cast at all

Is that too strict? I feel it's OK to compare an int with a long. Maybe we 
should come up with a list of "definitely safe" type coercions, and allow only 
those.


---




[GitHub] spark pull request #23042: [SPARK-26070][SQL] add rule for implicit type coe...

2018-11-17 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23042#discussion_r234403917
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -138,6 +138,11 @@ object TypeCoercion {
 case (DateType, TimestampType)
   => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType)
 
+// to support a popular use case of tables using Decimal(X, 0) for long IDs instead of strings
+// see SPARK-26070 for more details
+case (n: DecimalType, s: StringType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale))
--- End diff --

I agree with you when you say that it is "too strict": it is an extreme 
approach. But that's how Postgres works, and I do believe it has some benefits 
over the other behaviors. I'd argue just a couple of points about what you are 
suggesting:
 - if you are comparing an int and a long, you most likely need to define your 
schemas better (not always true, of course). For instance, in Oracle, if you 
are not careful you pretty easily end up with a variable like YEAR being 
stored as an integer in some tables, as a string in others, and so on, because 
Oracle does the implicit casting and nobody realizes the bad design. With 
Postgres you do notice a messy schema design like that. And if your schemas 
are fine and you really do need a cast, you can always add it explicitly;
 - I think it is very hard to determine which type coercions are "definitely 
safe". For instance, is casting a DOUBLE to a DECIMAL definitely safe? And an 
INT to a STRING? We may say that casting an INT to a STRING is safe, because 
there is no error at all in the conversion, but if we have something like 
`2014 = '2014 '`, should we return true or false? Most likely the user wants 
true there, and most likely we would return false, which the user may (or may 
not) realize only at the end of the job (which may mean several hours); see 
the sketch below.
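
To make the second point concrete, a sketch of the two possible outcomes, 
depending on which side the coercion rule casts:

```scala
// If INT -> STRING were the chosen "safe" coercion, the comparison is done
// on strings and the trailing space makes it false:
spark.sql("SELECT CAST(2014 AS STRING) = '2014 '").show()   // false

// If the string side is cast to a numeric type instead, the double parser
// tolerates the trailing whitespace and the result is true:
spark.sql("SELECT 2014 = CAST('2014 ' AS DOUBLE)").show()   // true
```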




---




[GitHub] spark pull request #23042: [SPARK-26070][SQL] add rule for implicit type coe...

2018-11-17 Thread uzadude
Github user uzadude commented on a diff in the pull request:

https://github.com/apache/spark/pull/23042#discussion_r234431689
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -138,6 +138,11 @@ object TypeCoercion {
 case (DateType, TimestampType)
   => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType)
 
+// to support a popular use case of tables using Decimal(X, 0) for long IDs instead of strings
+// see SPARK-26070 for more details
+case (n: DecimalType, s: StringType) if n.scale == 0 => Some(DecimalType(n.precision, n.scale))
--- End diff --

I personally agree with @cloud-fan that there are a few types that are 
"definitely safe", and since the user is not always in control of his input 
tables, I believe convenience is more important than schema definitions here. 
Also, even count() returns a bigint, so you would have to write filters like 
`count(*) > 100L`, which would be a huge regression.
I believe the "definitely safe" list is very short and we should use it. 
@mgaido91, in your examples I do agree that Double to Decimal is not safe, and 
neither is String to almost anything.
The trivially safe ones are something like (Long, Int), (Int, Double), and 
(Decimal, Decimal) widened to a common precision and scale, and maybe (Date, 
Timestamp).. A sketch of such a list follows.
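
A sketch of what such a short whitelist could look like (purely illustrative; 
the helper name and the exact pairs are made up from the list above):

```scala
import org.apache.spark.sql.types._

// Hypothetical "definitely safe" coercions, roughly the pairs listed above.
def definitelySafeCommonType(a: DataType, b: DataType): Option[DataType] = (a, b) match {
  case (IntegerType, LongType) | (LongType, IntegerType)     => Some(LongType)
  case (IntegerType, DoubleType) | (DoubleType, IntegerType) => Some(DoubleType)
  case (d1: DecimalType, d2: DecimalType) =>
    // widen to a decimal that can represent both operands exactly
    val scale = math.max(d1.scale, d2.scale)
    val intDigits = math.max(d1.precision - d1.scale, d2.precision - d2.scale)
    Some(DecimalType(math.min(intDigits + scale, DecimalType.MAX_PRECISION), scale))
  case (DateType, TimestampType) | (TimestampType, DateType) => Some(TimestampType)
  case _ => None // everything else would require an explicit cast
}
```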


---
