This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new a921da8509a [SPARK-46443][SQL] Decimal precision and scale should be decided by H2 dialect
a921da8509a is described below

commit a921da8509a19b2d23c30ad657725f760932236c
Author: Jiaan Geng <belie...@163.com>
AuthorDate: Fri Dec 22 09:55:00 2023 +0800

    [SPARK-46443][SQL] Decimal precision and scale should be decided by H2 dialect
    
    ### What changes were proposed in this pull request?
    This PR fixes a bug by making the JDBC dialect decide the decimal precision and scale.
    
    **How to reproduce the bug?**
    https://github.com/apache/spark/pull/44397 proposed DS V2 pushdown of `PERCENTILE_CONT` and `PERCENTILE_DISC`.
    The bug shows up when the SQL below is pushed down to H2 over JDBC.
    `SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC 
NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"`
    
    **The root cause**
    `getQueryOutputSchema` obtains the output schema of the query by calling `JdbcUtils.getSchema`.
    The query sent to H2 is shown below.
    `SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC 
NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"`
    We can get the five variables from `ResultSetMetaData`, please refer:
    ```
    columnName = "PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SALARY NULLS 
FIRST)"
    dataType = 2
    typeName = "NUMERIC"
    fieldSize = 100000
    fieldScale = 50000
    ```
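    
    For illustration, here is a minimal standalone sketch (not the actual `JdbcUtils.getSchema` code; the helper name is made up) of how these values come out of a plain `java.sql.ResultSetMetaData`:
    ```
    import java.sql.{ResultSet, Types}

    // Hypothetical helper: print the metadata of column `i` (1-based) of a ResultSet,
    // i.e. the same five values JdbcUtils reads when it builds the schema.
    def describeColumn(rs: ResultSet, i: Int): Unit = {
      val md = rs.getMetaData
      val columnName = md.getColumnLabel(i)    // "PERCENTILE_CONT(0.5) WITHIN GROUP (...)"
      val dataType   = md.getColumnType(i)     // 2, i.e. java.sql.Types.NUMERIC
      val typeName   = md.getColumnTypeName(i) // "NUMERIC"
      val fieldSize  = md.getPrecision(i)      // 100000 for this H2 query
      val fieldScale = md.getScale(i)          // 50000 for this H2 query
      println(s"$columnName: $typeName($fieldSize, $fieldScale), " +
        s"jdbcType=$dataType, isNumeric=${dataType == Types.NUMERIC}")
    }
    ```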
    Then we build the Catalyst schema with `JdbcUtils.getCatalystType`, which ends up calling `DecimalType.bounded(precision, scale)`.
    `DecimalType.bounded(100000, 50000)` returns `DecimalType(38, 38)`.
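    
    A sketch of the effective behaviour of `bounded` (not the literal Spark source): both values are capped at 38 independently, so the relationship between precision and scale is lost.
    ```
    import org.apache.spark.sql.types.DecimalType

    // What bounded(precision, scale) effectively does: cap each value at 38
    // independently, which discards the ratio between precision and scale.
    val precision = 100000
    val scale = 50000
    val boundedType = DecimalType(
      math.min(precision, DecimalType.MAX_PRECISION), // 38
      math.min(scale, DecimalType.MAX_SCALE))         // 38
    // boundedType == DecimalType(38, 38): no room left for integer digits.
    ```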
    Because `DecimalType(38, 38)` leaves no room for integer digits, converting the values read back from H2 exceeds the maximum precision, and `makeGetter` finally throws:
    ```
    Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003
            at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48)
            at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124)
            at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
            at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408)
            at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552)
            at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408)
            at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406)
            at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358)
            at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339)
    ```
    
    ### Why are the changes needed?
    This PR fixes a bug where `JdbcUtils` cannot derive the correct decimal type.
    
    ### Does this PR introduce _any_ user-facing change?
    'Yes'.
    It fixes a bug.
    
    ### How was this patch tested?
    Manual tests in https://github.com/apache/spark/pull/44397
    
    ### Was this patch authored or co-authored using generative AI tooling?
    'No'.
    
    Closes #44398 from beliefer/SPARK-46443.
    
    Authored-by: Jiaan Geng <belie...@163.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
---
 .../main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala
index 3c9bc0ed691..c3b4092c8e3 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala
@@ -34,7 +34,7 @@ import org.apache.spark.sql.connector.catalog.functions.UnboundFunction
 import org.apache.spark.sql.connector.catalog.index.TableIndex
 import org.apache.spark.sql.connector.expressions.{Expression, FieldReference, NamedReference}
 import org.apache.spark.sql.execution.datasources.jdbc.{JDBCOptions, JdbcUtils}
-import org.apache.spark.sql.types.{BooleanType, ByteType, DataType, DecimalType, ShortType, StringType}
+import org.apache.spark.sql.types.{BooleanType, ByteType, DataType, DecimalType, MetadataBuilder, ShortType, StringType}
 
 private[sql] object H2Dialect extends JdbcDialect {
   override def canHandle(url: String): Boolean =
@@ -57,6 +57,20 @@ private[sql] object H2Dialect extends JdbcDialect {
   override def isSupportedFunction(funcName: String): Boolean =
     supportedFunctions.contains(funcName)
 
+  override def getCatalystType(
+      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
+    sqlType match {
+      case Types.NUMERIC if size > 38 =>
+        // H2 supports very large decimal precision like 100000. The max precision in Spark is only
+        // 38. Here we shrink both the precision and scale of H2 decimal to fit Spark, and still
+        // keep the ratio between them.
+        val scale = if (null != md) md.build().getLong("scale") else 0L
+        val selectedScale = (DecimalType.MAX_PRECISION * (scale.toDouble / size.toDouble)).toInt
+        Option(DecimalType(DecimalType.MAX_PRECISION, selectedScale))
+      case _ => None
+    }
+  }
+
   override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
     case StringType => Option(JdbcType("CLOB", Types.CLOB))
     case BooleanType => Some(JdbcType("BOOLEAN", Types.BOOLEAN))
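
For the column above (precision 100000, scale 50000), the new mapping keeps the ratio between precision and scale: selectedScale = (38 * 50000.0 / 100000.0).toInt = 19, so the column is read as DecimalType(38, 19) instead of DecimalType(38, 38). A standalone sketch of the same computation (the function name is illustrative, not part of the patch):

```
// Mirror of the ratio-preserving shrink applied by the new H2Dialect.getCatalystType.
// MAX_PRECISION stands in for DecimalType.MAX_PRECISION (38).
val MAX_PRECISION = 38

def shrinkDecimal(size: Int, scale: Long): (Int, Int) = {
  val selectedScale = (MAX_PRECISION * (scale.toDouble / size.toDouble)).toInt
  (MAX_PRECISION, selectedScale)
}

assert(shrinkDecimal(100000, 50000L) == (38, 19)) // half of the 38 digits stay after the decimal point
```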

