[ https://issues.apache.org/jira/browse/SPARK-48989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot reassigned SPARK-48989: -------------------------------------- Assignee: Apache Spark > WholeStageCodeGen error resulting in NumberFormatException when calling > SUBSTRING_INDEX > --------------------------------------------------------------------------------------- > > Key: SPARK-48989 > URL: https://issues.apache.org/jira/browse/SPARK-48989 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 4.0.0 > Environment: This was tested from the {{spark-shell}}, in local mode. > All Spark versions were run with default settings. > Spark 4.0 SNAPSHOT: Exception. > Spark 4.0 Preview: Exception. > Spark 3.5.1: Success. > Reporter: Mithun Radhakrishnan > Assignee: Apache Spark > Priority: Major > Labels: pull-request-available > > One seems to run into a {{NumberFormatException}}, possibly from an error in > WholeStageCodeGen, when I exercise {{SUBSTRING_INDEX}} with a null row, thus: > {code:scala} > // Create integer table with one null. > sql( " SELECT num FROM VALUES (1), (2), (3), (NULL) AS (num) > ").repartition(1).write.mode("overwrite").parquet("/tmp/mytable") > // Exercise substring-index. > sql( " SELECT num, SUBSTRING_INDEX('a_a_a', '_', num) AS subs FROM > PARQUET.`/tmp/mytable` ").show() > {code} > On Spark 4.0 (HEAD, as of today, and with the preview-1), I see the following > exception: > {code} > java.lang.NumberFormatException: For input string: "columnartorow_value_0" > at > java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) > at java.base/java.lang.Integer.parseInt(Integer.java:668) > at > org.apache.spark.sql.catalyst.expressions.SubstringIndex.$anonfun$doGenCode$29(stringExpressions.scala:1660) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.$anonfun$defineCodeGen$3(Expression.scala:869) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.nullSafeCodeGen(Expression.scala:888) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.defineCodeGen(Expression.scala:868) > at > org.apache.spark.sql.catalyst.expressions.SubstringIndex.doGenCode(stringExpressions.scala:1659) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207) > at scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202) > at > org.apache.spark.sql.catalyst.expressions.ToPrettyString.doGenCode(ToPrettyString.scala:62) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207) > at scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202) > at > org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:162) > at > org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$2(basicPhysicalOperators.scala:74) > at scala.collection.immutable.List.map(List.scala:247) > at scala.collection.immutable.List.map(List.scala:79) > at > org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:74) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.withSubExprEliminationExprs(CodeGenerator.scala:1085) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:74) > at > org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:200) > at > org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.ColumnarToRowExec.consume(Columnar.scala:68) > at > org.apache.spark.sql.execution.ColumnarToRowExec.doProduce(Columnar.scala:193) > at > org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:99) > {code} > The same query seems to run alright on Spark 3.5.x: > {code} > +----+-----+ > | num| subs| > +----+-----+ > | 1| a| > | 2| a_a| > | 3|a_a_a| > |NULL| NULL| > +----+-----+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org