kbendick opened a new issue #3453:
URL: https://github.com/apache/iceberg/issues/3453
While testing the 0.12.1 release candidate, I tried running the `add_files`
procedure with an ORC table.
I ran it once, so the files were imported. I then dropped the table, created
a new table, and tried to reimport from the same path.
I got the following `NumberFormatException` when trying to parse the number
of imported files.
```
scala> spark.sql("CALL hive.system.add_files(table => 'hive.default.test2',
source_table => '`orc`.`hdfs://hdfs-box:8020/user/hive/warehouse/orc`')").show
java.lang.NumberFormatException: null
at java.lang.Long.parseLong(Long.java:552)
at java.lang.Long.parseLong(Long.java:631)
at
org.apache.iceberg.spark.procedures.AddFilesProcedure.lambda$importToIceberg$1(AddFilesProcedure.java:135)
at
org.apache.iceberg.spark.procedures.BaseProcedure.execute(BaseProcedure.java:85)
at
org.apache.iceberg.spark.procedures.BaseProcedure.modifyIcebergTable(BaseProcedure.java:74)
at
org.apache.iceberg.spark.procedures.AddFilesProcedure.importToIceberg(AddFilesProcedure.java:121)
at
org.apache.iceberg.spark.procedures.AddFilesProcedure.call(AddFilesProcedure.java:108)
at
org.apache.spark.sql.execution.datasources.v2.CallExec.run(CallExec.scala:33)
at
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:40)
at
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:40)
at
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:46)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
... 47 elided
```
The path in the HDFS existed, but there were no files there, so `null`
seems to have been returned as a result.
This seems to still exist in current Iceberg:
https://github.com/apache/iceberg/blob/1c158df94bf004d43867939841e75e4bfb941c16/spark/v3.1/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java#L144
We should fail early if there are no files present in the path based
directory (or at least say number of files is zero).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]