[ https://issues.apache.org/jira/browse/SPARK-31675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517237#comment-17517237 ]
CHC commented on SPARK-31675:
-----------------------------

I hit the same problem. The SQL below reproduces it:

{code:sql}
CREATE TABLE `spark3_snap`(
  `id` string)
PARTITIONED BY (`dt` string)
STORED AS ORC
LOCATION 'hdfs://path/to/spark3_snap';

-- The file system of the partition location is different from the file system
-- of the table location: one is S3A, the other is HDFS.
alter table tmp.spark3_snap add partition (dt='2020-09-10')
LOCATION 's3a://path/to/spark3_snap/dt=2020-09-10';

insert overwrite table tmp.spark3_snap partition(dt)
select '10' id, '2020-09-09' dt
union
select '20' id, '2020-09-10' dt;
{code}

This fails with the following exception:

{code:none}
java.lang.IllegalArgumentException: Wrong FS: s3a://path/to/spark3_snap/dt=2020-09-10, expected: hdfs://cluster1
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:666)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:214)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:816)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:812)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:823)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.$anonfun$commitJob$6(HadoopMapReduceCommitProtocol.scala:194)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.$anonfun$commitJob$6$adapted(HadoopMapReduceCommitProtocol.scala:194)
	at scala.collection.immutable.Set$Set1.foreach(Set.scala:141)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:194)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:240)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:605)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:240)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:187)
	at ......
{code}

I will submit a PR later to fix renaming and deleting of files that live on a different file system in `HadoopMapReduceCommitProtocol`.
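
For illustration, here is a minimal sketch of that idea (the object and method names are mine, not the actual patch): resolve the FileSystem from each path before deleting or renaming it, instead of reusing the handle of the default/staging file system.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

object PerPathFsOps {
  // Delete a path with the FileSystem that actually owns it (s3a://, hdfs://clusterX, ...),
  // instead of the FileSystem of the default/staging location.
  def deleteWithOwnFs(path: Path, hadoopConf: Configuration): Boolean = {
    val fs = path.getFileSystem(hadoopConf) // resolved from the path's scheme and authority
    fs.delete(path, true)
  }

  // Rename when both paths live on the same FileSystem; otherwise copy and delete the source,
  // because rename() cannot cross file systems.
  def renameWithOwnFs(src: Path, dst: Path, hadoopConf: Configuration): Boolean = {
    val srcFs = src.getFileSystem(hadoopConf)
    val dstFs = dst.getFileSystem(hadoopConf)
    if (srcFs.getUri == dstFs.getUri) {
      srcFs.rename(src, dst)
    } else {
      FileUtil.copy(srcFs, src, dstFs, dst, /* deleteSource = */ true, hadoopConf)
    }
  }
}
{code}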

> Fail to insert data into a table with a remote location, caused by the Hive encryption check
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31675
>                 URL: https://issues.apache.org/jira/browse/SPARK-31675
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0, 3.1.0
>            Reporter: Kent Yao
>            Priority: Major
>
> Before the fix for HIVE-14380 (https://issues.apache.org/jira/browse/HIVE-14380), which shipped
> in Hive 2.2.0, Hive performs an encryption check on the srcPaths and destPaths when moving files
> from the staging dir to the final table dir:
> {code:java}
> if (!isSrcLocal) {
>   // For NOT local src file, rename the file
>   if (hdfsEncryptionShim != null
>       && (hdfsEncryptionShim.isPathEncrypted(srcf) || hdfsEncryptionShim.isPathEncrypted(destf))
>       && !hdfsEncryptionShim.arePathsOnSameEncryptionZone(srcf, destf)) {
>     LOG.info("Copying source " + srcf + " to " + destf + " because HDFS encryption zones are different.");
>     success = FileUtils.copy(srcf.getFileSystem(conf), srcf,
>         destf.getFileSystem(conf), destf,
>         true,    // delete source
>         replace, // overwrite destination
>         conf);
>   } else {
> {code}
> The hdfsEncryptionShim instance holds a global FileSystem instance that belongs to the default
> file system. This causes failures when it checks a path that belongs to a remote file system.
> For example, consider the following table:
> {code:sql}
> key	int	NULL
>
> # Detailed Table Information
> Database	bdms_hzyaoqin_test_2
> Table	abc
> Owner	bdms_hzyaoqin
> Created Time	Mon May 11 15:14:15 CST 2020
> Last Access	Thu Jan 01 08:00:00 CST 1970
> Created By	Spark 2.4.3
> Type	MANAGED
> Provider	hive
> Table Properties	[transient_lastDdlTime=1589181255]
> Location	hdfs://cluster2/user/warehouse/bdms_hzyaoqin_test.db/abc
> Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat	org.apache.hadoop.mapred.TextInputFormat
> OutputFormat	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties	[serialization.format=1]
> Partition Provider	Catalog
> Time taken: 0.224 seconds, Fetched 18 row(s)
> {code}
> The table abc lives on the remote HDFS 'hdfs://cluster2'. When we run the command below in a
> Spark SQL job whose default FS is 'hdfs://cluster1':
> {code:sql}
> insert into bdms_hzyaoqin_test_2.abc values(1);
> {code}
> it fails with:
> {code:java}
> Error in query: java.lang.IllegalArgumentException: Wrong FS:
> hdfs://cluster2/user/warehouse/bdms_hzyaoqin_test.db/abc/.hive-staging_hive_2020-05-11_17-10-27_123_6306294638950056285-1/-ext-10000/part-00000-badf2a31-ab36-4b60-82a1-0848774e4af5-c000,
> expected: hdfs://cluster1
> {code}
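>
> To make the file-system mismatch concrete, here is a minimal spark-shell style sketch
> (illustrative only; the cluster names follow the example above, and connectivity to both
> clusters is assumed):
> {code:scala}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> val conf = new Configuration()
> conf.set("fs.defaultFS", "hdfs://cluster1") // the job's default file system
>
> val remote = new Path("hdfs://cluster2/user/warehouse/bdms_hzyaoqin_test.db/abc")
>
> // A FileSystem handle obtained for the default FS rejects paths on another cluster:
> val defaultFs = FileSystem.get(conf)
> // defaultFs.getFileStatus(remote)
> //   => java.lang.IllegalArgumentException: Wrong FS: hdfs://cluster2/..., expected: hdfs://cluster1
>
> // Resolving the FileSystem from the path itself yields a handle bound to hdfs://cluster2:
> val remoteFs = remote.getFileSystem(conf)
> // remoteFs.getFileStatus(remote) // works, because the handle matches the path's authority
> {code}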