Hello.  In Spark 4, loading a dataframe from a path that contains a wildcard 
produces a warning and a stack trace that do not appear in Spark 3.

>>> spark.read.load('s3a://ullswater-dev/uw01/temp/test_parquet/*.parquet')
25/07/22 08:33:38 WARN org.apache.spark.sql.execution.streaming.FileStreamSink: 
Assume no metadata directory. Error while looking for metadata directory in the 
path: s3a://ullswater-dev/uw01/temp/test_parquet/*.parquet.
java.io.FileNotFoundException: No such file or directory: 
s3a://ullswater-dev/uw01/temp/test_parquet/*.parquet
      at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:4156)
      at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:4007)

I think it's due to the change from this

https://github.com/apache/spark/blob/v3.5.6/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L54

to this

https://github.com/apache/spark/blob/v4.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L56


S3AFileSystem.isDirectory(hdfsPath) simply returns false when hdfsPath 
contains a wildcard (since no object exists at that literal path), whereas 
S3AFileSystem.getFileStatus(hdfsPath).isDirectory throws a 
FileNotFoundException first.
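The contrast between the two calls is analogous to os.path.isdir versus os.stat in plain Python. This is only an analogy on the local filesystem, not the actual Hadoop code, but it shows the same "return false" versus "throw" behaviour for a path that doesn't literally exist:

```python
import os

# A glob pattern is not a literal path, so no file/directory exists at it.
wildcard = "/no/such/dir/*.parquet"

# isDirectory-style check: a missing path just yields False, no exception.
print(os.path.isdir(wildcard))  # False

# getFileStatus(...).isDirectory-style check: stat the path first, then ask
# whether it's a directory; the stat itself fails on a missing path.
try:
    os.stat(wildcard)
except FileNotFoundError as e:
    print("raised:", e)
```

If FileStreamSink in Spark 4 takes the second route, the FileNotFoundException from S3A would be caught and logged (hence the WARN plus stack trace) even though the subsequent glob-based load still succeeds.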

Is this a bug?  Thanks.

