Configuration to disable file exists in DataSource

Romain Ardiet Tue, 16 Apr 2024 14:30:27 -0700

Hi community,

When using DataFrameReader to read Parquet files located on S3, there is no
way to disable the file existence checks performed by the driver.


My use case is a Spark job that reads a list of S3 files generated by an
upstream job. This list can contain thousands of files.

The check is multi-threaded thanks to
https://issues.apache.org/jira/browse/SPARK-29089 but it is redundant in my
case, as the upstream job has already verified the files.

Would it make sense to add an option to control it?

Thanks,
Romain Ardiet
