[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christopher Highman updated SPARK-31962:
----------------------------------------
Description:

When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time.

{code:java}
spark.readStream
  .option("header", "true")
  .option("delimiter", "\t")
  .format("csv")
  .load("/mnt/Deltas")
{code}

In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary_, which appears to create an in-memory index of files for a given path. There may be a rather clean opportunity to consider options here.

Having the ability to provide an option specifying a timestamp by which to begin globbing files would mean considerably less complexity for a consumer who streams from a folder path but has no interest in reading what could be thousands of files that are not relevant.

One example could be "createdFileTime", accepting a UTC datetime like below.

{code:java}
spark.readStream
  .option("header", "true")
  .option("delimiter", "\t")
  .option("createdFileTime", "2020-05-01 00:00:00")
  .format("csv")
  .load("/mnt/Deltas")
{code}

If this option is specified, the expected behavior would be that files within the _/mnt/Deltas_ path must have been created at or later than the specified time in order to be consumed, whether for a one-off read or for structured streaming.
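A minimal sketch of the filtering the proposed option implies, using plain java.nio.file rather than Spark's actual file index. The class and method names are hypothetical, and the sketch filters on modification time rather than creation time, since creation time is not portably exposed (Hadoop's FileStatus, which Spark's index reads, only carries a modification time):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CreatedFileTimeFilter {

    // Parse the proposed option value ("2020-05-01 00:00:00"), interpreted as UTC.
    static Instant parseThreshold(String optionValue) {
        return LocalDateTime.parse(optionValue.replace(' ', 'T'))
                            .toInstant(ZoneOffset.UTC);
    }

    // Keep only regular files whose timestamp is at or after the threshold.
    // A file index like DataSource's could apply the same predicate while globbing.
    static List<Path> filterByTime(Path dir, Instant threshold) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.filter(Files::isRegularFile)
                        .filter(p -> {
                            try {
                                return !Files.getLastModifiedTime(p)
                                             .toInstant().isBefore(threshold);
                            } catch (IOException e) {
                                return false; // unreadable metadata: skip the file
                            }
                        })
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("deltas");
        Path oldFile = Files.createFile(dir.resolve("old.csv"));
        Files.createFile(dir.resolve("new.csv"));
        // Backdate one file so it falls before the threshold.
        Files.setLastModifiedTime(oldFile,
            FileTime.from(Instant.parse("2020-04-01T00:00:00Z")));

        List<Path> kept = filterByTime(dir, parseThreshold("2020-05-01 00:00:00"));
        System.out.println(kept.size());                  // prints 1
        System.out.println(kept.get(0).getFileName());    // prints new.csv
    }
}
{code}

Note that the predicate is inclusive ("at or later than"), matching the expected behavior described above; files exactly at the threshold are still consumed.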
> Provide option to load files after a specified date when reading from a
> folder path
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-31962
>                 URL: https://issues.apache.org/jira/browse/SPARK-31962
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Christopher Highman
>            Priority: Minor

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org