[ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:
----------------------------------------
    Description: 
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical delta files in CSV format.  When I start reading from 
a folder, however, I might only care about files that were created after a 
certain time.
time.
{code:java}
spark.readStream
     .option("header", "true")
     .option("delimiter", "\t")
     .format("csv")
     .load("/mnt/Deltas")
{code}
 

In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala],
 there is a method, _checkAndGlobPathIfNecessary_, which appears to create an 
in-memory index of files for a given path.  There may be a rather clean 
opportunity to consider such an option here.
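As a rough illustration, here is a minimal sketch of what such a filter could 
look like once the glob has produced its list of file statuses. The helper 
name and its wiring are hypothetical, not actual Spark code, and the 
"createdFileTime" option is the proposal itself:
{code:java}
import java.time.LocalDateTime
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter

import org.apache.hadoop.fs.FileStatus

// Hypothetical helper, not part of DataSource.scala: keep only files whose
// timestamp is at or after the cutoff given by the proposed option.
def filterByCreatedFileTime(
    statuses: Seq[FileStatus],
    createdFileTime: Option[String]): Seq[FileStatus] = {
  createdFileTime match {
    case Some(ts) =>
      // Parse the option value as a UTC datetime, e.g. "2020-05-01 00:00:00".
      val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
      val cutoffMillis =
        LocalDateTime.parse(ts, formatter).toInstant(ZoneOffset.UTC).toEpochMilli
      // HDFS-compatible file systems expose a modification time rather than
      // a true creation time, so it serves as the closest available proxy.
      statuses.filter(_.getModificationTime >= cutoffMillis)
    case None => statuses
  }
}
{code}
Filtering at this point would keep the pruning inside the in-memory file 
index, so nothing downstream would ever see the excluded files.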

Having the ability to provide an option specifying a timestamp by which to 
begin globbing files would remove quite a bit of complexity for a consumer 
who streams from a folder path but has no interest in reading what could be 
thousands of files that are not relevant.
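For contrast, a consumer who wants this behavior today has to list and filter 
the files manually before handing them to a reader. A minimal sketch of that 
workaround, assuming a Hadoop-compatible file system:
{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}

// Manual workaround: list the folder, filter by modification time, and
// pass only the surviving paths to the reader.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val cutoffMillis = java.time.LocalDateTime
  .parse("2020-05-01 00:00:00",
    java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))
  .toInstant(java.time.ZoneOffset.UTC).toEpochMilli

val relevantPaths = fs.listStatus(new Path("/mnt/Deltas"))
  .filter(_.getModificationTime >= cutoffMillis)
  .map(_.getPath.toString)

val df = spark.read
  .option("header", "true")
  .option("delimiter", "\t")
  .csv(relevantPaths: _*)
{code}
This only works for a batch read; _DataStreamReader.load_ takes a single 
path, so the streaming case has no comparably simple workaround, which is a 
large part of the motivation here.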

One example could be a "createdFileTime" option accepting a UTC datetime like 
below.
{code:java}
spark.readStream
     .option("header", "true")
     .option("delimiter", "\t")
     .option("createdFileTime", "2020-05-01 00:00:00")
     .format("csv")
     .load("/mnt/Deltas")
{code}
 

If this option is specified, the expected behavior would be that files within 
the _/mnt/Deltas/_ path must have been created at or later than the specified 
time in order to be consumed, whether they are read in a plain batch query or 
through structured streaming.
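Under that expectation, the same (still hypothetical) option would read 
naturally in a batch context as well:
{code:java}
spark.read
     .option("header", "true")
     .option("delimiter", "\t")
     .option("createdFileTime", "2020-05-01 00:00:00")
     .format("csv")
     .load("/mnt/Deltas")
{code}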

 

> Provide option to load files after a specified date when reading from a 
> folder path
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-31962
>                 URL: https://issues.apache.org/jira/browse/SPARK-31962
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Christopher Highman
>            Priority: Minor
>


