[ https://issues.apache.org/jira/browse/FLINK-10168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671066#comment-16671066 ]
ASF GitHub Bot commented on FLINK-10168:
----------------------------------------

bowenli86 opened a new pull request #6979: [FLINK-10168] Add FileFilter interface and FileModTimeFilter which sets a read start position for files by modification time
URL: https://github.com/apache/flink/pull/6979

## What is the purpose of the change

The motivation is

1. enabling users to set a read start position for files, so they can process only files that are modified after a given timestamp
2. exposing more file information to users and providing them with a more flexible file filter interface to define their own filtering rules

## Brief change log

- add a `FileFilter` interface through which users can access all available information about a file and define filtering rules
- allow users to set a `FileFilter` on `FileInputFormat`
- add `FileModTimeFilter`, in which users can set a read start position for files so that Flink only processes files that are modified after the given timestamp

## Verifying this change

This change added tests and can be verified as follows:

- extended unit tests for FileInputFormat in `FileInputFormatTest`
- added `FileModTimeFilterTest`

## Does this pull request potentially affect one of the following parts:

- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes)

## Documentation

- Does this pull request introduce a new feature? (yes)
- If yes, how is the feature documented?
  - Documentation will be added in another PR under a different JIRA ticket

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> support reading files whose modification time is after a given timestamp
> ------------------------------------------------------------------------
>
>                 Key: FLINK-10168
>                 URL: https://issues.apache.org/jira/browse/FLINK-10168
>             Project: Flink
>          Issue Type: Improvement
>          Components: DataStream API
>    Affects Versions: 1.6.0
>            Reporter: Bowen Li
>            Assignee: Bowen Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.8.0
>
>
> Update: The motivation is 1) enabling users to set a read start position for
> files, so they can process files that are modified after a given timestamp 2)
> exposing more file information to users and providing them with a more flexible
> file filter interface to define their own filtering rules
> ---------------
> Support filtering files by modified/created time in
> {{StreamExecutionEnvironment.readFile()}}
> For example, in a source dir with many files, we only want to read files
> that were created or modified after a specific time.
> This API can expose a generic file filter function and let users define
> filtering rules. Currently Flink only supports filtering files by path:
> the API is {{FileInputFormat.setFilesFilters(PathFilter)}}, which takes only
> one file path filter. A more generic API that can take more filters could
> look like this:
> 1) {{FileInputFormat.setFilesFilters(List (PathFilter, ModifiedTimeFilter, ... ))}}
> 2) or {{FileInputFormat.setFilesFilters(FileFilter)}}, where {{FileFilter}}
> exposes all file attributes that Flink's file system can provide, like path
> and modified time
> I lean towards the 2nd option, because it gives users more flexibility to
> define complex filtering rules based on combinations of file attributes.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
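
For illustration only, the second option in the issue description (a single filter that sees all file attributes) might be sketched roughly as below. This is plain Java and is not the code from the pull request: `FileInfo`, the `accept` method, `select`, and the constructor argument of `FileModTimeFilter` are hypothetical names chosen for the sketch, and "modified after" is assumed to mean strictly greater than the start timestamp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FileFilterSketch {

    /** Hypothetical stand-in for the attributes a file system reports for one file. */
    public static final class FileInfo {
        public final String path;
        public final long modificationTime; // epoch milliseconds

        public FileInfo(String path, long modificationTime) {
            this.path = path;
            this.modificationTime = modificationTime;
        }
    }

    /** Hypothetical filter interface: returns true if the file should be read. */
    public interface FileFilter {
        boolean accept(FileInfo file);
    }

    /** Accepts only files modified strictly after the given start timestamp (assumption). */
    public static final class FileModTimeFilter implements FileFilter {
        private final long startTimestamp;

        public FileModTimeFilter(long startTimestamp) {
            this.startTimestamp = startTimestamp;
        }

        @Override
        public boolean accept(FileInfo file) {
            return file.modificationTime > startTimestamp;
        }
    }

    /** Returns the paths of all files accepted by the filter, in input order. */
    public static List<String> select(List<FileInfo> files, FileFilter filter) {
        List<String> accepted = new ArrayList<>();
        for (FileInfo file : files) {
            if (filter.accept(file)) {
                accepted.add(file.path);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<FileInfo> files = Arrays.asList(
                new FileInfo("/data/old.csv", 1_000L),
                new FileInfo("/data/new.csv", 3_000L),
                new FileInfo("/data/new.log", 4_000L));

        // Read start position: only files modified after t = 2000.
        FileFilter modifiedAfter = new FileModTimeFilter(2_000L);

        // A combined rule over several attributes, which a path-only filter cannot express.
        FileFilter recentCsv = f -> f.path.endsWith(".csv") && modifiedAfter.accept(f);

        System.out.println(select(files, modifiedAfter));
        System.out.println(select(files, recentCsv));
    }
}
```

The point of the single-filter design is visible in `recentCsv`: because the filter receives every attribute at once, a rule can combine path and modification time without the framework needing a list of separate filter types.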