Re: Handling Schema Variability and Applying Regex Patterns in Flink Job Configuration

Yu Chen Tue, 07 Nov 2023 02:30:45 -0800

Hi Arjun,

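As stated in the documentation, 'This regex pattern should be matched with the 
absolute file path.' Your pattern '^customer.*' is anchored to the start of the 
string, so it can only match a path that begins with "customer", which an 
absolute path under /home/techuser/inputdata never does.
Therefore, you should adjust your regular expression to match absolute paths.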
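
For example, here is a minimal sketch of your table with an adjusted pattern. 
The pattern itself is my assumption, not something taken from the 
documentation: it allows any directory prefix and then requires the last path 
segment to start with "customer". I have not run this against your directory, 
so please treat it as a starting point.

```sql
-- Sketch only: the regex is matched against the absolute file path,
-- so it must account for the directory prefix, not just the file name.
CREATE TABLE sample (
  col1 STRING,
  col2 STRING,
  col3 STRING,
  col4 STRING,
  `file.path` STRING NOT NULL METADATA
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///home/techuser/inputdata',
  'format' = 'csv',
  -- Assumed pattern: any directory prefix, then a file name starting with "customer".
  'source.path.regex-pattern' = '.*/customer[^/]*',
  'source.monitor-interval' = '10000'
);
```

If you prefer to pin the directory explicitly, a pattern such as 
'/home/techuser/inputdata/customer.*' follows the same idea.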


Please let me know if there are any other problems.

Best,
Yu Chen

> On 7 Nov 2023, at 18:11, arjun s <arjunjoice...@gmail.com> wrote:
> 
> Hi Chen,
> I attempted to configure the 'source.path.regex-pattern' property in the 
> table settings as '^customer.*' to ensure that the Flink job only processes 
> file names starting with "customer" in the specified directory. However, it 
> appears that this configuration is not producing the expected results. Are 
> there any additional configurations or adjustments that need to be made? The 
> table script I used is as follows:
> CREATE TABLE sample (
>   col1 STRING,
>   col2 STRING,
>   col3 STRING,
>   col4 STRING,
>   `file.path` STRING NOT NULL METADATA
> ) WITH (
>   'connector' = 'filesystem',
>   'path' = 'file:///home/techuser/inputdata',
>   'format' = 'csv',
>   'source.path.regex-pattern' = '^customer.*',
>   'source.monitor-interval' = '10000'
> )
> Thanks in advance,
> Arjun
> 
> On Mon, 6 Nov 2023 at 20:56, Chen Yu <yuchen.e...@gmail.com> wrote:
> Hi Arjun,
> 
> If you want to filter files by a regex pattern, I think the 
> `source.path.regex-pattern` option [1] may be what you want.
> 
>   'source.path.regex-pattern' = '...',  -- optional: regex pattern to filter files to read under the
>                                         -- directory of `path` option. This regex pattern should be
>                                         -- matched with the absolute file path. If this option is set,
>                                         -- the connector will recursive all files under the directory
>                                         -- of `path` option
> 
> Best,
> Yu Chen
> 
> 
> [1] 
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/connectors/table/filesystem/
> 
> From: arjun s <arjunjoice...@gmail.com>
> Sent: 6 Nov 2023 20:50
> To: user@flink.apache.org <user@flink.apache.org>
> Subject: Handling Schema Variability and Applying Regex Patterns in Flink Job 
> Configuration
> 
> Hi team,
> I'm currently using the Table API in my Flink job to read records from CSV 
> files located in a source directory. To obtain the file names, I create a 
> table and specify its schema using the Table API. When the schema matches, 
> the job submits and executes as intended. However, when the schema does not 
> match, the job fails to submit. Since the schema of the files in the source 
> directory is unpredictable, I'm looking for a way to handle this situation.
> Create table query
> =============
> CREATE TABLE sample (col1 STRING, col2 STRING, col3 STRING, col4 STRING,
> `file.path` STRING NOT NULL METADATA) WITH ('connector' = 'filesystem',
> 'path' = 'file:///home/techuser/inputdata', 'format' = 'csv',
> 'source.monitor-interval' = '10000')
> =============
> 
> Furthermore, I have a question about whether there's a way to read files from 
> the source directory based on a specific regex pattern. This is relevant in 
> our situation because only file names that match a particular pattern need to 
> be processed by the Flink job.
> 
> Thanks and Regards,
> Arjun
