This is an automated email from the ASF dual-hosted git repository.

corgy pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git

The following commit(s) were added to refs/heads/dev by this push:
     new 48fb7ef697 [Improve][Csv] support configurable CSV delimiter in file connector (#9660)
48fb7ef697 is described below
commit 48fb7ef6974bbac543ec5dc645986b3a1d573b21
Author: zhenyue-xu <[email protected]>
AuthorDate: Mon Aug 11 01:08:32 2025 +0800
[Improve][Csv] support configurable CSV delimiter in file connector (#9660)
---
docs/en/connector-v2/sink/CosFile.md | 6 +-
docs/en/connector-v2/sink/FtpFile.md | 4 +-
docs/en/connector-v2/sink/HdfsFile.md | 4 +-
docs/en/connector-v2/sink/LocalFile.md | 6 +-
docs/en/connector-v2/sink/OssFile.md | 4 +-
docs/en/connector-v2/sink/OssJindoFile.md | 70 +++++++++++-----------
docs/en/connector-v2/sink/S3File.md | 6 +-
docs/en/connector-v2/sink/SftpFile.md | 6 +-
docs/en/connector-v2/source/CosFile.md | 58 +++++++++---------
docs/en/connector-v2/source/FtpFile.md | 64 ++++++++++----------
docs/en/connector-v2/source/HdfsFile.md | 64 ++++++++++----------
docs/en/connector-v2/source/LocalFile.md | 2 +-
docs/en/connector-v2/source/OssJindoFile.md | 54 ++++++++---------
docs/en/connector-v2/source/S3File.md | 2 +-
docs/en/connector-v2/source/SftpFile.md | 60 +++++++++----------
.../seatunnel/file/config/BaseFileSinkConfig.java | 19 ++++--
.../seatunnel/file/config/FileFormat.java | 1 -
.../file/sink/writer/CsvWriteStrategy.java | 5 +-
.../file/source/reader/CsvReadStrategy.java | 15 +++--
.../file/source/reader/CsvReadStrategyTest.java | 31 ++++++++++
.../file/writer/CsvWriteStrategyTest.java | 63 +++++++++++++++++++
.../src/test/resources/test-csv.csv | 3 +
.../test/resources/csv/local_csv_to_assert.conf | 2 +-
23 files changed, 331 insertions(+), 218 deletions(-)
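For readers skimming the patch, a minimal sink sketch of what the change enables (a hedged example, not part of the commit: the LocalFile connector and the path and ";" values are illustrative placeholders; the option names come from the docs updated below):

    sink {
      LocalFile {
        path = "/tmp/seatunnel/csv"
        file_format_type = "csv"
        # csv output now honors field_delimiter; it defaults to "," when omitted
        field_delimiter = ";"
      }
    }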
diff --git a/docs/en/connector-v2/sink/CosFile.md b/docs/en/connector-v2/sink/CosFile.md
index 36c3afbff4..0a544fae06 100644
--- a/docs/en/connector-v2/sink/CosFile.md
+++ b/docs/en/connector-v2/sink/CosFile.md
@@ -53,8 +53,8 @@ To use this connector you need put hadoop-cos-{hadoop.version}-{version}.jar and
| filename_time_format | string | no | "yyyy.MM.dd"
| Only used when custom_filename is true
|
| file_format_type | string | no | "csv"
|
|
| filename_extension | string | no | -
| Override the default file name extensions with
custom file name extensions. E.g. `.xml`, `.json`, `dat`, `.customtype`
|
-| field_delimiter | string | no | '\001' | Only used when file_format is text |
-| row_delimiter | string | no | "\n" | Only used when file_format is `text`, `csv` and `json` |
+| field_delimiter | string | no | '\001' for text and ',' for csv | Only used when file_format is text and csv |
+| row_delimiter | string | no | "\n" | Only used when file_format is `text`, `csv` and `json` |
| have_partition | boolean | no | false
| Whether you need processing partitions.
|
| partition_by | array | no | -
| Only used then have_partition is true
|
| partition_dir_expression | string | no |
"${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used then have_partition is
true
|
@@ -134,7 +134,7 @@ Please note that, The final file name will end with the file_format's suffix, th
### field_delimiter [string]
-The separator between columns in a row of data. Only needed by `text` file format.
+The separator between columns in a row of data. Only needed by `text` and `csv` file format.
### row_delimiter [string]
diff --git a/docs/en/connector-v2/sink/FtpFile.md b/docs/en/connector-v2/sink/FtpFile.md
index cb766a03e7..8bba44368d 100644
--- a/docs/en/connector-v2/sink/FtpFile.md
+++ b/docs/en/connector-v2/sink/FtpFile.md
@@ -53,7 +53,7 @@ By default, we use 2PC commit to ensure `exactly-once`
| filename_time_format | string | no | "yyyy.MM.dd"
| Only used when custom_filename is true
|
| file_format_type | string | no | "csv"
|
|
| filename_extension | string | no | -
| Override the default file name extensions with
custom file name extensions. E.g. `.xml`, `.json`, `dat`, `.customtype`
|
-| field_delimiter | string | no | '\001' | Only used when file_format_type is text |
+| field_delimiter | string | no | '\001' for text and ',' for csv | Only used when file_format_type is text and csv |
| row_delimiter | string | no | "\n"
| Only used when file_format_type is `text`, `csv`
and `json`
|
| have_partition | boolean | no | false
| Whether you need processing partitions.
|
| partition_by | array | no | -
| Only used then have_partition is true
|
@@ -147,7 +147,7 @@ Please note that, The final file name will end with the file_format_type's suffi
### field_delimiter [string]
-The separator between columns in a row of data. Only needed by `text` file format.
+The separator between columns in a row of data. Only needed by `text` and `csv` file format.
### row_delimiter [string]
diff --git a/docs/en/connector-v2/sink/HdfsFile.md b/docs/en/connector-v2/sink/HdfsFile.md
index ec8e258d51..3c72f4b2c2 100644
--- a/docs/en/connector-v2/sink/HdfsFile.md
+++ b/docs/en/connector-v2/sink/HdfsFile.md
@@ -55,8 +55,8 @@ Output data to hdfs file
| filename_time_format | string | no | "yyyy.MM.dd"
| Only used when `custom_filename` is `true`.When
the format in the `file_name_expression` parameter is `xxxx-${now}` ,
`filename_time_format` can specify the time format of the path, and the default
value is `yyyy.MM.dd` . The commonly used time formats are listed as
follows:[y:Year,M:Month,d:Day of month,H:Hour in day (0-23),m:Minute in
hour,s:Second in minute] [...]
| file_format_type | string | no | "csv"
| We supported as the following file types:`text`
`csv` `parquet` `orc` `json` `excel` `xml` `binary`.Please note that, The final
file name will end with the file_format's suffix, the suffix of the text file
is `txt`.
[...]
| filename_extension | string | no | -
| Override the default file name extensions with
custom file name extensions. E.g. `.xml`, `.json`, `dat`, `.customtype`
[...]
-| field_delimiter | string | no | '\001' | Only used when file_format is text,The separator between columns in a row of data. Only needed by `text` file format. [...]
-| row_delimiter | string | no | "\n" | Only used when file_format is text,The separator between rows in a file. Only needed by `text`, `csv` and `json` file format. [...]
+| field_delimiter | string | no | '\001' for text and ',' for csv | Only used when file_format is text and csv,The separator between columns in a row of data. Only needed by `text` file format. [...]
+| row_delimiter | string | no | "\n" | Only used when file_format is text,The separator between rows in a file. Only needed by `text`, `csv` and `json` file format. [...]
| have_partition | boolean | no | false
| Whether you need processing partitions.
[...]
| partition_by | array | no | -
| Only used then have_partition is true,Partition
data based on selected fields.
[...]
| partition_dir_expression | string | no |
"${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used then have_partition is
true,If the `partition_by` is specified, we will generate the corresponding
partition directory based on the partition information, and the final file will
be placed in the partition directory. Default `partition_dir_expression` is
`${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field
and `v0` is the value of the first partit [...]
diff --git a/docs/en/connector-v2/sink/LocalFile.md b/docs/en/connector-v2/sink/LocalFile.md
index 40f7259a2b..df83f2acdb 100644
--- a/docs/en/connector-v2/sink/LocalFile.md
+++ b/docs/en/connector-v2/sink/LocalFile.md
@@ -48,8 +48,8 @@ By default, we use 2PC commit to ensure `exactly-once`
| filename_time_format | string | no | "yyyy.MM.dd"
| Only used when custom_filename is true
|
| file_format_type | string | no | "csv"
|
|
| filename_extension | string | no | -
| Override the default file name extensions with
custom file name extensions. E.g. `.xml`, `.json`, `dat`, `.customtype`
|
-| field_delimiter | string | no | '\001' | Only used when file_format_type is text |
-| row_delimiter | string | no | "\n" | Only used when file_format_type is `text`, `csv` and `json` |
+| field_delimiter | string | no | '\001' for text and ',' for csv | Only used when file_format_type is text and csv |
+| row_delimiter | string | no | "\n" | Only used when file_format_type is `text`, `csv` and `json` |
| have_partition | boolean | no | false
| Whether you need processing partitions.
|
| partition_by | array | no | -
| Only used then have_partition is true
|
| partition_dir_expression | string | no |
"${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used then have_partition is
true
|
@@ -116,7 +116,7 @@ Please note that, The final file name will end with the file_format_type's suffi
### field_delimiter [string]
-The separator between columns in a row of data. Only needed by `text` file format.
+The separator between columns in a row of data. Only needed by `text` and `csv` file format.
### row_delimiter [string]
diff --git a/docs/en/connector-v2/sink/OssFile.md b/docs/en/connector-v2/sink/OssFile.md
index 1d27ed2e9d..e5c7bcbfb2 100644
--- a/docs/en/connector-v2/sink/OssFile.md
+++ b/docs/en/connector-v2/sink/OssFile.md
@@ -105,7 +105,7 @@ If write to `csv`, `text`, `json` file type, All column will be string.
| filename_time_format | string | no | "yyyy.MM.dd"
| Only used when custom_filename is true
|
| file_format_type | string | no | "csv"
|
|
| filename_extension | string | no | -
| Override the default file name extensions with
custom file name extensions. E.g. `.xml`, `.json`, `dat`, `.customtype`
|
-| field_delimiter | string | no | '\001' | Only used when file_format_type is text |
+| field_delimiter | string | no | '\001' for text and ',' for csv | Only used when file_format_type is text and csv |
| row_delimiter | string | no | "\n"
| Only used when file_format_type is `text`, `csv`
and `json`
|
| have_partition | boolean | no | false
| Whether you need processing partitions.
|
| partition_by | array | no | -
| Only used then have_partition is true
|
@@ -189,7 +189,7 @@ Please note that, The final file name will end with the file_format_type's suffi
### field_delimiter [string]
-The separator between columns in a row of data. Only needed by `text` file format.
+The separator between columns in a row of data. Only needed by `text` and `csv` file format.
### row_delimiter [string]
diff --git a/docs/en/connector-v2/sink/OssJindoFile.md b/docs/en/connector-v2/sink/OssJindoFile.md
index 65f75bd85d..941dd02421 100644
--- a/docs/en/connector-v2/sink/OssJindoFile.md
+++ b/docs/en/connector-v2/sink/OssJindoFile.md
@@ -44,41 +44,41 @@ It only supports hadoop version **2.9.X+**.
## Options
-| Name | Type | Required | Default
| Description
|
-|---------------------------------------|---------|----------|--------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| path | string | yes | -
|
|
-| tmp_path | string | no | /tmp/seatunnel
| The result file will write to a tmp path first and
then use `mv` to submit tmp dir to target dir. Need a OSS dir.
|
-| bucket | string | yes | -
|
|
-| access_key | string | yes | -
|
|
-| access_secret | string | yes | -
|
|
-| endpoint | string | yes | -
|
|
-| custom_filename | boolean | no | false
| Whether you need custom the filename
|
-| file_name_expression | string | no |
"${transactionId}" | Only used when custom_filename is
true
|
-| filename_time_format | string | no | "yyyy.MM.dd"
| Only used when custom_filename is true
|
-| file_format_type | string | no | "csv"
|
|
-| filename_extension | string | no | -
| Override the default file name extensions with
custom file name extensions. E.g. `.xml`, `.json`, `dat`, `.customtype`
|
-| field_delimiter | string | no | '\001' | Only used when file_format_type is text |
-| row_delimiter | string | no | "\n"
| Only used when file_format_type is `text`, `csv`
and `json`
|
-| have_partition | boolean | no | false
| Whether you need processing partitions.
|
-| partition_by | array | no | -
| Only used then have_partition is true
|
-| partition_dir_expression | string | no |
"${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used then have_partition is
true
|
-| is_partition_field_write_in_file | boolean | no | false
| Only used then have_partition is true
|
-| sink_columns | array | no |
| When this parameter is empty, all fields are sink
columns
|
-| is_enable_transaction | boolean | no | true
|
|
-| batch_size | int | no | 1000000
|
|
-| compress_codec | string | no | none
|
|
-| common-options | object | no | -
|
|
-| max_rows_in_memory | int | no | -
| Only used when file_format_type is excel.
|
-| sheet_name | string | no | Sheet${Random
number} | Only used when file_format_type is excel.
|
-| csv_string_quote_mode | enum | no | MINIMAL
| Only used when file_format is csv.
|
-| xml_root_tag | string | no | RECORDS
| Only used when file_format is xml.
|
-| xml_row_tag | string | no | RECORD
| Only used when file_format is xml.
|
-| xml_use_attr_format | boolean | no | -
| Only used when file_format is xml.
|
+| Name | Type | Required | Default
| Description
|
+|---------------------------------------|---------|----------|--------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| path | string | yes | -
|
|
+| tmp_path | string | no | /tmp/seatunnel
| The result file will write to a tmp path first and
then use `mv` to submit tmp dir to target dir. Need a OSS dir.
|
+| bucket | string | yes | -
|
|
+| access_key | string | yes | -
|
|
+| access_secret | string | yes | -
|
|
+| endpoint | string | yes | -
|
|
+| custom_filename | boolean | no | false
| Whether you need custom the filename
|
+| file_name_expression | string | no |
"${transactionId}" | Only used when custom_filename is
true
|
+| filename_time_format | string | no | "yyyy.MM.dd"
| Only used when custom_filename is true
|
+| file_format_type | string | no | "csv"
|
|
+| filename_extension | string | no | -
| Override the default file name extensions with
custom file name extensions. E.g. `.xml`, `.json`, `dat`, `.customtype`
|
+| field_delimiter | string | no | '\001' for text and ',' for csv | Only used when file_format_type is text and csv |
+| row_delimiter | string | no | "\n"
| Only used when file_format_type is `text`, `csv`
and `json`
|
+| have_partition | boolean | no | false
| Whether you need processing partitions.
|
+| partition_by | array | no | -
| Only used then have_partition is true
|
+| partition_dir_expression | string | no |
"${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used then have_partition is
true
|
+| is_partition_field_write_in_file | boolean | no | false
| Only used then have_partition is true
|
+| sink_columns | array | no |
| When this parameter is empty, all fields are sink
columns
|
+| is_enable_transaction | boolean | no | true
|
|
+| batch_size | int | no | 1000000
|
|
+| compress_codec | string | no | none
|
|
+| common-options | object | no | -
|
|
+| max_rows_in_memory | int | no | -
| Only used when file_format_type is excel.
|
+| sheet_name | string | no | Sheet${Random
number} | Only used when file_format_type is excel.
|
+| csv_string_quote_mode | enum | no | MINIMAL
| Only used when file_format is csv.
|
+| xml_root_tag | string | no | RECORDS
| Only used when file_format is xml.
|
+| xml_row_tag | string | no | RECORD
| Only used when file_format is xml.
|
+| xml_use_attr_format | boolean | no | -
| Only used when file_format is xml.
|
| single_file_mode | boolean | no | false
| Each parallelism will only output one file. When
this parameter is turned on, batch_size will not take effect. The output file
name does not have a file block suffix. |
-| create_empty_file_when_no_data | boolean | no | false
| When there is no data synchronization upstream,
the corresponding data files are still generated.
|
-| parquet_avro_write_timestamp_as_int96 | boolean | no | false
| Only used when file_format is parquet.
|
-| parquet_avro_write_fixed_as_int96 | array | no | -
| Only used when file_format is parquet.
|
-| encoding | string | no | "UTF-8"
| Only used when file_format_type is
json,text,csv,xml.
|
+| create_empty_file_when_no_data | boolean | no | false
| When there is no data synchronization upstream,
the corresponding data files are still generated.
|
+| parquet_avro_write_timestamp_as_int96 | boolean | no | false
| Only used when file_format is parquet.
|
+| parquet_avro_write_fixed_as_int96 | array | no | -
| Only used when file_format is parquet.
|
+| encoding | string | no | "UTF-8"
| Only used when file_format_type is
json,text,csv,xml.
|
### path [string]
@@ -138,7 +138,7 @@ Please note that, The final file name will end with the file_format_type's suffi
### field_delimiter [string]
-The separator between columns in a row of data. Only needed by `text` file format.
+The separator between columns in a row of data. Only needed by `text` and `csv` file format.
### row_delimiter [string]
diff --git a/docs/en/connector-v2/sink/S3File.md b/docs/en/connector-v2/sink/S3File.md
index 99e79f3941..f6dc178f84 100644
--- a/docs/en/connector-v2/sink/S3File.md
+++ b/docs/en/connector-v2/sink/S3File.md
@@ -114,8 +114,8 @@ If write to `csv`, `text` file type, All column will be string.
| filename_time_format | string | no | "yyyy.MM.dd"
| Only used when custom_filename is true
|
| file_format_type | string | no | "csv"
|
|
| filename_extension | string | no | -
| Override the default file name
extensions with custom file name extensions. E.g. `.xml`, `.json`, `dat`,
`.customtype` |
-| field_delimiter | string | no | '\001' | Only used when file_format is text |
-| row_delimiter | string | no | "\n" | Only used when file_format is `text`, `csv` and `json` |
+| field_delimiter | string | no | '\001' for text and ',' for csv | Only used when file_format is text and csv |
+| row_delimiter | string | no | "\n" | Only used when file_format is `text`, `csv` and `json` |
| have_partition | boolean | no | false
| Whether you need processing partitions.
|
| partition_by | array | no | -
| Only used when have_partition is true
|
| partition_dir_expression | string | no |
"${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used when
have_partition is true
|
@@ -194,7 +194,7 @@ Please note that, The final file name will end with the file_format_type's suffi
### field_delimiter [string]
-The separator between columns in a row of data. Only needed by `text` file format.
+The separator between columns in a row of data. Only needed by `text` and `csv` file format.
### row_delimiter [string]
diff --git a/docs/en/connector-v2/sink/SftpFile.md b/docs/en/connector-v2/sink/SftpFile.md
index a0f6e79302..0ac163f144 100644
--- a/docs/en/connector-v2/sink/SftpFile.md
+++ b/docs/en/connector-v2/sink/SftpFile.md
@@ -51,8 +51,8 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
| filename_time_format | string | no | "yyyy.MM.dd"
| Only used when custom_filename is true
|
| file_format_type | string | no | "csv"
|
|
| filename_extension | string | no | -
| Override the default file name extensions with
custom file name extensions. E.g. `.xml`, `.json`, `dat`, `.customtype`
|
-| field_delimiter | string | no | '\001' | Only used when file_format_type is text |
-| row_delimiter | string | no | "\n" | Only used when file_format_type is `text`, `csv` and `json` |
+| field_delimiter | string | no | '\001' for text and ',' for csv | Only used when file_format_type is text and csv |
+| row_delimiter | string | no | "\n" | Only used when file_format_type is `text`, `csv` and `json` |
| have_partition | boolean | no | false
| Whether you need processing partitions.
|
| partition_by | array | no | -
| Only used then have_partition is true
|
| partition_dir_expression | string | no |
"${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used then have_partition is
true
|
@@ -135,7 +135,7 @@ Please note that, The final file name will end with the file_format_type's suffi
### field_delimiter [string]
-The separator between columns in a row of data. Only needed by `text` file format.
+The separator between columns in a row of data. Only needed by `text` and `csv` file format.
### row_delimiter [string]
diff --git a/docs/en/connector-v2/source/CosFile.md b/docs/en/connector-v2/source/CosFile.md
index 1e86f6c2e3..92fe080254 100644
--- a/docs/en/connector-v2/source/CosFile.md
+++ b/docs/en/connector-v2/source/CosFile.md
@@ -51,35 +51,35 @@ To use this connector you need put hadoop-cos-{hadoop.version}-{version}.jar and
## Options
-| name | type | required | default value |
-|---------------------------|---------|----------|---------------------|
-| path | string | yes | - |
-| file_format_type | string | yes | - |
-| bucket | string | yes | - |
-| secret_id | string | yes | - |
-| secret_key | string | yes | - |
-| region | string | yes | - |
-| read_columns | list | yes | - |
-| delimiter/field_delimiter | string | no | \001 |
-| row_delimiter | string | no | \n |
-| parse_partition_from_path | boolean | no | true |
-| skip_header_row_number | long | no | 0 |
-| date_format | string | no | yyyy-MM-dd |
-| datetime_format | string | no | yyyy-MM-dd HH:mm:ss |
-| time_format | string | no | HH:mm:ss |
-| schema | config | no | - |
-| sheet_name | string | no | - |
-| xml_row_tag | string | no | - |
-| xml_use_attr_format | boolean | no | - |
-| csv_use_header_line | boolean | no | false |
-| file_filter_pattern | string | no | - |
-| filename_extension | string | no | - |
-| compress_codec | string | no | none |
-| archive_compress_codec | string | no | none |
-| encoding | string | no | UTF-8 |
-| binary_chunk_size | int | no | 1024 |
-| binary_complete_file_mode | boolean | no | false |
-| common-options | | no | - |
+| name | type | required | default value
|
+|---------------------------|---------|----------|-----------------------------|
+| path | string | yes | -
|
+| file_format_type | string | yes | -
|
+| bucket | string | yes | -
|
+| secret_id | string | yes | -
|
+| secret_key | string | yes | -
|
+| region | string | yes | -
|
+| read_columns | list | yes | -
|
+| delimiter/field_delimiter | string | no | \001 for text and , for csv |
+| row_delimiter | string | no | \n
|
+| parse_partition_from_path | boolean | no | true
|
+| skip_header_row_number | long | no | 0
|
+| date_format | string | no | yyyy-MM-dd
|
+| datetime_format | string | no | yyyy-MM-dd HH:mm:ss
|
+| time_format | string | no | HH:mm:ss
|
+| schema | config | no | -
|
+| sheet_name | string | no | -
|
+| xml_row_tag | string | no | -
|
+| xml_use_attr_format | boolean | no | -
|
+| csv_use_header_line | boolean | no | false
|
+| file_filter_pattern | string | no | -
|
+| filename_extension | string | no | -
|
+| compress_codec | string | no | none
|
+| archive_compress_codec | string | no | none
|
+| encoding | string | no | UTF-8
|
+| binary_chunk_size | int | no | 1024
|
+| binary_complete_file_mode | boolean | no | false
|
+| common-options | | no | -
|
### path [string]
diff --git a/docs/en/connector-v2/source/FtpFile.md b/docs/en/connector-v2/source/FtpFile.md
index e927ffe4ca..4a7dc78f44 100644
--- a/docs/en/connector-v2/source/FtpFile.md
+++ b/docs/en/connector-v2/source/FtpFile.md
@@ -44,38 +44,38 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
## Options
-| name | type | required | default value |
-|-----------------------------|---------|----------|---------------------|
-| host | string | yes | - |
-| port | int | yes | - |
-| user | string | yes | - |
-| password | string | yes | - |
-| path | string | yes | - |
-| file_format_type | string | yes | - |
-| connection_mode | string | no | active_local |
-| remote_verification_enabled | boolean | no | true |
-| delimiter/field_delimiter | string | no | \001 |
-| row_delimiter | string | no | \n |
-| read_columns | list | no | - |
-| parse_partition_from_path | boolean | no | true |
-| date_format | string | no | yyyy-MM-dd |
-| datetime_format | string | no | yyyy-MM-dd HH:mm:ss |
-| time_format | string | no | HH:mm:ss |
-| skip_header_row_number | long | no | 0 |
-| schema | config | no | - |
-| sheet_name | string | no | - |
-| xml_row_tag | string | no | - |
-| xml_use_attr_format | boolean | no | - |
-| csv_use_header_line | boolean | no | - |
-| file_filter_pattern | string | no | - |
-| filename_extension | string | no | - |
-| compress_codec | string | no | none |
-| archive_compress_codec | string | no | none |
-| encoding | string | no | UTF-8 |
-| null_format | string | no | - |
-| binary_chunk_size | int | no | 1024 |
-| binary_complete_file_mode | boolean | no | false |
-| common-options | | no | - |
+| name | type | required | default value
|
+|-----------------------------|---------|----------|-----------------------------|
+| host | string | yes | -
|
+| port | int | yes | -
|
+| user | string | yes | -
|
+| password | string | yes | -
|
+| path | string | yes | -
|
+| file_format_type | string | yes | -
|
+| connection_mode | string | no | active_local
|
+| remote_verification_enabled | boolean | no | true
|
+| delimiter/field_delimiter | string | no | \001 for text and , for csv |
+| row_delimiter | string | no | \n
|
+| read_columns | list | no | -
|
+| parse_partition_from_path | boolean | no | true
|
+| date_format | string | no | yyyy-MM-dd
|
+| datetime_format | string | no | yyyy-MM-dd HH:mm:ss
|
+| time_format | string | no | HH:mm:ss
|
+| skip_header_row_number | long | no | 0
|
+| schema | config | no | -
|
+| sheet_name | string | no | -
|
+| xml_row_tag | string | no | -
|
+| xml_use_attr_format | boolean | no | -
|
+| csv_use_header_line | boolean | no | -
|
+| file_filter_pattern | string | no | -
|
+| filename_extension | string | no | -
|
+| compress_codec | string | no | none
|
+| archive_compress_codec | string | no | none
|
+| encoding | string | no | UTF-8
|
+| null_format | string | no | -
|
+| binary_chunk_size | int | no | 1024
|
+| binary_complete_file_mode | boolean | no | false
|
+| common-options | | no | -
|
### host [string]
diff --git a/docs/en/connector-v2/source/HdfsFile.md b/docs/en/connector-v2/source/HdfsFile.md
index ba91c6d8f2..3e8d0e7b2b 100644
--- a/docs/en/connector-v2/source/HdfsFile.md
+++ b/docs/en/connector-v2/source/HdfsFile.md
@@ -47,38 +47,38 @@ Read data from hdfs file system.
## Source Options
-| Name | Type | Required | Default |
Description
|
-|---------------------------|---------|----------|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| path | string | yes | - | The
source file path.
|
-| file_format_type | string | yes | - | We
supported as the following file types:`text` `csv` `parquet` `orc` `json`
`excel` `xml` `binary`.Please note that, The final file name will end with the
file_format's suffix, the suffix of the text file is `txt`.
|
-| fs.defaultFS | string | yes | - | The
hadoop cluster address that start with `hdfs://`, for example:
`hdfs://hadoopcluster`
|
-| read_columns | list | no | - | The
read column list of the data source, user can use it to implement field
projection.The file type supported column projection as the following
shown:[text,json,csv,orc,parquet,excel,xml].Tips: If the user wants to use this
feature when reading `text` `json` `csv` files, the schema option must be
configured. |
-| hdfs_site_path | string | no | - | The
path of `hdfs-site.xml`, used to load ha configuration of namenodes
|
-| delimiter/field_delimiter | string | no | \001 | Field delimiter, used to tell connector how to slice and dice fields when reading text files. default `\001`, the same as hive's default delimiter |
-| row_delimiter | string | no | \n | Row
delimiter, used to tell connector how to slice and dice rows when reading text
files. default `\n`
|
-| parse_partition_from_path | boolean | no | true |
Control whether parse the partition keys and values from file path. For example
if you read a file from path
`hdfs://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`. Every
record data from file will be added these two
fields:[name:tyrantlucifer,age:26].Tips:Do not define partition fields in
schema option. |
-| date_format | string | no | yyyy-MM-dd | Date
type format, used to tell connector how to convert string to date, supported as
the following formats:`yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd` default
`yyyy-MM-dd`.Date type format, used to tell connector how to convert string to
date, supported as the following formats:`yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`
default `yyyy-MM-dd` |
-| datetime_format | string | no | yyyy-MM-dd HH:mm:ss |
Datetime type format, used to tell connector how to convert string to datetime,
supported as the following formats:`yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss`
`yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss` .default `yyyy-MM-dd HH:mm:ss`
|
-| time_format | string | no | HH:mm:ss | Time
type format, used to tell connector how to convert string to time, supported as
the following formats:`HH:mm:ss` `HH:mm:ss.SSS`.default `HH:mm:ss`
|
-| remote_user | string | no | - | The
login user used to connect to hadoop login name. It is intended to be used for
remote users in RPC, it won't have any credentials.
|
-| krb5_path | string | no | /etc/krb5.conf | The
krb5 path of kerberos
|
-| kerberos_principal | string | no | - | The
principal of kerberos
|
-| kerberos_keytab_path | string | no | - | The
keytab path of kerberos
|
-| skip_header_row_number | long | no | 0 | Skip
the first few lines, but only for the txt and csv.For example, set like
following:`skip_header_row_number = 2`.then Seatunnel will skip the first 2
lines from source files
|
-| schema | config | no | - | the
schema fields of upstream data
|
-| sheet_name | string | no | - |
Reader the sheet of the workbook,Only used when file_format is excel.
|
-| xml_row_tag | string | no | - |
Specifies the tag name of the data rows within the XML file, only used when
file_format is xml.
|
-| xml_use_attr_format | boolean | no | - |
Specifies whether to process data using the tag attribute format, only used
when file_format is xml.
|
-| csv_use_header_line | boolean | no | false |
Whether to use the header line to parse the file, only used when the
file_format is `csv` and the file contains the header line that match RFC 4180
|
-| file_filter_pattern | string | no | |
Filter pattern, which used for filtering files.
|
-| filename_extension | string | no | - |
Filter filename extension, which used for filtering files with specific
extension. Example: `csv` `.txt` `json` `.xml`.
|
-| compress_codec | string | no | none | The
compress codec of files
|
-| archive_compress_codec | string | no | none |
-| encoding | string | no | UTF-8 |
|
-| null_format | string | no | - | Only
used when file_format_type is text. null_format to define which strings can be
represented as null. e.g: `\N`
|
-| binary_chunk_size | int | no | 1024 | Only
used when file_format_type is binary. The chunk size (in bytes) for reading
binary files. Default is 1024 bytes. Larger values may improve performance for
large files but use more memory.
|
-| binary_complete_file_mode | boolean | no | false | Only
used when file_format_type is binary. Whether to read the complete file as a
single chunk instead of splitting into chunks. When enabled, the entire file
content will be read into memory at once. Default is false.
|
-| common-options | | no | - |
Source plugin common parameters, please refer to [Source Common
Options](../source-common-options.md) for details.
|
+| Name | Type | Required | Default
| Description
|
+|---------------------------|---------|----------|-----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| path | string | yes | -
| The source file path.
|
+| file_format_type | string | yes | -
| We supported as the following file types:`text` `csv` `parquet` `orc` `json`
`excel` `xml` `binary`.Please note that, The final file name will end with the
file_format's suffix, the suffix of the text file is `txt`.
|
+| fs.defaultFS | string | yes | -
| The hadoop cluster address that start with `hdfs://`, for example:
`hdfs://hadoopcluster`
|
+| read_columns | list | no | -
| The read column list of the data source, user can use it to implement field
projection.The file type supported column projection as the following
shown:[text,json,csv,orc,parquet,excel,xml].Tips: If the user wants to use this
feature when reading `text` `json` `csv` files, the schema option must be
configured. |
+| hdfs_site_path | string | no | -
| The path of `hdfs-site.xml`, used to load ha configuration of namenodes
|
+| delimiter/field_delimiter | string | no | \001 for text and , for csv | Field delimiter, used to tell connector how to slice and dice fields when reading text files. default `\001`, the same as hive's default delimiter |
+| row_delimiter | string | no | \n
| Row delimiter, used to tell connector how to slice and dice rows when reading
text files. default `\n`
|
+| parse_partition_from_path | boolean | no | true
| Control whether parse the partition keys and values from file path. For
example if you read a file from path
`hdfs://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`. Every
record data from file will be added these two
fields:[name:tyrantlucifer,age:26].Tips:Do not define partition fields in
schema option. |
+| date_format | string | no | yyyy-MM-dd
| Date type format, used to tell connector how to convert string to date,
supported as the following formats:`yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`
default `yyyy-MM-dd`.Date type format, used to tell connector how to convert
string to date, supported as the following formats:`yyyy-MM-dd` `yyyy.MM.dd`
`yyyy/MM/dd` default `yyyy-MM-dd` |
+| datetime_format | string | no | yyyy-MM-dd HH:mm:ss
| Datetime type format, used to tell connector how to convert string to
datetime, supported as the following formats:`yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd
HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss` .default `yyyy-MM-dd HH:mm:ss`
|
+| time_format | string | no | HH:mm:ss
| Time type format, used to tell connector how to convert string to time,
supported as the following formats:`HH:mm:ss` `HH:mm:ss.SSS`.default `HH:mm:ss`
|
+| remote_user | string | no | -
| The login user used to connect to hadoop login name. It is intended to be
used for remote users in RPC, it won't have any credentials.
|
+| krb5_path | string | no | /etc/krb5.conf
| The krb5 path of kerberos
|
+| kerberos_principal | string | no | -
| The principal of kerberos
|
+| kerberos_keytab_path | string | no | -
| The keytab path of kerberos
|
+| skip_header_row_number | long | no | 0
| Skip the first few lines, but only for the txt and csv.For example, set like
following:`skip_header_row_number = 2`.then Seatunnel will skip the first 2
lines from source files
|
+| schema | config | no | -
| the schema fields of upstream data
|
+| sheet_name | string | no | -
| Reader the sheet of the workbook,Only used when file_format is excel.
|
+| xml_row_tag | string | no | -
| Specifies the tag name of the data rows within the XML file, only used when
file_format is xml.
|
+| xml_use_attr_format | boolean | no | -
| Specifies whether to process data using the tag attribute format, only used
when file_format is xml.
|
+| csv_use_header_line | boolean | no | false
| Whether to use the header line to parse the file, only used when the
file_format is `csv` and the file contains the header line that match RFC 4180
|
+| file_filter_pattern | string | no |
| Filter pattern, which used for filtering files.
|
+| filename_extension | string | no | -
| Filter filename extension, which used for filtering files with specific
extension. Example: `csv` `.txt` `json` `.xml`.
|
+| compress_codec | string | no | none
| The compress codec of files
|
+| archive_compress_codec | string | no | none
|
+| encoding | string | no | UTF-8
|
|
+| null_format | string | no | -
| Only used when file_format_type is text. null_format to define which strings
can be represented as null. e.g: `\N`
|
+| binary_chunk_size | int | no | 1024
| Only used when file_format_type is binary. The chunk size (in bytes) for
reading binary files. Default is 1024 bytes. Larger values may improve
performance for large files but use more memory.
|
+| binary_complete_file_mode | boolean | no | false
| Only used when file_format_type is binary. Whether to read the complete file
as a single chunk instead of splitting into chunks. When enabled, the entire
file content will be read into memory at once. Default is false.
|
+| common-options | | no | -
| Source plugin common parameters, please refer to [Source Common
Options](../source-common-options.md) for details.
|
### delimiter/field_delimiter [string]
diff --git a/docs/en/connector-v2/source/LocalFile.md b/docs/en/connector-v2/source/LocalFile.md
index 9cf1fc6ad6..a36e64b1c2 100644
--- a/docs/en/connector-v2/source/LocalFile.md
+++ b/docs/en/connector-v2/source/LocalFile.md
@@ -54,7 +54,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
| path | string | yes | -
|
| file_format_type | string | yes | -
|
| read_columns | list | no | -
|
-| delimiter/field_delimiter | string | no | \001 |
+| delimiter/field_delimiter | string | no | \001 for text and , for csv |
| parse_partition_from_path | boolean | no | true
|
| date_format | string | no | yyyy-MM-dd
|
| datetime_format | string | no | yyyy-MM-dd HH:mm:ss
|
diff --git a/docs/en/connector-v2/source/OssJindoFile.md b/docs/en/connector-v2/source/OssJindoFile.md
index e62e0eb4f2..94675bbeb9 100644
--- a/docs/en/connector-v2/source/OssJindoFile.md
+++ b/docs/en/connector-v2/source/OssJindoFile.md
@@ -55,33 +55,33 @@ It only supports hadoop version **2.9.X+**.
## Options
-| name | type | required | default value |
-|---------------------------|---------|----------|---------------------|
-| path | string | yes | - |
-| file_format_type | string | yes | - |
-| bucket | string | yes | - |
-| access_key | string | yes | - |
-| access_secret | string | yes | - |
-| endpoint | string | yes | - |
-| read_columns | list | no | - |
-| delimiter/field_delimiter | string | no | \001 |
-| row_delimiter | string | no | \n |
-| parse_partition_from_path | boolean | no | true |
-| date_format | string | no | yyyy-MM-dd |
-| datetime_format | string | no | yyyy-MM-dd HH:mm:ss |
-| time_format | string | no | HH:mm:ss |
-| skip_header_row_number | long | no | 0 |
-| schema | config | no | - |
-| sheet_name | string | no | - |
-| xml_row_tag | string | no | - |
-| xml_use_attr_format | boolean | no | - |
-| csv_use_header_line | boolean | no | false |
-| file_filter_pattern | string | no | |
-| compress_codec | string | no | none |
-| archive_compress_codec | string | no | none |
-| encoding | string | no | UTF-8 |
-| null_format | string | no | - |
-| common-options | | no | - |
+| name | type | required | default value
|
+|---------------------------|---------|----------|-----------------------------|
+| path | string | yes | -
|
+| file_format_type | string | yes | -
|
+| bucket | string | yes | -
|
+| access_key | string | yes | -
|
+| access_secret | string | yes | -
|
+| endpoint | string | yes | -
|
+| read_columns | list | no | -
|
+| delimiter/field_delimiter | string | no | \001 for text and , for csv |
+| row_delimiter | string | no | \n
|
+| parse_partition_from_path | boolean | no | true
|
+| date_format | string | no | yyyy-MM-dd
|
+| datetime_format | string | no | yyyy-MM-dd HH:mm:ss
|
+| time_format | string | no | HH:mm:ss
|
+| skip_header_row_number | long | no | 0
|
+| schema | config | no | -
|
+| sheet_name | string | no | -
|
+| xml_row_tag | string | no | -
|
+| xml_use_attr_format | boolean | no | -
|
+| csv_use_header_line | boolean | no | false
|
+| file_filter_pattern | string | no |
|
+| compress_codec | string | no | none
|
+| archive_compress_codec | string | no | none
|
+| encoding | string | no | UTF-8
|
+| null_format | string | no | -
|
+| common-options | | no | -
|
### path [string]
diff --git a/docs/en/connector-v2/source/S3File.md b/docs/en/connector-v2/source/S3File.md
index 9bd808ce6c..fa9831f0b6 100644
--- a/docs/en/connector-v2/source/S3File.md
+++ b/docs/en/connector-v2/source/S3File.md
@@ -199,7 +199,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
| access_key | string | no | -
| Only used when
`fs.s3a.aws.credentials.provider =
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider `
[...]
| access_secret | string | no | -
| Only used when
`fs.s3a.aws.credentials.provider =
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider `
[...]
| hadoop_s3_properties | map | no | -
| If you need to add other option, you could
add it here and refer to this
[link](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html)
[...]
-| delimiter/field_delimiter | string | no | \001 | Field delimiter, used to tell connector how to slice and dice fields when reading text files. Default `\001`, the same as hive's default delimiter. [...]
+| delimiter/field_delimiter | string | no | \001 for text and , for csv | Field delimiter, used to tell connector how to slice and dice fields when reading text files. Default `\001`, the same as hive's default delimiter. [...]
| row_delimiter | string | no | \n
| Row delimiter, used to tell connector how to
slice and dice rows when reading text files. Default `\n`.
[...]
| parse_partition_from_path | boolean | no | true
| Control whether parse the partition keys and
values from file path. For example if you read a file from path
`s3n://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`. Every
record data from file will be added these two fields: name="tyrantlucifer",
age=16
[...]
| date_format | string | no | yyyy-MM-dd
| Date type format, used to tell connector how
to convert string to date, supported as the following formats:`yyyy-MM-dd`
`yyyy.MM.dd` `yyyy/MM/dd`. default `yyyy-MM-dd`
[...]
diff --git a/docs/en/connector-v2/source/SftpFile.md b/docs/en/connector-v2/source/SftpFile.md
index 74edfec5f1..7d0820b59d 100644
--- a/docs/en/connector-v2/source/SftpFile.md
+++ b/docs/en/connector-v2/source/SftpFile.md
@@ -77,36 +77,36 @@ The File does not have a specific type list, and we can indicate which SeaTunnel
## Source Options
-| Name | Type | Required | default value |
Description
|
-|---------------------------|---------|----------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| host | String | Yes | - | The
target sftp host is required
|
-| port | Int | Yes | - | The
target sftp port is required
|
-| user | String | Yes | - | The
target sftp username is required
|
-| password | String | Yes | - | The
target sftp password is required
|
-| path | String | Yes | - | The
source file path.
|
-| file_format_type | String | Yes | - |
Please check #file_format_type below
|
-| file_filter_pattern | String | No | - |
Filter pattern, which used for filtering files.
|
-| filename_extension | string | no | - |
Filter filename extension, which used for filtering files with specific
extension. Example: `csv` `.txt` `json` `.xml`.
|
-| delimiter/field_delimiter | String | No | \001 | **delimiter** parameter will deprecate after version 2.3.5, please use **field_delimiter** instead. <br/> Field delimiter, used to tell connector how to slice and dice fields when reading text files. <br/> Default `\001`, the same as hive's default delimiter |
-| row_delimiter | string | no | \n | Row
delimiter, used to tell connector how to slice and dice rows when reading text
files. <br/> Default `\n`
|
-| parse_partition_from_path | Boolean | No | true |
Control whether parse the partition keys and values from file path <br/> For
example if you read a file from path
`oss://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26` <br/>
Every record data from file will be added these two fields: <br/> name
age <br/> tyrantlucifer 26 <br/> Tips: **Do not define partition fields
in schema option** |
-| date_format | String | No | yyyy-MM-dd | Date
type format, used to tell connector how to convert string to date, supported as
the following formats: <br/> `yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd` <br/>
default `yyyy-MM-dd`
|
-| datetime_format | String | No | yyyy-MM-dd HH:mm:ss |
Datetime type format, used to tell connector how to convert string to datetime,
supported as the following formats: <br/> `yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd
HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss` <br/> default `yyyy-MM-dd
HH:mm:ss`
|
-| time_format | String | No | HH:mm:ss | Time
type format, used to tell connector how to convert string to time, supported as
the following formats: <br/> `HH:mm:ss` `HH:mm:ss.SSS` <br/> default `HH:mm:ss`
|
-| skip_header_row_number | Long | No | 0 | Skip
the first few lines, but only for the txt and csv. <br/> For example, set like
following: <br/> `skip_header_row_number = 2` <br/> then SeaTunnel will skip
the first 2 lines from source files
|
-| read_columns | list | no | - | The
read column list of the data source, user can use it to implement field
projection.
|
-| sheet_name | String | No | - |
Reader the sheet of the workbook,Only used when file_format is excel.
|
-| xml_row_tag | string | no | - |
Specifies the tag name of the data rows within the XML file, only used when
file_format is xml.
|
-| xml_use_attr_format | boolean | no | - |
Specifies whether to process data using the tag attribute format, only used
when file_format is xml.
|
-| csv_use_header_line | boolean | no | false |
Whether to use the header line to parse the file, only used when the
file_format is `csv` and the file contains the header line that match RFC 4180
|
-| schema | Config | No | - |
Please check #schema below
|
-| compress_codec | String | No | None | The
compress codec of files and the details that supported as the following shown:
<br/> - txt: `lzo` `None` <br/> - json: `lzo` `None` <br/> - csv: `lzo` `None`
<br/> - orc: `lzo` `snappy` `lz4` `zlib` `None` <br/> - parquet: `lzo` `snappy`
`lz4` `gzip` `brotli` `zstd` `None` <br/> Tips: excel type does Not support any
compression format |
-| archive_compress_codec | string | no | none |
-| encoding | string | no | UTF-8 |
-| null_format | string | no | - | Only
used when file_format_type is text. null_format to define which strings can be
represented as null. e.g: `\N`
|
-| binary_chunk_size | int | no | 1024 | Only
used when file_format_type is binary. The chunk size (in bytes) for reading
binary files. Default is 1024 bytes. Larger values may improve performance for
large files but use more memory.
|
-| binary_complete_file_mode | boolean | no | false | Only
used when file_format_type is binary. Whether to read the complete file as a
single chunk instead of splitting into chunks. When enabled, the entire file
content will be read into memory at once. Default is false.
|
-| common-options | | No | - |
Source plugin common parameters, please refer to [Source Common
Options](../source-common-options.md) for details.
|
+| Name | Type | Required | default value
| Description
|
+|---------------------------|---------|----------|--------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| host | String | Yes | -
| The target sftp host is required
|
+| port | Int | Yes | -
| The target sftp port is required
|
+| user | String | Yes | -
| The target sftp username is required
|
+| password | String | Yes | -
| The target sftp password is required
|
+| path | String | Yes | -
| The source file path.
|
+| file_format_type | String | Yes | -
| Please check #file_format_type below
|
+| file_filter_pattern | String | No | -
| Filter pattern, which used for filtering files.
|
+| filename_extension | string | no | -
| Filter filename extension, which used for filtering files with specific
extension. Example: `csv` `.txt` `json` `.xml`.
|
+| delimiter/field_delimiter | String | No | \001 for text and ',' for csv | **delimiter** parameter will deprecate after version 2.3.5, please use **field_delimiter** instead. <br/> Field delimiter, used to tell connector how to slice and dice fields when reading text files. <br/> Default `\001`, the same as hive's default delimiter |
+| row_delimiter | string | no | \n
| Row delimiter, used to tell connector how to slice and dice rows when
reading text files. <br/> Default `\n`
|
+| parse_partition_from_path | Boolean | No | true
| Control whether parse the partition keys and values from file path <br/>
For example if you read a file from path
`oss://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26` <br/>
Every record data from file will be added these two fields: <br/> name
age <br/> tyrantlucifer 26 <br/> Tips: **Do not define partition fields
in schema option** |
+| date_format | String | No | yyyy-MM-dd
| Date type format, used to tell connector how to convert string to date,
supported as the following formats: <br/> `yyyy-MM-dd` `yyyy.MM.dd`
`yyyy/MM/dd` <br/> default `yyyy-MM-dd`
|
+| datetime_format | String | No | yyyy-MM-dd HH:mm:ss
| Datetime type format, used to tell connector how to convert string to
datetime, supported as the following formats: <br/> `yyyy-MM-dd HH:mm:ss`
`yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss` <br/> default
`yyyy-MM-dd HH:mm:ss`
|
+| time_format | String | No | HH:mm:ss
| Time type format, used to tell connector how to convert string to time,
supported as the following formats: <br/> `HH:mm:ss` `HH:mm:ss.SSS` <br/>
default `HH:mm:ss`
|
+| skip_header_row_number | Long | No | 0
| Skip the first few lines, but only for the txt and csv. <br/> For example,
set like following: <br/> `skip_header_row_number = 2` <br/> then SeaTunnel
will skip the first 2 lines from source files
|
+| read_columns | list | no | -
| The read column list of the data source, user can use it to implement
field projection.
|
+| sheet_name | String | No | -
| Reader the sheet of the workbook,Only used when file_format is excel.
|
+| xml_row_tag | string | no | -
| Specifies the tag name of the data rows within the XML file, only used
when file_format is xml.
|
+| xml_use_attr_format | boolean | no | -
| Specifies whether to process data using the tag attribute format, only
used when file_format is xml.
|
+| csv_use_header_line | boolean | no | false
| Whether to use the header line to parse the file, only used when the
file_format is `csv` and the file contains the header line that match RFC 4180
|
+| schema | Config | No | -
| Please check #schema below
|
+| compress_codec | String | No | None
| The compress codec of files and the details that supported as the
following shown: <br/> - txt: `lzo` `None` <br/> - json: `lzo` `None` <br/> -
csv: `lzo` `None` <br/> - orc: `lzo` `snappy` `lz4` `zlib` `None` <br/> -
parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `None` <br/> Tips: excel
type does Not support any compression format |
+| archive_compress_codec | string | no | none
|
+| encoding | string | no | UTF-8
|
+| null_format | string | no | -
| Only used when file_format_type is text. null_format to define which
strings can be represented as null. e.g: `\N`
|
+| binary_chunk_size | int | no | 1024
| Only used when file_format_type is binary. The chunk size (in bytes) for
reading binary files. Default is 1024 bytes. Larger values may improve
performance for large files but use more memory.
|
+| binary_complete_file_mode | boolean | no | false
| Only used when file_format_type is binary. Whether to read the complete
file as a single chunk instead of splitting into chunks. When enabled, the
entire file content will be read into memory at once. Default is false.
|
+| common-options | | No | -
| Source plugin common parameters, please refer to [Source Common
Options](../source-common-options.md) for details.
|
### file_filter_pattern [string]
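The source connectors read the same option; a corresponding hedged sketch (option names mirror the source tables above; the path, delimiter value and schema are placeholders):

    source {
      LocalFile {
        path = "/tmp/seatunnel/csv"
        file_format_type = "csv"
        # delimiter/field_delimiter now defaults to "," for csv and can be overridden
        field_delimiter = ";"
        schema {
          fields {
            name = string
            age = int
          }
        }
      }
    }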
diff --git a/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseFileSinkConfig.java b/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseFileSinkConfig.java
index 64bc0538c4..12c3ee01c9 100644
--- a/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseFileSinkConfig.java
+++ b/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseFileSinkConfig.java
@@ -38,7 +38,7 @@ import static org.apache.seatunnel.shade.com.google.common.base.Preconditions.ch
 public class BaseFileSinkConfig implements DelimiterConfig, Serializable {
     private static final long serialVersionUID = 1L;
     protected CompressFormat compressFormat = FileBaseSinkOptions.COMPRESS_CODEC.defaultValue();
-    protected String fieldDelimiter = FileBaseSinkOptions.FIELD_DELIMITER.defaultValue();
+    protected String fieldDelimiter;
     protected String rowDelimiter = FileBaseSinkOptions.ROW_DELIMITER.defaultValue();
     protected int batchSize = FileBaseSinkOptions.BATCH_SIZE.defaultValue();
     protected String path;
@@ -61,11 +61,6 @@ public class BaseFileSinkConfig implements DelimiterConfig, Serializable {
         if (config.hasPath(FileBaseSinkOptions.BATCH_SIZE.key())) {
             this.batchSize = config.getInt(FileBaseSinkOptions.BATCH_SIZE.key());
         }
-        if (config.hasPath(FileBaseSinkOptions.FIELD_DELIMITER.key())
-                && StringUtils.isNotEmpty(
-                        config.getString(FileBaseSinkOptions.FIELD_DELIMITER.key()))) {
-            this.fieldDelimiter = config.getString(FileBaseSinkOptions.FIELD_DELIMITER.key());
-        }
         if (config.hasPath(FileBaseSinkOptions.ROW_DELIMITER.key())) {
             this.rowDelimiter = config.getString(FileBaseSinkOptions.ROW_DELIMITER.key());
@@ -109,6 +104,18 @@ public class BaseFileSinkConfig implements DelimiterConfig, Serializable {
             this.fileFormat = FileBaseSinkOptions.FILE_FORMAT_TYPE.defaultValue();
         }
+        if (config.hasPath(FileBaseSinkOptions.FIELD_DELIMITER.key())
+                && StringUtils.isNotEmpty(
+                        config.getString(FileBaseSinkOptions.FIELD_DELIMITER.key()))) {
+            this.fieldDelimiter = config.getString(FileBaseSinkOptions.FIELD_DELIMITER.key());
+        } else {
+            if (FileFormat.CSV.equals(this.fileFormat)) {
+                this.fieldDelimiter = ",";
+            } else {
+                this.fieldDelimiter = FileBaseSinkOptions.FIELD_DELIMITER.defaultValue();
+            }
+        }
+
         if (config.hasPath(FileBaseSinkOptions.FILENAME_EXTENSION.key())
                 && !StringUtils.isBlank(
                         config.getString(FileBaseSinkOptions.FILENAME_EXTENSION.key()))) {
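Put differently (an illustrative reading of the hunk above, not extra code): an explicit, non-empty field_delimiter always wins; otherwise the sink config now falls back to "," for csv and to the option's default ('\001') for every other format.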
diff --git a/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/FileFormat.java b/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/FileFormat.java
index d0f067a379..4a947ea9e2 100644
--- a/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/FileFormat.java
+++ b/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/FileFormat.java
@@ -47,7 +47,6 @@ public enum FileFormat implements Serializable {
     CSV("csv") {
         @Override
         public WriteStrategy getWriteStrategy(FileSinkConfig fileSinkConfig) {
-            fileSinkConfig.setFieldDelimiter(",");
             return new CsvWriteStrategy(fileSinkConfig);
         }
diff --git a/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/sink/writer/CsvWriteStrategy.java b/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/sink/writer/CsvWriteStrategy.java
index 8e9cc3170b..9c0dabf708 100644
--- a/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/sink/writer/CsvWriteStrategy.java
+++ b/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/sink/writer/CsvWriteStrategy.java
@@ -57,6 +57,8 @@ public class CsvWriteStrategy extends AbstractWriteStrategy<FSDataOutputStream>
     private final CsvStringQuoteMode csvStringQuoteMode;
     private SerializationSchema serializationSchema;

+    private final String fieldDelimiter;
+
     public CsvWriteStrategy(FileSinkConfig fileSinkConfig) {
         super(fileSinkConfig);
         this.csvStringQuoteMode = fileSinkConfig.getCsvStringQuoteMode();
@@ -69,6 +71,7 @@ public class CsvWriteStrategy extends AbstractWriteStrategy<FSDataOutputStream>
         this.fileFormat = fileSinkConfig.getFileFormat();
         this.enableHeaderWriter = fileSinkConfig.getEnableHeaderWriter();
         this.charset = EncodingUtils.tryParseCharset(fileSinkConfig.getEncoding());
+        this.fieldDelimiter = fileSinkConfig.getFieldDelimiter();
     }

     @Override
@@ -79,7 +82,7 @@ public class CsvWriteStrategy extends AbstractWriteStrategy<FSDataOutputStream>
                         .seaTunnelRowType(
                                 buildSchemaWithRowType(
                                         catalogTable.getSeaTunnelRowType(), sinkColumnsIndexInRow))
-                        .delimiter(",")
+                        .delimiter(fieldDelimiter)
                         .dateFormatter(dateFormat)
                         .dateTimeFormatter(dateTimeFormat)
                         .timeFormatter(timeFormat)
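
On the write side the delimiter now flows from FileSinkConfig into CsvSerializationSchema instead of being hardcoded. As a stand-alone illustration of writing with a non-default delimiter, here is a minimal Apache Commons CSV sketch; it is not the code path SeaTunnel uses, just the same idea:

    // Sketch only: semicolon-delimited output; fields containing newlines are quoted automatically.
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVPrinter;

    public class CsvWriteSketch {
        public static void main(String[] args) throws Exception {
            CSVFormat format = CSVFormat.EXCEL.builder().setDelimiter(';').build();
            StringBuilder out = new StringBuilder();
            try (CSVPrinter printer = new CSVPrinter(out, format)) {
                printer.printRecord(1, "b\na", 10);   // embedded newline forces quoting
                printer.printRecord(2, "b", 100);
            }
            System.out.print(out);
        }
    }
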
diff --git a/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java b/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java
index dc204ae62e..3df935e9b2 100644
--- a/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java
+++ b/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java
@@ -38,6 +38,7 @@ import org.apache.seatunnel.format.csv.processor.CsvLineProcessor;
 import org.apache.seatunnel.format.csv.processor.DefaultCsvLineProcessor;

 import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.csv.CSVFormat.Builder;
 import org.apache.commons.csv.CSVParser;
 import org.apache.commons.csv.CSVRecord;

@@ -102,8 +103,9 @@ public class CsvReadStrategy extends AbstractReadStrategy {
                 actualInputStream = inputStream;
                 break;
         }
-
-        CSVFormat csvFormat = CSVFormat.DEFAULT;
+        Builder builder =
+                CSVFormat.EXCEL.builder().setIgnoreEmptyLines(true).setDelimiter(getDelimiter());
+        CSVFormat csvFormat = builder.build();
         if (firstLineAsHeader) {
             csvFormat = csvFormat.withFirstRecordAsHeader();
         }
@@ -200,7 +202,7 @@ public class CsvReadStrategy extends AbstractReadStrategy {
         ReadonlyConfig readonlyConfig = ReadonlyConfig.fromConfig(pluginConfig);
         CsvDeserializationSchema.Builder builder =
                 CsvDeserializationSchema.builder()
-                        .delimiter(",")
+                        .delimiter(getDelimiter())
                         .csvLineProcessor(processor)
                         .nullFormat(
                                 readonlyConfig
@@ -215,6 +217,11 @@ public class CsvReadStrategy extends AbstractReadStrategy {
         return getActualSeaTunnelRowTypeInfo();
     }

+    private String getDelimiter() {
+        ReadonlyConfig readonlyConfig = ReadonlyConfig.fromConfig(pluginConfig);
+        return readonlyConfig.getOptional(FileBaseSourceOptions.FIELD_DELIMITER).orElse(",");
+    }
+
     @Override
     public void setCatalogTable(CatalogTable catalogTable) {
         SeaTunnelRowType rowType = catalogTable.getSeaTunnelRowType();
@@ -229,7 +236,7 @@ public class CsvReadStrategy extends AbstractReadStrategy {
         initFormatter();
         CsvDeserializationSchema.Builder builder =
                 CsvDeserializationSchema.builder()
-                        .delimiter(",")
+                        .delimiter(getDelimiter())
                         .csvLineProcessor(processor)
                         .nullFormat(
                                 readonlyConfig
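
On the read side the same Commons CSV builder calls used in the hunk above honour a configured delimiter, falling back to "," when none is set. A minimal sketch follows (inline data instead of a file so it runs on its own; it mirrors, but is not, the connector code):

    // Sketch only: parse semicolon-delimited input, including a quoted field spanning two lines.
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    import java.io.StringReader;

    public class CsvReadSketch {
        public static void main(String[] args) throws Exception {
            CSVFormat format =
                    CSVFormat.EXCEL.builder().setIgnoreEmptyLines(true).setDelimiter(';').build();
            String data = "1;\"b\na\";\"10\"\n2;b;100";
            try (CSVParser parser = CSVParser.parse(new StringReader(data), format)) {
                for (CSVRecord record : parser) {
                    System.out.println(record.get(0) + " | " + record.get(1) + " | " + record.get(2));
                }
            }
        }
    }
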
diff --git a/seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategyTest.java b/seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategyTest.java
index c59de8717f..0aa0189481 100644
--- a/seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategyTest.java
+++ b/seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategyTest.java
@@ -71,6 +71,37 @@ public class CsvReadStrategyTest {
         Assertions.assertEquals(100, testCollector.getRows().get(1).getField(2));
     }

+    @Test
+    public void testReadComplexCsv() throws Exception {
+        URL resource = CsvReadStrategyTest.class.getResource("/test-csv.csv");
+        String path = Paths.get(resource.toURI()).toString();
+        CsvReadStrategy csvReadStrategy = new CsvReadStrategy();
+        LocalConf localConf = new LocalConf(FS_DEFAULT_NAME_DEFAULT);
+        csvReadStrategy.init(localConf);
+        csvReadStrategy.getFileNamesByPath(path);
+        System.setProperty("field_delimiter", ";");
+        csvReadStrategy.setPluginConfig(ConfigFactory.systemProperties());
+        csvReadStrategy.setCatalogTable(
+                CatalogTableUtil.getCatalogTable(
+                        "test",
+                        new SeaTunnelRowType(
+                                new String[] {"id", "name", "age"},
+                                new SeaTunnelDataType[] {
+                                    BasicType.INT_TYPE, BasicType.STRING_TYPE, BasicType.INT_TYPE
+                                })));
+        TestCollector testCollector = new TestCollector();
+        csvReadStrategy.read(path, "", testCollector);
+
+        Assertions.assertEquals(2, testCollector.getRows().size());
+        Assertions.assertEquals(1, testCollector.getRows().get(0).getField(0));
+        Assertions.assertEquals(
+                "b" + System.lineSeparator() + "a", testCollector.getRows().get(0).getField(1));
+        Assertions.assertEquals(10, testCollector.getRows().get(0).getField(2));
+        Assertions.assertEquals(2, testCollector.getRows().get(1).getField(0));
+        Assertions.assertEquals("b", testCollector.getRows().get(1).getField(1));
+        Assertions.assertEquals(100, testCollector.getRows().get(1).getField(2));
+    }
+
     public static class TestCollector implements Collector<SeaTunnelRow> {

         private final List<SeaTunnelRow> rows = new ArrayList<>();
diff --git a/seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/writer/CsvWriteStrategyTest.java b/seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/writer/CsvWriteStrategyTest.java
index d41a150907..7202da3e01 100644
--- a/seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/writer/CsvWriteStrategyTest.java
+++ b/seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/writer/CsvWriteStrategyTest.java
@@ -109,4 +109,67 @@ public class CsvWriteStrategyTest {
         Assertions.assertEquals(1, readRows.size());
         readStrategy.close();
     }
+
+    @DisabledOnOs(OS.WINDOWS)
+    @Test
+    public void testCsv2() throws Exception {
+        Map<String, Object> writeConfig = new HashMap<>();
+        writeConfig.put("tmp_path", TMP_PATH);
+        writeConfig.put("path", "file:///tmp/seatunnel/csv/int96");
+        writeConfig.put("file_format_type", FileFormat.CSV.name());
+        writeConfig.put("field_delimiter", ",");
+
+        SeaTunnelRowType writeRowType =
+                new SeaTunnelRowType(
+                        new String[] {"id", "name", "age"},
+                        new SeaTunnelDataType[] {
+                            BasicType.INT_TYPE, BasicType.STRING_TYPE, BasicType.INT_TYPE
+                        });
+        FileSinkConfig writeSinkConfig =
+                new FileSinkConfig(ConfigFactory.parseMap(writeConfig), writeRowType);
+        CsvWriteStrategy writeStrategy = new CsvWriteStrategy(writeSinkConfig);
+        ParquetReadStrategyTest.LocalConf hadoopConf =
+                new ParquetReadStrategyTest.LocalConf(FS_DEFAULT_NAME_DEFAULT);
+        writeStrategy.setCatalogTable(
+                CatalogTableUtil.getCatalogTable("test", null, null, "test", writeRowType));
+        writeStrategy.init(hadoopConf, "test1", "test1", 0);
+        writeStrategy.beginTransaction(1L);
+        writeStrategy.write(new SeaTunnelRow(new Object[] {1, "a", 20}));
+        writeStrategy.finishAndCloseFile();
+        writeStrategy.close();
+
+        CsvReadStrategy readStrategy = new CsvReadStrategy();
+        readStrategy.init(hadoopConf);
+        List<String> readFiles = readStrategy.getFileNamesByPath(TMP_PATH);
+        readStrategy.setPluginConfig(ConfigFactory.empty());
+        readStrategy.setCatalogTable(
+                CatalogTableUtil.getCatalogTable(
+                        "test",
+                        new SeaTunnelRowType(
+                                new String[] {"id", "name", "age"},
+                                new SeaTunnelDataType[] {
+                                    BasicType.INT_TYPE, BasicType.STRING_TYPE, BasicType.INT_TYPE
+                                })));
+        Assertions.assertEquals(1, readFiles.size());
+        String readFilePath = readFiles.get(0);
+        List<SeaTunnelRow> readRows = new ArrayList<>();
+        Collector<SeaTunnelRow> readCollector =
+                new Collector<SeaTunnelRow>() {
+                    @Override
+                    public void collect(SeaTunnelRow record) {
+                        Assertions.assertEquals(1, record.getField(0));
+                        Assertions.assertEquals("a", record.getField(1));
+                        Assertions.assertEquals(20, record.getField(2));
+                        readRows.add(record);
+                    }
+
+                    @Override
+                    public Object getCheckpointLock() {
+                        return null;
+                    }
+                };
+        readStrategy.read(readFilePath, "test", readCollector);
+        Assertions.assertEquals(1, readRows.size());
+        readStrategy.close();
+    }
 }
diff --git a/seatunnel-connectors-v2/connector-file/connector-file-base/src/test/resources/test-csv.csv b/seatunnel-connectors-v2/connector-file/connector-file-base/src/test/resources/test-csv.csv
new file mode 100644
index 0000000000..d72bc175a4
--- /dev/null
+++ b/seatunnel-connectors-v2/connector-file/connector-file-base/src/test/resources/test-csv.csv
@@ -0,0 +1,3 @@
+1;"b
+a";"10"
+2;b;100
\ No newline at end of file
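
With field_delimiter set to ";", this fixture parses as two records, (1, "b" + newline + "a", 10) and (2, "b", 100), which is exactly what testReadComplexCsv above asserts; the quoted second field of the first record intentionally spans two physical lines.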
diff --git a/seatunnel-e2e/seatunnel-connector-v2-e2e/connector-file-local-e2e/src/test/resources/csv/local_csv_to_assert.conf b/seatunnel-e2e/seatunnel-connector-v2-e2e/connector-file-local-e2e/src/test/resources/csv/local_csv_to_assert.conf
index 971b8366a5..be6ef0e64c 100644
--- a/seatunnel-e2e/seatunnel-connector-v2-e2e/connector-file-local-e2e/src/test/resources/csv/local_csv_to_assert.conf
+++ b/seatunnel-e2e/seatunnel-connector-v2-e2e/connector-file-local-e2e/src/test/resources/csv/local_csv_to_assert.conf
@@ -32,7 +32,7 @@ source {
     path = "/tmp/csv/seatunnel"
     plugin_output = "fake"
    file_format_type = csv
-    field_delimiter = "\t"
+    field_delimiter = ","
    row_delimiter = "\n"
    skip_header_row_number = 1
    schema = {
