This is an automated email from the ASF dual-hosted git repository.
ic4y pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git
The following commit(s) were added to refs/heads/dev by this push:
new 5b3cbfbb7b [Doc] Improve S3File Source & S3File Sink document (#5101)
5b3cbfbb7b is described below
commit 5b3cbfbb7b9dc0a575adbe7aa4ae390a798195fa
Author: Eric <[email protected]>
AuthorDate: Fri Aug 11 14:23:09 2023 +0800
[Doc] Improve S3File Source & S3File Sink document (#5101)
* Improve S3File Source & S3File Sink document
---
docs/en/connector-v2/sink/S3File.md | 241 +++++++++++++++++++-------
docs/en/connector-v2/source/S3File.md | 310 ++++++++++++++++++----------------
2 files changed, 336 insertions(+), 215 deletions(-)
diff --git a/docs/en/connector-v2/sink/S3File.md b/docs/en/connector-v2/sink/S3File.md
index 7841afdf04..4bb670ae38 100644
--- a/docs/en/connector-v2/sink/S3File.md
+++ b/docs/en/connector-v2/sink/S3File.md
@@ -1,24 +1,17 @@
# S3File
-> S3 file sink connector
+> S3 File Sink Connector
-## Description
-
-Output data to aws s3 file system.
-
-:::tip
+## Support Those Engines
-If you use spark/flink, In order to use this connector, You must ensure your spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x.
+> Spark<br/>
+> Flink<br/>
+> SeaTunnel Zeta<br/>
-If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
-
-To use this connector you need put hadoop-aws-3.1.4.jar and aws-java-sdk-bundle-1.11.271.jar in ${SEATUNNEL_HOME}/lib dir.
-
-:::
-
-## Key features
+## Key Features
- [x] [exactly-once](../../concept/connector-v2-features.md)
+- [ ] [cdc](../../concept/connector-v2-features.md)
By default, we use 2PC commit to ensure `exactly-once`
@@ -30,59 +23,100 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] json
- [x] excel
-## Options
-
-| name | type | required | default value | remarks |
-|------|------|----------|---------------|---------|
-| path | string | yes | - | |
-| bucket | string | yes | - | |
-| fs.s3a.endpoint | string | yes | - | |
-| fs.s3a.aws.credentials.provider | string | yes | com.amazonaws.auth.InstanceProfileCredentialsProvider | |
-| access_key | string | no | - | Only used when fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider |
-| access_secret | string | no | - | Only used when fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider |
-| custom_filename | boolean | no | false | Whether you need custom the filename |
-| file_name_expression | string | no | "${transactionId}" | Only used when custom_filename is true |
-| filename_time_format | string | no | "yyyy.MM.dd" | Only used when custom_filename is true |
-| file_format_type | string | no | "csv" | |
-| field_delimiter | string | no | '\001' | Only used when file_format_type is text |
-| row_delimiter | string | no | "\n" | Only used when file_format_type is text |
-| have_partition | boolean | no | false | Whether you need processing partitions. |
-| partition_by | array | no | - | Only used then have_partition is true |
-| partition_dir_expression | string | no | "${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used then have_partition is true |
-| is_partition_field_write_in_file | boolean | no | false | Only used then have_partition is true |
-| sink_columns | array | no | | When this parameter is empty, all fields are sink columns |
-| is_enable_transaction | boolean | no | true | |
-| batch_size | int | no | 1000000 | |
-| compress_codec | string | no | none | |
-| common-options | object | no | - | |
-| max_rows_in_memory | int | no | - | Only used when file_format_type is excel. |
-| sheet_name | string | no | Sheet${Random number} | Only used when file_format_type is excel. |
-
-### path [string]
-
-The target dir path is required.
-
-### bucket [string]
-
-The bucket address of s3 file system, for example: `s3n://seatunnel-test`, if you use `s3a` protocol, this parameter should be `s3a://seatunnel-test`.
-
-### fs.s3a.endpoint [string]
-
-fs s3a endpoint
-
-### fs.s3a.aws.credentials.provider [string]
-
-The way to authenticate s3a. We only support `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` and `com.amazonaws.auth.InstanceProfileCredentialsProvider` now.
-
-More information about the credential provider you can see [Hadoop AWS Document](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Simple_name.2Fsecret_credentials_with_SimpleAWSCredentialsProvider.2A)
-
-### access_key [string]
-
-The access key of s3 file system. If this parameter is not set, please confirm that the credential provider chain can be authenticated correctly, you could check this [hadoop-aws](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html)
+## Description
-### access_secret [string]
+Output data to AWS S3 file system.
-The access secret of s3 file system. If this parameter is not set, please confirm that the credential provider chain can be authenticated correctly, you could check this [hadoop-aws](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html)
+## Supported DataSource Info
+
+| Datasource | Supported Versions |
+|------------|--------------------|
+| S3 | current |
+
+## Dependency
+
+> If you use Spark/Flink, in order to use this connector, you must ensure your Spark/Flink cluster has already integrated Hadoop. The tested Hadoop version is 2.x.
+>
+> If you use SeaTunnel Engine, it automatically integrates the Hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under `${SEATUNNEL_HOME}/lib` to confirm this.
+> To use this connector you need to put `hadoop-aws-3.1.4.jar` and `aws-java-sdk-bundle-1.11.271.jar` in the `${SEATUNNEL_HOME}/lib` dir.
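The jar placement above can be sketched as a small script. The Maven Central URLs below are an assumption based on the standard repository layout for those two artifacts; verify them against the versions your deployment actually needs.

```shell
#!/bin/sh
# Sketch only: build the download URLs for the two jars named in the doc
# and (optionally) fetch them into ${SEATUNNEL_HOME}/lib.
SEATUNNEL_HOME="${SEATUNNEL_HOME:-/opt/seatunnel}"   # assumed install location
MAVEN_REPO="https://repo1.maven.org/maven2"          # assumed Maven Central mirror
HADOOP_AWS_URL="${MAVEN_REPO}/org/apache/hadoop/hadoop-aws/3.1.4/hadoop-aws-3.1.4.jar"
SDK_BUNDLE_URL="${MAVEN_REPO}/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar"
echo "${HADOOP_AWS_URL}"
echo "${SDK_BUNDLE_URL}"
# Uncomment to actually download:
# wget -P "${SEATUNNEL_HOME}/lib" "${HADOOP_AWS_URL}" "${SDK_BUNDLE_URL}"
```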
+
+## Data Type Mapping
+
+If you write to `csv` or `text` file types, all columns will be string.
+
+### Orc File Type
+
+| SeaTunnel Data type | Orc Data type |
+|----------------------|-----------------------|
+| STRING | STRING |
+| BOOLEAN | BOOLEAN |
+| TINYINT | BYTE |
+| SMALLINT | SHORT |
+| INT | INT |
+| BIGINT | LONG |
+| FLOAT | FLOAT |
+| DOUBLE | DOUBLE |
+| DECIMAL | DECIMAL |
+| BYTES | BINARY |
+| DATE | DATE |
+| TIME <br/> TIMESTAMP | TIMESTAMP |
+| ROW | STRUCT |
+| NULL | UNSUPPORTED DATA TYPE |
+| ARRAY | LIST |
+| MAP                  | MAP                   |
+
+### Parquet File Type
+
+| SeaTunnel Data type | Parquet Data type |
+|----------------------|-----------------------|
+| STRING | STRING |
+| BOOLEAN | BOOLEAN |
+| TINYINT | INT_8 |
+| SMALLINT | INT_16 |
+| INT | INT32 |
+| BIGINT | INT64 |
+| FLOAT | FLOAT |
+| DOUBLE | DOUBLE |
+| DECIMAL | DECIMAL |
+| BYTES | BINARY |
+| DATE | DATE |
+| TIME <br/> TIMESTAMP | TIMESTAMP_MILLIS |
+| ROW | GroupType |
+| NULL | UNSUPPORTED DATA TYPE |
+| ARRAY | LIST |
+| MAP                  | MAP                   |
+
+## Sink Options
+
+| name | type | required | default value | Description |
+|------|------|----------|---------------|-------------|
+| path | string | yes | - | |
+| bucket | string | yes | - | |
+| fs.s3a.endpoint | string | yes | - | |
+| fs.s3a.aws.credentials.provider | string | yes | com.amazonaws.auth.InstanceProfileCredentialsProvider | The way to authenticate s3a. We only support `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` and `com.amazonaws.auth.InstanceProfileCredentialsProvider` now. |
+| access_key | string | no | - | Only used when fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider |
+| access_secret | string | no | - | Only used when fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider |
+| custom_filename | boolean | no | false | Whether you need to customize the filename |
+| file_name_expression | string | no | "${transactionId}" | Only used when custom_filename is true |
+| filename_time_format | string | no | "yyyy.MM.dd" | Only used when custom_filename is true |
+| file_format_type | string | no | "csv" | |
+| field_delimiter | string | no | '\001' | Only used when file_format_type is text |
+| row_delimiter | string | no | "\n" | Only used when file_format_type is text |
+| have_partition | boolean | no | false | Whether you need to process partitions. |
+| partition_by | array | no | - | Only used when have_partition is true |
+| partition_dir_expression | string | no | "${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used when have_partition is true |
+| is_partition_field_write_in_file | boolean | no | false | Only used when have_partition is true |
+| sink_columns | array | no | | When this parameter is empty, all fields are sink columns |
+| is_enable_transaction | boolean | no | true | |
+| batch_size | int | no | 1000000 | |
+| compress_codec | string | no | none | |
+| common-options | object | no | - | |
+| max_rows_in_memory | int | no | - | Only used when file_format_type is excel. |
+| sheet_name | string | no | Sheet${Random number} | Only used when file_format_type is excel. |
+| hadoop_s3_properties | map | no | | If you need to add other options, you could add them here and refer to this [link](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html) |
### hadoop_s3_properties [map]
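The `hadoop_s3_properties` map passes raw Hadoop S3A keys straight through to the client. The two keys below appear in the sink example later in this patch; the values here are illustrative only:

```hocon
hadoop_s3_properties {
  "fs.s3a.buffer.dir" = "/data/st_test/s3a"
  "fs.s3a.fast.upload.buffer" = "disk"
}
```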
@@ -208,6 +242,83 @@ Writer the sheet of the workbook
## Example
+### Simple:
+
+> This example defines a SeaTunnel synchronization task that automatically generates data through FakeSource and sends it to the S3File Sink. FakeSource generates a total of 16 rows of data (row.num=16), with each row having two fields, name (string type) and age (int type). A file will be created in the target s3 dir and all of the data will be written into it.
+> Before running this job, you need to create the s3 path /seatunnel/text. And if you have not yet installed and deployed SeaTunnel, you need to follow the instructions in [Install SeaTunnel](../../start-v2/locally/deployment.md) to install and deploy SeaTunnel. Then follow the instructions in [Quick Start With SeaTunnel Engine](../../start-v2/locally/quick-start-seatunnel-engine.md) to run this job.
+
+```hocon
+# Defining the runtime environment
+env {
+ # You can set flink configuration here
+ execution.parallelism = 1
+ job.mode = "BATCH"
+}
+
+source {
+  # This is an example source plugin **only for test and demonstrate the feature source plugin**
+ FakeSource {
+ parallelism = 1
+ result_table_name = "fake"
+ row.num = 16
+ schema = {
+ fields {
+ c_map = "map<string, array<int>>"
+ c_array = "array<int>"
+ name = string
+ c_boolean = boolean
+ age = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ c_decimal = "decimal(16, 1)"
+ c_null = "null"
+ c_bytes = bytes
+ c_date = date
+ c_timestamp = timestamp
+ }
+ }
+ }
+  # If you would like to get more information about how to configure seatunnel and see full list of source plugins,
+ # please go to https://seatunnel.apache.org/docs/category/source-v2
+}
+
+transform {
+  # If you would like to get more information about how to configure seatunnel and see full list of transform plugins,
+ # please go to https://seatunnel.apache.org/docs/category/transform-v2
+}
+
+sink {
+ S3File {
+ bucket = "s3a://seatunnel-test"
+ tmp_path = "/tmp/seatunnel"
+ path="/seatunnel/text"
+ fs.s3a.endpoint="s3.cn-north-1.amazonaws.com.cn"
+    fs.s3a.aws.credentials.provider="com.amazonaws.auth.InstanceProfileCredentialsProvider"
+ file_format_type = "text"
+ field_delimiter = "\t"
+ row_delimiter = "\n"
+ have_partition = true
+ partition_by = ["age"]
+ partition_dir_expression = "${k0}=${v0}"
+ is_partition_field_write_in_file = true
+ custom_filename = true
+ file_name_expression = "${transactionId}_${now}"
+ filename_time_format = "yyyy.MM.dd"
+ sink_columns = ["name","age"]
+ is_enable_transaction=true
+ hadoop_s3_properties {
+ "fs.s3a.buffer.dir" = "/data/st_test/s3a"
+ "fs.s3a.fast.upload.buffer" = "disk"
+ }
+ }
+  # If you would like to get more information about how to configure seatunnel and see full list of sink plugins,
+ # please go to https://seatunnel.apache.org/docs/category/sink-v2
+}
+```
+
For text file format with `have_partition` and `custom_filename` and `sink_columns` and `com.amazonaws.auth.InstanceProfileCredentialsProvider`
```hocon
diff --git a/docs/en/connector-v2/source/S3File.md b/docs/en/connector-v2/source/S3File.md
index f7ad1cc8bd..54124a3703 100644
--- a/docs/en/connector-v2/source/S3File.md
+++ b/docs/en/connector-v2/source/S3File.md
@@ -1,22 +1,14 @@
# S3File
-> S3 file source connector
+> S3 File Source Connector
-## Description
-
-Read data from aws s3 file system.
-
-:::tip
-
-If you use spark/flink, In order to use this connector, You must ensure your spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x.
-
-If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
+## Support Those Engines
-To use this connector you need put hadoop-aws-3.1.4.jar and aws-java-sdk-bundle-1.11.271.jar in ${SEATUNNEL_HOME}/lib dir.
+> Spark<br/>
+> Flink<br/>
+> SeaTunnel Zeta<br/>
-:::
-
-## Key features
+## Key Features
- [x] [batch](../../concept/connector-v2-features.md)
- [ ] [stream](../../concept/connector-v2-features.md)
@@ -35,104 +27,31 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
- [x] json
- [x] excel
-## Options
-
-| name | type | required | default value |
-|------|------|----------|---------------|
-| path | string | yes | - |
-| file_format_type | string | yes | - |
-| bucket | string | yes | - |
-| fs.s3a.endpoint | string | yes | - |
-| fs.s3a.aws.credentials.provider | string | yes | com.amazonaws.auth.InstanceProfileCredentialsProvider |
-| read_columns | list | no | - |
-| access_key | string | no | - |
-| access_secret | string | no | - |
-| hadoop_s3_properties | map | no | - |
-| delimiter | string | no | \001 |
-| parse_partition_from_path | boolean | no | true |
-| date_format | string | no | yyyy-MM-dd |
-| datetime_format | string | no | yyyy-MM-dd HH:mm:ss |
-| time_format | string | no | HH:mm:ss |
-| skip_header_row_number | long | no | 0 |
-| schema | config | no | - |
-| common-options | | no | - |
-| sheet_name | string | no | - |
-| file_filter_pattern | string | no | - |
-
-### path [string]
-
-The source file path.
-
-### fs.s3a.endpoint [string]
-
-fs s3a endpoint
-
-### fs.s3a.aws.credentials.provider [string]
-
-The way to authenticate s3a. We only support `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` and `com.amazonaws.auth.InstanceProfileCredentialsProvider` now.
-
-More information about the credential provider you can see [Hadoop AWS Document](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Simple_name.2Fsecret_credentials_with_SimpleAWSCredentialsProvider.2A)
-
-### delimiter [string]
-
-Field delimiter, used to tell connector how to slice and dice fields when reading text files
-
-default `\001`, the same as hive's default delimiter
-
-### parse_partition_from_path [boolean]
-
-Control whether parse the partition keys and values from file path
-
-For example if you read a file from path `s3n://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`
-
-Every record data from file will be added these two fields:
-
-| name | age |
-|---------------|-----|
-| tyrantlucifer | 26 |
-
-Tips: **Do not define partition fields in schema option**
-
-### date_format [string]
-
-Date type format, used to tell connector how to convert string to date, supported as the following formats:
-
-`yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`
-
-default `yyyy-MM-dd`
-
-### datetime_format [string]
-
-Datetime type format, used to tell connector how to convert string to datetime, supported as the following formats:
-
-`yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss`
-
-default `yyyy-MM-dd HH:mm:ss`
-
-### time_format [string]
-
-Time type format, used to tell connector how to convert string to time, supported as the following formats:
-
-`HH:mm:ss` `HH:mm:ss.SSS`
-
-default `HH:mm:ss`
+## Description
-### skip_header_row_number [long]
+Read data from AWS S3 file system.
-Skip the first few lines, but only for the txt and csv.
+## Supported DataSource Info
-For example, set like following:
+| Datasource | Supported versions |
+|------------|--------------------|
+| S3 | current |
-`skip_header_row_number = 2`
+## Dependency
-then SeaTunnel will skip the first 2 lines from source files
+> If you use Spark/Flink, in order to use this connector, you must ensure your Spark/Flink cluster has already integrated Hadoop. The tested Hadoop version is 2.x.<br/>
+>
+> If you use SeaTunnel Zeta, it automatically integrates the Hadoop jar when you download and install SeaTunnel Zeta. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.<br/>
+> To use this connector you need to put hadoop-aws-3.1.4.jar and aws-java-sdk-bundle-1.11.271.jar in the ${SEATUNNEL_HOME}/lib dir.
-### file_format_type [string]
+## Data Type Mapping
-File type, supported as the following file types:
+Data type mapping is related to the type of file being read. We support the following file types:
`text` `csv` `parquet` `orc` `json` `excel`
+### JSON File Type
+
If you assign file type to `json`, you should also assign schema option to tell connector how to parse data to the row you want.
For example:
@@ -174,7 +93,7 @@ connector will generate data as the following:
|------|-------------|---------|
| 200 | get success | true |
-If you assign file type to `parquet` `orc`, schema option not required, connector can find the schema of upstream data automatically.
+### Text Or CSV File Type
If you assign file type to `text` `csv`, you can choose to specify the schema information or not.
@@ -215,61 +134,102 @@ connector will generate data as the following:
|---------------|-----|--------|
| tyrantlucifer | 26 | male |
-### bucket [string]
-
-The bucket address of s3 file system, for example: `s3n://seatunnel-test`, if you use `s3a` protocol, this parameter should be `s3a://seatunnel-test`.
-
-### access_key [string]
-
-The access key of s3 file system. If this parameter is not set, please confirm that the credential provider chain can be authenticated correctly, you could check this [hadoop-aws](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html)
-
-### access_secret [string]
-
-The access secret of s3 file system. If this parameter is not set, please confirm that the credential provider chain can be authenticated correctly, you could check this [hadoop-aws](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html)
+### Orc File Type
-### hadoop_s3_properties [map]
-
-If you need to add a other option, you could add it here and refer to this [hadoop-aws](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html)
-
-```
-hadoop_s3_properties {
- "xxx" = "xxx"
- }
-```
-
-### schema [config]
-
-#### fields [Config]
-
-The schema of upstream data.
-
-### read_columns [list]
-
-The read column list of the data source, user can use it to implement field projection.
-
-The file type supported column projection as the following shown:
-
-- text
-- json
-- csv
-- orc
-- parquet
-- excel
+If you assign file type to `orc`, the schema option is not required; the connector can find the schema of upstream data automatically.
-**Tips: If the user wants to use this feature when reading `text` `json` `csv` files, the schema option must be configured**
+| Orc Data type | SeaTunnel Data type |
+|---------------|---------------------|
+| BOOLEAN | BOOLEAN |
+| INT | INT |
+| BYTE | BYTE |
+| SHORT | SHORT |
+| LONG | LONG |
+| FLOAT | FLOAT |
+| DOUBLE | DOUBLE |
+| BINARY | BINARY |
+| STRING<br/>VARCHAR<br/>CHAR | STRING |
+| DATE | LOCAL_DATE_TYPE |
+| TIMESTAMP | LOCAL_DATE_TIME_TYPE |
+| DECIMAL | DECIMAL |
+| LIST(STRING) | STRING_ARRAY_TYPE |
+| LIST(BOOLEAN) | BOOLEAN_ARRAY_TYPE |
+| LIST(TINYINT) | BYTE_ARRAY_TYPE |
+| LIST(SMALLINT) | SHORT_ARRAY_TYPE |
+| LIST(INT) | INT_ARRAY_TYPE |
+| LIST(BIGINT) | LONG_ARRAY_TYPE |
+| LIST(FLOAT) | FLOAT_ARRAY_TYPE |
+| LIST(DOUBLE) | DOUBLE_ARRAY_TYPE |
+| Map<K,V> | MapType; the K and V types will be transformed to SeaTunnel types |
+| STRUCT | SeaTunnelRowType |
+
+### Parquet File Type
-### common options
+If you assign file type to `parquet`, the schema option is not required; the connector can find the schema of upstream data automatically.
-Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.
+| Parquet Data type | SeaTunnel Data type |
+|-------------------|---------------------|
+| INT_8 | BYTE |
+| INT_16 | SHORT |
+| DATE | DATE |
+| TIMESTAMP_MILLIS | TIMESTAMP |
+| INT64 | LONG |
+| INT96 | TIMESTAMP |
+| BINARY | BYTES |
+| FLOAT | FLOAT |
+| DOUBLE | DOUBLE |
+| BOOLEAN | BOOLEAN |
+| FIXED_LEN_BYTE_ARRAY | TIMESTAMP<br/>DECIMAL |
+| DECIMAL | DECIMAL |
+| LIST(STRING) | STRING_ARRAY_TYPE |
+| LIST(BOOLEAN) | BOOLEAN_ARRAY_TYPE |
+| LIST(TINYINT) | BYTE_ARRAY_TYPE |
+| LIST(SMALLINT) | SHORT_ARRAY_TYPE |
+| LIST(INT) | INT_ARRAY_TYPE |
+| LIST(BIGINT) | LONG_ARRAY_TYPE |
+| LIST(FLOAT) | FLOAT_ARRAY_TYPE |
+| LIST(DOUBLE) | DOUBLE_ARRAY_TYPE |
+| Map<K,V> | MapType; the K and V types will be transformed to SeaTunnel types |
+| STRUCT | SeaTunnelRowType |
-### sheet_name [string]
+## Options
-Reader the sheet of the workbook,Only used when file_format_type is excel.
+| name | type | required | default value | Description |
+|------|------|----------|---------------|-------------|
+| path | string | yes | - | The s3 path that needs to be read can have sub paths, but the sub paths need to meet certain format requirements. Specific requirements can be referred to "parse_partition_from_path" option |
+| file_format_type | string | yes | - | File type, supported as the following file types: `text` `csv` `parquet` `orc` `json` `excel` |
+| bucket | string | yes | - | The bucket address of s3 file system, for example: `s3n://seatunnel-test`, if you use `s3a` protocol, this parameter should be `s3a://seatunnel-test`. |
+| fs.s3a.endpoint | string | yes | - | fs s3a endpoint |
+| fs.s3a.aws.credentials.provider | string | yes | com.amazonaws.auth.InstanceProfileCredentialsProvider | The way to authenticate s3a. We only support `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` and `com.amazonaws.auth.InstanceProfileCredentialsProvider` now. More information about the credential provider you can see [Hadoop AWS Document](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Simple_name.2Fsecret_credentials_with_SimpleAWSCredentialsProvider.2A) |
+| read_columns | list | no | - | The read column list of the data source, user can use it to implement field projection. The file types supporting column projection are: `text` `csv` `parquet` `orc` `json` `excel`. If the user wants to use this feature when reading `text` `json` `csv` files, the "schema" option must be configured. |
+| access_key | string | no | - | Only used when `fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` |
+| access_secret | string | no | - | Only used when `fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` |
+| hadoop_s3_properties | map | no | - | If you need to add other options, you could add them here and refer to this [link](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html) |
+| delimiter | string | no | \001 | Field delimiter, used to tell connector how to slice and dice fields when reading text files. Default `\001`, the same as hive's default delimiter. |
+| parse_partition_from_path | boolean | no | true | Control whether to parse the partition keys and values from file path. For example if you read a file from path `s3n://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`, every record read from the file will have these two fields added: name="tyrantlucifer", age=26 |
+| date_format | string | no | yyyy-MM-dd | Date type format, used to tell connector how to convert string to date, supported as the following formats: `yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`. Default `yyyy-MM-dd` |
+| datetime_format | string | no | yyyy-MM-dd HH:mm:ss | Datetime type format, used to tell connector how to convert string to datetime, supported as the following formats: `yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss` |
+| time_format | string | no | HH:mm:ss | Time type format, used to tell connector how to convert string to time, supported as the following formats: `HH:mm:ss` `HH:mm:ss.SSS` |
+| skip_header_row_number | long | no | 0 | Skip the first few lines, but only for the txt and csv. For example, set `skip_header_row_number = 2`, then SeaTunnel will skip the first 2 lines from source files |
+| schema | config | no | - | The schema of upstream data. |
+| common-options | | no | - | Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details. |
+| sheet_name | string | no | - | Read the sheet of the workbook. Only used when file_format_type is excel. |
## Example
-```hocon
+1. In this example, we read data from the s3 path `s3a://seatunnel-test/seatunnel/text`, and the file type in this path is orc.
+   We use `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` for authentication, so `access_key` and `access_secret` are required.
+   All columns in the file will be read and sent to the sink.
+
+```hocon
+# Defining the runtime environment
+env {
+ # You can set flink configuration here
+ execution.parallelism = 1
+ job.mode = "BATCH"
+}
+source {
S3File {
path = "/seatunnel/text"
fs.s3a.endpoint="s3.cn-north-1.amazonaws.com.cn"
@@ -279,9 +239,21 @@ Reader the sheet of the workbook,Only used when file_format_type is excel.
bucket = "s3a://seatunnel-test"
file_format_type = "orc"
}
+}
+
+transform {
+  # If you would like to get more information about how to configure seatunnel and see full list of transform plugins,
+ # please go to https://seatunnel.apache.org/docs/category/transform-v2
+}
+sink {
+ Console {}
+}
```
+2. Use `InstanceProfileCredentialsProvider` for authentication.
+   The file type in S3 is json, so the schema option needs to be configured.
+
```hocon
S3File {
@@ -300,9 +272,47 @@ Reader the sheet of the workbook,Only used when file_format_type is excel.
```
-### file_filter_pattern [string]
+3. Use `InstanceProfileCredentialsProvider` for authentication.
+   The file type in S3 is json and has five fields (`id`, `name`, `age`, `sex`, `type`), so the schema option needs to be configured.
+   In this job, we only need to send the `id` and `name` columns to the sink.
-Filter pattern, which used for filtering files.
+```hocon
+# Defining the runtime environment
+env {
+ # You can set flink configuration here
+ execution.parallelism = 1
+ job.mode = "BATCH"
+}
+
+source {
+ S3File {
+ path = "/seatunnel/json"
+ bucket = "s3a://seatunnel-test"
+ fs.s3a.endpoint="s3.cn-north-1.amazonaws.com.cn"
+    fs.s3a.aws.credentials.provider="com.amazonaws.auth.InstanceProfileCredentialsProvider"
+ file_format_type = "json"
+ read_columns = ["id", "name"]
+ schema {
+ fields {
+ id = int
+ name = string
+ age = int
+ sex = int
+ type = string
+ }
+ }
+ }
+}
+
+transform {
+  # If you would like to get more information about how to configure seatunnel and see full list of transform plugins,
+ # please go to https://seatunnel.apache.org/docs/category/transform-v2
+}
+
+sink {
+ Console {}
+}
+```
## Changelog