MrWhiteSike commented on a change in pull request #18718:
URL: https://github.com/apache/flink/pull/18718#discussion_r808717415
########## File path: docs/content.zh/docs/connectors/datastream/filesystem.md ##########

@@ -811,61 +809,63 @@
 input.sinkTo(sink)
 {{< /tab >}}
 {{< /tabs >}}
-The `SequenceFileWriterFactory` supports additional constructor parameters to specify compression settings.
+The `SequenceFileWriterFactory` provides additional constructor parameters to configure whether compression is enabled.
+
+<a name="bucket-assignment"></a>
+
+### Bucket Assignment
-### Bucket Assignment
+The bucketing logic defines how the data is organized into subdirectories inside the base output directory.
-The bucketing logic defines how the data will be structured into subdirectories inside the base output directory.
+Row-encoded and Bulk-encoded Formats (see [Format Types](#sink-format-types)) use the `DateTimeBucketAssigner` as the default assigner.
+By default, the `DateTimeBucketAssigner` creates hourly buckets in the system default timezone using the format `yyyy-MM-dd--HH`. Both the date format (*i.e.* the bucket size) and the timezone can be configured manually.
-Both row and bulk formats (see [File Formats](#file-formats)) use the `DateTimeBucketAssigner` as the default assigner.
-By default the `DateTimeBucketAssigner` creates hourly buckets based on the system default timezone
-with the following format: `yyyy-MM-dd--HH`. Both the date format (*i.e.* bucket size) and timezone can be
-configured manually.
+We can specify a custom `BucketAssigner` by calling `.withBucketAssigner(assigner)` on the format builders.
-We can specify a custom `BucketAssigner` by calling `.withBucketAssigner(assigner)` on the format builders.
+Flink comes with two built-in BucketAssigners:
-Flink comes with two built-in BucketAssigners:
+- `DateTimeBucketAssigner`: the default time-based assigner
+- `BasePathBucketAssigner`: an assigner that stores all part files in the base path (a single global bucket)
-- `DateTimeBucketAssigner` : Default time based assigner
-- `BasePathBucketAssigner` : Assigner that stores all part files in the base path (single global bucket)
+<a name="rolling-policy"></a>
-### Rolling Policy
+### Rolling Policy
-The `RollingPolicy` defines when a given in-progress part file will be closed and moved to the pending and later to finished state.
-Part files in the "finished" state are the ones that are ready for viewing and are guaranteed to contain valid data that will not be reverted in case of failure.
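A side note for readers of this hunk: the default bucketing behavior described above (hourly buckets derived from processing time in a configurable timezone) can be sketched in plain Java with no Flink dependency. The class and method names below are illustrative, not Flink API:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Plain-Java sketch of how a time-based bucket id is derived: format the
// current processing time with the configured pattern in the configured
// zone. The default pattern "yyyy-MM-dd--HH" yields one bucket
// (subdirectory) per hour.
public class BucketIdSketch {
    static String bucketId(Instant processingTime, String pattern, ZoneId zone) {
        return DateTimeFormatter.ofPattern(pattern).withZone(zone).format(processingTime);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2022-02-17T09:30:00Z");
        // Records written at 09:30 UTC land in <base>/2022-02-17--09/
        System.out.println(bucketId(t, "yyyy-MM-dd--HH", ZoneId.of("UTC")));   // prints 2022-02-17--09
        // The same instant in another zone lands in a different bucket:
        System.out.println(bucketId(t, "yyyy-MM-dd--HH", ZoneId.of("Asia/Shanghai"))); // prints 2022-02-17--17
    }
}
```

This also shows why the docs call the date format the "bucket size": a coarser pattern such as `yyyy-MM-dd` would produce one bucket per day instead of per hour.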
-In `STREAMING` mode, the Rolling Policy in combination with the checkpointing interval (pending files become finished on the next checkpoint) control how quickly
-part files become available for downstream readers and also the size and number of these parts. In `BATCH` mode, part-files become visible at the end of the job but
-the rolling policy can control their maximum size.
+The `RollingPolicy` defines when a given in-progress part file is closed, moved to the pending state, and later moved to the finished state.
+Files in the finished state are ready for viewing and are guaranteed to contain valid data that will not be reverted in case of failure.
+In `STREAMING` mode, the rolling policy together with the checkpoint interval (pending files become finished on the next checkpoint) controls how quickly part files become visible to downstream readers, as well as the size and number of these files. In `BATCH` mode, part files become visible to downstream only at the end of the job, and the rolling policy controls only the maximum part file size.
-Flink comes with two built-in RollingPolicies:
+Flink comes with two built-in RollingPolicies:
 - `DefaultRollingPolicy`
 - `OnCheckpointRollingPolicy`
-### Part file lifecycle
+<a name="part-file-lifecycle"></a>
-In order to use the output of the `FileSink` in downstream systems, we need to understand the naming and lifecycle of the output files produced.
+### Part File Lifecycle
-Part files can be in one of three states:
-1. **In-progress** : The part file that is currently being written to is in-progress
-2. **Pending** : Closed (due to the specified rolling policy) in-progress files that are waiting to be committed
-3. **Finished** : On successful checkpoints (`STREAMING`) or at the end of input (`BATCH`) pending files transition to "Finished"
+To use the output of the `FileSink` in downstream systems, we need to understand the naming and lifecycle of the output files it produces.
-Only finished files are safe to read by downstream systems as those are guaranteed to not be modified later.
+Part files can be in one of three states:
+1. **In-progress**: the part file that is currently being written to
+2. **Pending**: an in-progress file that has been closed (due to the specified rolling policy) and is waiting to be committed
+3. **Finished**: on a successful checkpoint (`STREAMING`) or at the end of input (`BATCH`), pending files transition to finished
-Each writer subtask will have a single in-progress part file at any given time for every active bucket, but there can be several pending and finished files.
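To make the rolling conditions above concrete: the built-in `DefaultRollingPolicy` rolls an in-progress file based on size, open time, or inactivity, while `OnCheckpointRollingPolicy` rolls on every checkpoint. Below is a minimal plain-Java sketch of the former's decision logic; the thresholds and names are illustrative, not Flink API or Flink defaults:

```java
import java.time.Duration;

// Plain-Java sketch of the three conditions a DefaultRollingPolicy-style
// policy checks: roll the in-progress part file when it grows past a
// maximum size, has been open longer than a rollover interval, or has
// received no writes for an inactivity interval.
public class RollingSketch {
    final long maxPartSizeBytes;
    final Duration rolloverInterval;
    final Duration inactivityInterval;

    RollingSketch(long maxPartSizeBytes, Duration rolloverInterval, Duration inactivityInterval) {
        this.maxPartSizeBytes = maxPartSizeBytes;
        this.rolloverInterval = rolloverInterval;
        this.inactivityInterval = inactivityInterval;
    }

    boolean shouldRoll(long currentSizeBytes, Duration openFor, Duration idleFor) {
        return currentSizeBytes >= maxPartSizeBytes
                || openFor.compareTo(rolloverInterval) >= 0
                || idleFor.compareTo(inactivityInterval) >= 0;
    }

    public static void main(String[] args) {
        // Example thresholds: 128 MiB max size, 15 min rollover, 5 min inactivity.
        RollingSketch policy = new RollingSketch(
                128L * 1024 * 1024, Duration.ofMinutes(15), Duration.ofMinutes(5));
        // A 200 MiB file rolls immediately on size:
        System.out.println(policy.shouldRoll(200L * 1024 * 1024, Duration.ofMinutes(1), Duration.ZERO)); // true
        // A small, recently written file keeps accumulating records:
        System.out.println(policy.shouldRoll(1024, Duration.ofMinutes(2), Duration.ofSeconds(30)));      // false
    }
}
```

Note that in `STREAMING` mode a rolled (pending) file still only becomes finished on the next successful checkpoint, which is why the doc describes the rolling policy and the checkpoint interval as acting together.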
+Only files in the finished state are safe for downstream systems to read, as they are guaranteed not to be modified later.
-**Part file example**
+For every active bucket, each writer subtask has a single in-progress part file at any given time, but there can be several pending and finished files.
-To better understand the lifecycle of these files let's look at a simple example with 2 sink subtasks:
+**Part file example**

Review comment: No, it isn't a link tag.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
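A closing aside for readers of this section: the part-file lifecycle discussed in the hunk above is visible in the file names on disk. The sketch below is plain Java with illustrative helper names, not Flink API; the exact default scheme (which also embeds a unique id in in-progress files) is in the surrounding document:

```java
// Plain-Java sketch of how part-file names reflect lifecycle states:
// finished files use a visible "part-<writer>-<counter>" pattern, while
// in-progress files are hidden behind a dot prefix plus an ".inprogress"
// marker until they are committed.
public class PartFileNameSketch {
    static String finished(int writerId, int counter) {
        return "part-" + writerId + "-" + counter;
    }

    static String inProgress(int writerId, int counter) {
        // Flink additionally appends a unique id to in-progress files;
        // omitted here for brevity.
        return "." + finished(writerId, counter) + ".inprogress";
    }

    public static void main(String[] args) {
        // Two sink subtasks writing into the same hourly bucket:
        System.out.println("2022-02-17--09/" + finished(0, 0));    // safe to read downstream
        System.out.println("2022-02-17--09/" + inProgress(1, 0));  // still being written, skip
    }
}
```

The dot prefix matters in practice: tools that skip hidden files (e.g. Hive or Spark directory scans) will naturally ignore in-progress files and only pick up finished output.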