fapaul commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r785706129
##
File path: docs/content/docs/connectors/datastream/filesystem.md
##
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
under the License.
-->
-# File Sink
+# FileSystem
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes
partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING`
that reads or writes (partitioned) files to filesystems
supported by the [Flink `FileSystem` abstraction]({{< ref
"docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it
is an evolution of the
-existing [Streaming File Sink]({{< ref
"docs/connectors/datastream/streamfile_sink" >}}) which was designed for
providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is
designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any
(distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}})
(e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref
"docs/dev/datastream/sources" >}}#the-data-source-api),
+a unified data source that reads files - both in batch and in streaming mode.
+It is divided into the following two parts: File SplitEnumerator and File
SourceReader.
+
+* File `SplitEnumerator` is responsible for discovering and identifying the
files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the
file from the filesystem.
+
+You'll need to combine the File Source with a [format]({{< ref
"docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+ Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a
recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for
periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but
after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously
detect files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream.
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all
the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the
given path(s).
+In the continuous/streaming case, the source periodically checks the paths for
new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder`
created through one of the above-mentioned methods)
+the source is by default in bounded/batch mode. Call
`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource source =
+FileSource.forRecordStreamFormat(...)
+.monitorContinuously(Duration.ofMillis(5))
+.build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats.
+These define the parsing logic for the contents of the file. There are
multiple classes that the source supports.
+Their interfaces trade of simplicity of implementation and
flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the
simplest format to implement,
+and provides many features out-of-the-box (like checkpointing logic) but is
limited in the optimizations it can apply
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time.
+It is the most "low level" format to implement, but offers the greatest
flexibility to optimize the implementation.
+
+ TextLine format
+
+A `StreamFormat` reader format that text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream
using
+various supported charset encodings.
+This format