[GitHub] [flink] fapaul commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

2022-01-17 Thread GitBox


fapaul commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r786117367



##
File path: docs/content/docs/connectors/datastream/filesystem.md
##
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api),
+a unified data source that reads files - both in batch and in streaming mode.
+It is divided into the following two parts: File `SplitEnumerator` and File `SourceReader`.

Review comment:
   I guess we are mixing various technical concepts here. The interface to build generally applicable enumerators for the FLIP-27 source API is called `SplitEnumerator`, and we have another abstraction, the `FileEnumerator`. Probably `FileEnumerator` is not a good name; it should be more like `FileEnumerationStrategy`.
   
   I agree that seeing both terms is very confusing. I tend to use `SplitEnumerator` wherever possible and maybe have a special section about customizing the file enumeration strategy.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] fapaul commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

2022-01-16 Thread GitBox


fapaul commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r785706129



##
File path: docs/content/docs/connectors/datastream/filesystem.md
##
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api),
+a unified data source that reads files - both in batch and in streaming mode.
+It is divided into the following two parts: File `SplitEnumerator` and File `SourceReader`.
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File `SourceReader`.
+* File `SourceReader` requests the files it needs to process and reads them from the filesystem.
+
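The two-part division above can be sketched in plain Java. This is an illustrative toy, not Flink's actual API; the class and method names below are invented. The enumerator holds the discovered file splits and hands out one split per reader request.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch (not Flink's actual API) of the enumerator/reader
// split: the enumerator discovers file splits and hands them out one by
// one as readers request more work.
public class FileSplitEnumerator {

    private final Queue<String> pendingSplits = new ArrayDeque<>();

    public FileSplitEnumerator(List<String> discoveredFiles) {
        pendingSplits.addAll(discoveredFiles);
    }

    // Called when a reader asks for its next split; null means no work left.
    public String nextSplit() {
        return pendingSplits.poll();
    }

    public static void main(String[] args) {
        FileSplitEnumerator enumerator =
                new FileSplitEnumerator(List.of("part-0.csv", "part-1.csv"));
        String split;
        while ((split = enumerator.nextSplit()) != null) {
            System.out.println("reader processes " + split);
        }
    }
}
```

In the real source, split assignment additionally handles parallel readers, checkpointing, and split serialization; the queue-per-request shape is only the core idea.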
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode Avro or read Parquet columnar files.
+
+### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via the `SplitEnumerator`, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the `SplitEnumerator` will enumerate like the bounded case but, after a certain interval, repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
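The repeated-enumeration behavior can be illustrated with a small, self-contained Java sketch (not Flink code; the class and method names are invented): every discovery pass lists all files, but only never-before-seen paths are forwarded to readers.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch of periodic discovery: each pass sees the full
// directory listing, but only new paths are emitted downstream.
public class IncrementalDiscovery {

    private final Set<String> knownFiles = new HashSet<>();

    // Returns only the files that are new compared to all previous passes.
    public List<String> newFiles(List<String> currentListing) {
        return currentListing.stream()
                .filter(knownFiles::add) // Set.add returns false for already-known paths
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        IncrementalDiscovery discovery = new IncrementalDiscovery();
        System.out.println(discovery.newFiles(List.of("a.csv", "b.csv")));          // first pass: both are new
        System.out.println(discovery.newFiles(List.of("a.csv", "b.csv", "c.csv"))); // second pass: only c.csv
    }
}
```

Note that the real enumerator must also persist this "seen" set in checkpoints so discovery survives failover; this sketch omits fault tolerance entirely.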
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream
+FileSource.forRecordStreamFormat(StreamFormat, Path...)
+
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat, Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s).
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods)
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource source =
+        FileSource.forRecordStreamFormat(...)
+                .monitorContinuously(Duration.ofMillis(5))
+                .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats.
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports.
+Their interfaces trade off simplicity of implementation against flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement,
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time.
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
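The trade-off between the two styles can be illustrated with a self-contained Java sketch. The helper methods here are invented for illustration and are not Flink's actual `StreamFormat`/`BulkFormat` interfaces: one returns a single record per call, the other returns a whole batch, which amortizes per-record overhead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Invented helpers contrasting record-at-a-time reading (StreamFormat-like)
// with batch reading (BulkFormat-like).
public class FormatStyles {

    // StreamFormat-like: one record per call; simple, but little room to optimize.
    public static String readRecord(BufferedReader in) throws IOException {
        return in.readLine();
    }

    // BulkFormat-like: up to maxRecords per call; enables batching, object
    // reuse and other low-level optimizations.
    public static List<String> readBatch(BufferedReader in, int maxRecords) throws IOException {
        List<String> batch = new ArrayList<>(maxRecords);
        String line;
        while (batch.size() < maxRecords && (line = in.readLine()) != null) {
            batch.add(line);
        }
        return batch;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("r1\nr2\nr3\n"));
        System.out.println(readBatch(in, 2)); // first batch of two records
        System.out.println(readBatch(in, 2)); // remaining record
    }
}
```

The batch shape is what allows a columnar reader (e.g. Parquet) to decode many values at once instead of materializing one object per record.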
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using various supported charset encodings.
+This format