galenwarren commented on a change in pull request #18430:
URL: https://github.com/apache/flink/pull/18430#discussion_r791256107



##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@ 
env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
 
 ```
 
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other places as well, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-embeddedrocksdbstatebackend); in short, everywhere that Flink expects a FileSystem URI.
 
-You must include the following jars in Flink's `lib` directory to connect Flink with gcs:
+### GCS File System plugin
 
-```xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-shaded-hadoop2-uber</artifactId>
-  <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use it.
 
-<dependency>
-  <groupId>com.google.cloud.bigdataoss</groupId>
-  <artifactId>gcs-connector</artifactId>
-  <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the *gs://* scheme. It uses Google's [gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector) Hadoop library to access GCS. It also uses Google's [google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage) library to provide `RecoverableWriter` support.
 
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop 2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).

Review comment:
       Updated in https://github.com/apache/flink/pull/18430/commits/4d4b6ef9bf6c75d3dbbe3d64b58eb5ffcc218ebf.
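
For context, the `FileSink` usage that the new doc text refers to would look roughly like the following. This is an illustrative sketch rather than part of the PR; the bucket name, output path, and job name are placeholders:

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class GcsFileSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // FileSink only finalizes (commits) in-progress files on checkpoints.
        env.enableCheckpointing(60_000);

        DataStream<String> lines = env.fromElements("alpha", "beta", "gamma");

        // The gs:// scheme is resolved by the flink-gs-fs-hadoop plugin;
        // "my-bucket" and "output" are placeholder names.
        FileSink<String> sink =
                FileSink.forRowFormat(
                                new Path("gs://my-bucket/output"),
                                new SimpleStringEncoder<String>("UTF-8"))
                        .build();

        lines.sinkTo(sink);
        env.execute("gcs-filesink-sketch");
    }
}
```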

##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@ 
env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
 
 ```
 
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other places as well, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-embeddedrocksdbstatebackend); in short, everywhere that Flink expects a FileSystem URI.
 
-You must include the following jars in Flink's `lib` directory to connect Flink with gcs:
+### GCS File System plugin
 
-```xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-shaded-hadoop2-uber</artifactId>
-  <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use it.
 
-<dependency>
-  <groupId>com.google.cloud.bigdataoss</groupId>
-  <artifactId>gcs-connector</artifactId>
-  <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the *gs://* scheme. It uses Google's [gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector) Hadoop library to access GCS. It also uses Google's [google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage) library to provide `RecoverableWriter` support.
 
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop 2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).
 
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the `plugins` directory of your Flink distribution before starting Flink, i.e.
 
-Most operations on GCS require authentication. Please see [the documentation on Google Cloud Storage authentication](https://cloud.google.com/storage/docs/authentication) for more information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
 
-You can use the following method for authentication
-* Configure via core-site.xml
-  You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs configuration keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md) by adding the configurations to your `flink-conf.yaml`.

Review comment:
       Yes, that's better. Good catch. Updated in https://github.com/apache/flink/pull/18430/commits/4d4b6ef9bf6c75d3dbbe3d64b58eb5ffcc218ebf.
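
As a concrete illustration of the "other places" the new text mentions, `flink-conf.yaml` entries such as the following would point checkpoints and high-availability metadata at GCS. This is a sketch, not part of the PR; the bucket and paths are placeholders:

```yaml
# Both keys accept any FileSystem URI; with the gs-fs-hadoop plugin
# installed, gs:// URIs work here. Bucket and paths are placeholders.
state.checkpoints.dir: gs://my-bucket/checkpoints
high-availability.storageDir: gs://my-bucket/ha
```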

##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@ 
env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
 
 ```
 
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other places as well, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-embeddedrocksdbstatebackend); in short, everywhere that Flink expects a FileSystem URI.
 
-You must include the following jars in Flink's `lib` directory to connect Flink with gcs:
+### GCS File System plugin
 
-```xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-shaded-hadoop2-uber</artifactId>
-  <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use it.
 
-<dependency>
-  <groupId>com.google.cloud.bigdataoss</groupId>
-  <artifactId>gcs-connector</artifactId>
-  <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the *gs://* scheme. It uses Google's [gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector) Hadoop library to access GCS. It also uses Google's [google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage) library to provide `RecoverableWriter` support.
 
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop 2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).
 
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the `plugins` directory of your Flink distribution before starting Flink, i.e.
 
-Most operations on GCS require authentication. Please see [the documentation on Google Cloud Storage authentication](https://cloud.google.com/storage/docs/authentication) for more information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
 
-You can use the following method for authentication
-* Configure via core-site.xml
-  You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs configuration keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md) by adding the configurations to your `flink-conf.yaml`.
 
-  ```xml
-  <configuration>
-    <property>
-      <name>google.cloud.auth.service.account.enable</name>
-      <value>true</value>
-    </property>
-    <property>
-      <name>google.cloud.auth.service.account.json.keyfile</name>
-      <value><PATH TO GOOGLE AUTHENTICATION JSON></value>
-    </property>
-  </configuration>
-  ```
+For example, Hadoop has a `fs.gs.http.connect-timeout` configuration key. If you want to change it, you need to set `gs.http.connect-timeout: xyz` in `flink-conf.yaml`. Flink will internally translate this back to `fs.gs.http.connect-timeout`. There is no need to pass configuration parameters using Hadoop's XML configuration files.

Review comment:
       Updated in https://github.com/apache/flink/pull/18430/commits/4d4b6ef9bf6c75d3dbbe3d64b58eb5ffcc218ebf.
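
To make the key translation concrete, the corresponding `flink-conf.yaml` entry would look like the following sketch; the timeout value is illustrative only:

```yaml
# Forwarded to the Hadoop gcs-connector as fs.gs.http.connect-timeout
# (connect timeout in milliseconds; the value here is illustrative).
gs.http.connect-timeout: 20000
```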



