[jira] [Comment Edited] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-15 Thread Jamie Grier (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345032#comment-17345032
 ] 

Jamie Grier edited comment on FLINK-19481 at 5/15/21, 1:25 PM:
---

The primary benefits of a native implementation are described earlier in this 
ticket.  This is based on my own experience in production for several years 
with the other Hadoop based File Systems – primarily the S3 one though.

 


was (Author: jgrier):
The primary benefits of a native implementation are described earlier in this 
ticket.  This is based on my own experience in production for several years 
with the other Hadoop based File Systems – primarily the S3 one though.

 
{noformat}
I think a native GCS filesystem would be a major benefit to Flink users.  The 
only way to support GCS currently is, as stated, through the Hadoop Filesystem 
implementation, which brings several problems along with it.  The two largest 
problems I've experienced are:

1) Hadoop has a huge dependency footprint, which is a significant headache for 
Flink application developers dealing with dependency hell.

2) The total stack of FileSystem abstractions on this path becomes very 
difficult to tune, understand, and support.  By stack I'm referring to Flink's 
own FileSystem abstraction, then the Hadoop layer, then the GCS libraries.  
This is very difficult to work with in production as each layer has its own 
intricacies, connection pools, thread pools, tunable configuration, versions, 
dependency versions, etc.

Having gone down this path with the old-style Hadoop+S3 filesystem approach, I 
know how difficult it can be, and a native implementation should prove to be 
much simpler to support and easier to tune and modify for performance.  This 
is why the presto-s3-fs filesystem was adopted, for example.
{noformat}

> Add support for a flink native GCS FileSystem
> -
>
> Key: FLINK-19481
> URL: https://issues.apache.org/jira/browse/FLINK-19481
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / FileSystem, FileSystems
>Affects Versions: 1.12.0
>Reporter: Ben Augarten
>Priority: Minor
>  Labels: auto-deprioritized-major
>
> Currently, GCS is supported, but only through the Hadoop connector [1].
>  
> The objective of this improvement is to add support for checkpointing to 
> Google Cloud Storage with the Flink File System.
>  
> This would allow the `gs://` scheme to be used for savepointing and 
> checkpointing. Long term, it would be nice if we could use the GCS FileSystem 
> as a source and sink in Flink jobs as well. 
>  
> Long term, I hope that implementing a Flink-native GCS FileSystem will 
> simplify usage of GCS because the Hadoop FileSystem ends up bringing in many 
> unshaded dependencies.
>  
> [1] https://github.com/GoogleCloudDataproc/hadoop-connectors



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-15 Thread Jamie Grier (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345032#comment-17345032
 ] 

Jamie Grier edited comment on FLINK-19481 at 5/15/21, 1:25 PM:
---

The primary benefits of a native implementation are described earlier in this 
ticket.  This is based on my own experience in production for several years 
with the other Hadoop based File Systems – primarily the S3 one though.

 

https://issues.apache.org/jira/browse/FLINK-19481?focusedCommentId=17211477&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17211477
 


was (Author: jgrier):
The primary benefits of a native implementation are described earlier in this 
ticket.  This is based on my own experience in production for several years 
with the other Hadoop based File Systems – primarily the S3 one though.

 



[jira] [Comment Edited] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-16 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345823#comment-17345823
 ] 

Xintong Song edited comment on FLINK-19481 at 5/17/21, 2:08 AM:


Hi [~jgrier], thanks for your input.

I have noticed your earlier comment. However, that comment predates Galen's 
PR, and I think things are a bit different with this PR now.
- IIUC, what you've described are the benefits of a native implementation 
*compared to the current status*, where Flink does not provide any specific 
support for GS and users have to deal with the Hadoop dependencies and Flink's 
FS abstractions by themselves. 
- What I'm trying to understand are the benefits *compared to the status once 
Galen's PR is merged*. The PR provides an out-of-the-box Hadoop-based GS FS 
implementation, so that users no longer need to deal with the dependencies and 
abstractions. In that case, is it still beneficial for this implementation to 
be built internally directly on top of the GCS native SDK, rather than 
leveraging the existing Hadoop stack provided by the Google storage connector?


was (Author: xintongsong):
Hi [~jgrier], thanks for your input.

I have noticed your earlier comment. However, that comment predates Galen's 
PR, and I think things are a bit different with this PR now.
- IIUC, what you've described are the benefits of a native implementation 
*compared to the current status*, where Flink does not provide any specific 
support for GS and users have to deal with the Hadoop dependencies and Flink's 
FS abstractions by themselves. 
- What I'm trying to understand are the benefits *compared to the status once 
Galen's PR is merged*. The PR provides an out-of-the-box GS FS implementation, 
so that users no longer need to deal with the dependencies and abstractions. 
In that case, is it still beneficial for this implementation to be built 
internally directly on top of the GCS native SDK, rather than leveraging the 
existing Hadoop stack provided by the Google storage connector?
