[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346542#comment-17346542
 ] 

Xintong Song commented on FLINK-19481:
--------------------------------------

{quote}The runtime complexity of having the additional Hadoop layer will likely 
be strictly worse. This is because each layer has its own configuration and 
things like thread pooling, pool sizes, buffering, and other non-trivial tuning 
parameters.
{quote}
I'm not sure about this. Looking into o.a.f.runtime.fs.hdfs.HadoopFileSystem, 
the Flink filesystem is practically a layer of API mappings around the Hadoop 
filesystem. It might be true that the parameters to be tuned are separated into 
different layers, but I wonder how much extra complexity the additional layer 
actually introduces. Shouldn't the total number of parameters be the same?
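For illustration, the kind of wrapper described above can be sketched as a plain delegation (adapter) pattern. The interfaces and names below are hypothetical stand-ins, not the real Flink or Hadoop classes, which have many more methods:

```java
// Hypothetical minimal stand-in for the Hadoop FileSystem API.
interface HadoopFs {
    byte[] read(String path);
}

// Hypothetical minimal stand-in for the Flink FileSystem API.
interface FlinkFs {
    byte[] open(String path);
}

// The adapter holds no state and exposes no tuning parameters of its own:
// every call is forwarded directly to the wrapped Hadoop filesystem, so
// all configuration stays in the inner layer.
class HadoopFsAdapter implements FlinkFs {
    private final HadoopFs delegate;

    HadoopFsAdapter(HadoopFs delegate) {
        this.delegate = delegate;
    }

    @Override
    public byte[] open(String path) {
        // Pure API mapping: no extra buffering, pooling, or retries.
        return delegate.read(path);
    }
}
```

In a sketch like this, the adapter layer contributes API translation only; the tunable behavior lives entirely in the wrapped filesystem.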
{quote}In my experience the more native (fewer layers of abstraction) you can 
achieve the better the result.
{quote}
I admit that, if we were building the GCS file system from the ground up, the 
fewer layers the better:
 # GCS SDK -> Hadoop FileSystem -> Flink FileSystem
 # GCS SDK -> Flink FileSystem

However, we don't have to build everything from the ground up. For the first 
path above, there are already off-the-shelf solutions for both mappings (the 
Google connector for SDK -> Hadoop FS, and o.a.f.runtime.fs.hdfs.HadoopFileSystem 
for Hadoop -> Flink). It requires almost no extra effort beyond assembling 
existing artifacts. The second path, on the other hand, requires implementing 
a brand new file system, which seems to be re-inventing the wheel.
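Assembling the existing artifacts is mostly a configuration exercise. As a rough sketch (assuming the standard configuration keys of the GoogleCloudDataproc hadoop-connectors project; verify against its documentation), the Hadoop side would map the gs:// scheme to the connector, and Flink's Hadoop wrapper would then pick it up unchanged:

```xml
<!-- core-site.xml fragment (sketch): wire the gs:// scheme to the
     off-the-shelf Google Cloud Storage connector for Hadoop. -->
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
</configuration>
```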
{quote}It seems from reading the comments here though that a good solution 
would be a hybrid of Ben's work on the native GCS Filesystem combined with 
Galen's work on the RecoverableWriter.
{quote}
Unless there is more input on why we should have a native GCS file system, I'm 
leaning towards not introducing such a native implementation, based on the 
discussion so far.

 

> Add support for a flink native GCS FileSystem
> ---------------------------------------------
>
>                 Key: FLINK-19481
>                 URL: https://issues.apache.org/jira/browse/FLINK-19481
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem, FileSystems
>    Affects Versions: 1.12.0
>            Reporter: Ben Augarten
>            Priority: Minor
>              Labels: auto-deprioritized-major
>
> Currently, GCS is supported, but only by using the Hadoop connector [1].
>  
> The objective of this improvement is to add support for checkpointing to 
> Google Cloud Storage with the Flink FileSystem.
>  
> This would allow the `gs://` scheme to be used for savepointing and 
> checkpointing. Long term, it would be nice if we could use the GCS FileSystem 
> as a source and sink in Flink jobs as well.
>  
> Long term, I hope that implementing a Flink-native GCS FileSystem will 
> simplify usage of GCS, because the Hadoop FileSystem ends up bringing in many 
> unshaded dependencies.
>  
> [1] 
> [https://github.com/GoogleCloudDataproc/hadoop-connectors]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
