Re: FlinkFileIO implementation

Jean-Baptiste Onofré Thu, 25 Apr 2024 09:48:11 -0700

Hi Peter,

On a similar topic, I created a PR to support custom schemes in
ResolvingFileIO (https://github.com/apache/iceberg/pull/9884). Maybe
FlinkFileIO could be a new scheme/extension in ResolvingFileIO.

While I agree that it would be interesting to have support for
FlinkFileIO, I'm not sure it's a good idea to have it directly in
Iceberg. I think it would be great to leverage the extension mechanism
we already have in Iceberg (FileIO/ResolvingFileIO).
Iceberg Core should not include engine-specific dependencies, imho.
However, having a "flink:" scheme in ResolvingFileIO that delegates to
FlinkFileIO could be interesting.
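
To make the idea a bit more concrete, here is a minimal, self-contained
sketch of scheme-based resolution. It does not use the real
ResolvingFileIO API: the class SchemeResolverSketch, its register()
method, the "flink" scheme, and the class name
com.example.flink.FlinkFileIO are all hypothetical, and the built-in
mappings are only illustrative.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: maps a location's URI scheme to the name of a FileIO
// implementation class. This is NOT the actual ResolvingFileIO code.
public class SchemeResolverSketch {
  // Used when no scheme is present or the scheme is not registered.
  static final String FALLBACK_IMPL = "org.apache.iceberg.hadoop.HadoopFileIO";

  private final Map<String, String> schemeToImpl = new HashMap<>();

  public SchemeResolverSketch() {
    // Illustrative built-in mappings.
    schemeToImpl.put("s3", "org.apache.iceberg.aws.s3.S3FileIO");
    schemeToImpl.put("gs", "org.apache.iceberg.gcp.gcs.GCSFileIO");
  }

  // Hypothetical extension point: engines register their own scheme here,
  // so the core module needs no engine-specific dependency.
  public void register(String scheme, String implClassName) {
    schemeToImpl.put(scheme.toLowerCase(Locale.ROOT), implClassName);
  }

  // Returns the implementation class name to use for the given location.
  public String implFor(String location) {
    String scheme = schemeOf(location);
    if (scheme == null) {
      return FALLBACK_IMPL;
    }
    return schemeToImpl.getOrDefault(scheme, FALLBACK_IMPL);
  }

  // Extracts the lower-cased text before the first ':' or null if there is none.
  static String schemeOf(String location) {
    if (location == null) {
      return null;
    }
    int colon = location.indexOf(':');
    return colon > 0 ? location.substring(0, colon).toLowerCase(Locale.ROOT) : null;
  }

  public static void main(String[] args) {
    SchemeResolverSketch resolver = new SchemeResolverSketch();
    // The Flink module (not core) would do this registration.
    resolver.register("flink", "com.example.flink.FlinkFileIO");
    System.out.println(resolver.implFor("flink://warehouse/db/table"));
    System.out.println(resolver.implFor("s3://bucket/db/table"));
    System.out.println(resolver.implFor("/local/path/table"));
  }
}
```

With something like this, the Flink integration would own both the
FlinkFileIO class and the registration call, and core would only see a
scheme name and a class name.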

Just thinking out loud :)

Regards
JB

On Fri, Apr 19, 2024 at 12:08 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
> Hi Iceberg Team,
>
> Flink has its own FileSystem implementation. See: 
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/.
> This FileSystem already has several implementations:
>
> - Hadoop
> - Azure
> - S3
> - Google Cloud Storage
> - ...
>
> As a general rule in Flink, one should use this FileSystem to consume and 
> persistently store data.
> If these FileSystems are configured, then Flink makes sure that the 
> configurations are consistent and available for the JM/TM.
> Also as an added benefit, delegation tokens are handled and distributed for 
> these FileSystems automatically. See: 
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-delegation-token/
>
> In house, some of our new users are struggling with parameterizing 
> HadoopFileIO and S3FileIO for Iceberg, trying to wrap their heads around the 
> fact that they have to provide different configurations for checkpointing and 
> for the Iceberg table storage (even if they are stored in the same bucket, or 
> on the same HDFS cluster).
>
> I have created a PR, which provides a FileIO implementation which uses 
> FlinkFileSystem. Very imaginatively I have named it FlinkFileIO. See: 
> https://github.com/apache/iceberg/pull/10151
>
> This would allow users to configure the FileSystem only once, and use 
> this FileSystem to access Iceberg tables. Also, if for whatever reason the 
> global nature of the Flink FileSystem config is limiting, users could still 
> switch back to the other FileIO implementations.
>
> What do you think? Would this be a useful addition to the Iceberg-Flink 
> integration?
>
> Thanks,
> Peter
