Hi Sivaprasanna,

thanks for starting this discussion. In general I like the idea of removing
duplication and moving common code to a shared module. As a recommendation,
I would exclude the whole part about Flink's Hadoop compatibility modules
because they are legacy code and hardly used anymore. This would also have
the benefit of making the scope of the proposal a bit smaller.

What we now need is a committer who wants to help with this effort. It
might be that this takes a bit of time as many of the committers are quite
busy.

Cheers,
Till

On Thu, Mar 19, 2020 at 2:15 PM Sivaprasanna <sivaprasanna...@gmail.com>
wrote:

> Hi,
>
> Continuing an earlier discussion[1] about having a separate module
> for Hadoop-related utility components, I have gone through our project
> briefly and found the following components which I feel could be moved to a
> separate module for reusability and a better module structure.
>
> Module / Class / Used at & Remarks:
>
> *flink-hadoop-fs*
> Class: flink.runtime.util.HadoopUtils
> Used at:
>   flink-runtime => HadoopModule & HadoopModuleFactory
>   flink-swift-fs-hadoop => SwiftFileSystemFactory
>   flink-yarn => Utils, YarnClusterDescriptor
>
> *flink-hadoop-compatibility*
> Classes: api.java.hadoop.mapred.utils.HadoopUtils and
> api.java.hadoop.mapreduce.utils.HadoopUtils
> Remarks: both belong to the same module but under different packages
> (api.java.hadoop.mapred and api.java.hadoop.mapreduce)
>
> *flink-sequence-file*
> Class: formats.sequencefile.SerializableHadoopConfiguration
> Remarks: currently used by formats.sequencefile.SequenceFileWriterFactory
> but could also be used by HadoopCompressionBulkWriter, a potential
> OrcBulkWriter, and pretty much anywhere else to avoid
> NotSerializableException.
>
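> For anyone unfamiliar with the pattern, here is a minimal, self-contained
> sketch of what SerializableHadoopConfiguration does: wrap a configuration
> object that is not itself Serializable and write its contents out by hand.
> Note that java.util.Properties stands in for org.apache.hadoop.conf.Configuration
> purely so the sketch compiles on its own; the class and method names here
> are illustrative, not Flink's actual API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Properties;

// Sketch of the wrapper pattern: the wrapped config is not Serializable,
// so we mark the field transient and serialize its key/value pairs manually.
// java.util.Properties is a stand-in for org.apache.hadoop.conf.Configuration.
class SerializableConfigurationSketch implements Serializable {

    private static final long serialVersionUID = 1L;

    // transient: written out by hand in writeObject below
    private transient Properties config;

    SerializableConfigurationSketch(Properties config) {
        this.config = config;
    }

    Properties get() {
        return config;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        // Dump the key/value pairs into a length-prefixed byte block.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        config.store(buf, null);
        byte[] bytes = buf.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // Rebuild the wrapped configuration from the byte block.
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        config = new Properties();
        config.load(new ByteArrayInputStream(bytes));
    }
}
```

> Any user function holding such a wrapper can then be shipped through Java
> serialization without hitting NotSerializableException.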
> *Proposal*
> To summarise, I believe we can create a new module (flink-hadoop-utils?)
> and move these reusable components into it, with an optional/provided
> dependency on flink-shaded-hadoop-2.
>
> *Structure*
> In the present form, I think we will have two classes, with the packaging
> structure being *org.apache.flink.hadoop.[utils/serialization]*:
> 1. A HadoopUtils class with all static methods (after combining and
> eliminating the duplicate code fragments from the three HadoopUtils classes
> mentioned above).
> 2. The existing SerializableHadoopConfiguration, moved from
> flink-sequence-file to this new module.
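> As a rough illustration of the kind of duplicated logic the merged class
> could own, here is a sketch of building the candidate directories in which
> Hadoop configuration files (core-site.xml, hdfs-site.xml) are searched.
> Class and method names are hypothetical, not Flink's actual API; the
> environment-variable precedence shown is an assumption for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a helper a consolidated HadoopUtils could expose:
// collect the directories to probe for Hadoop config files, preferring
// HADOOP_CONF_DIR and falling back to the layouts under HADOOP_HOME.
class HadoopConfLookupSketch {

    static List<String> possibleHadoopConfPaths(String hadoopConfDir, String hadoopHome) {
        List<String> paths = new ArrayList<>();
        if (hadoopConfDir != null) {
            paths.add(hadoopConfDir);              // HADOOP_CONF_DIR wins if set
        }
        if (hadoopHome != null) {
            paths.add(hadoopHome + "/conf");       // Hadoop 1.x layout
            paths.add(hadoopHome + "/etc/hadoop"); // Hadoop 2.x layout
        }
        return paths;
    }
}
```

> Centralising a helper like this would let flink-yarn, flink-runtime, and
> the filesystem modules share one lookup path instead of three copies.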
>
> *Justification*
> * With this change, we would be stripping away the dependency on
> flink-hadoop-fs from flink-runtime, as I don't see any other classes from
> flink-hadoop-fs being used anywhere in the flink-runtime module.
> * We will have a common place for all Hadoop-related utilities, which can
> be reused easily without leading to jar hell.
>
> In addition to this, if you are aware of any other classes that would fit
> this approach, please share the details here.
>
> *Note*
> I don't have a complete understanding here, but I did see two
> implementations of each of the following classes under two different
> packages, *.mapred and *.mapreduce:
> * HadoopInputFormat
> * HadoopInputFormatBase
> * HadoopOutputFormat
> * HadoopOutputFormatBase
>
> Can we somehow figure out a way to move them into this new module as well?
>
> Thanks,
> Sivaprasanna
>
> [1]
>
> https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E
>
