Bump. Please let me know if someone is interested in reviewing this one; I am willing to start working on it. BTW, a small new addition to the list: with FLINK-10114 merged, OrcBulkWriterFactory can also reuse `SerializableHadoopConfiguration`, along with SequenceFileWriterFactory and CompressWriterFactory.
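For context on why `SerializableHadoopConfiguration` keeps coming up: Hadoop's `Configuration` is not `java.io.Serializable`, so shipping it inside a writer factory fails with `NotSerializableException`. The wrapper pattern can be sketched as below; note this is an illustrative stand-alone sketch, not Flink's actual class — `FakeConfig` is a hypothetical stand-in for Hadoop's `Configuration` so the example has no Hadoop dependency:

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.hadoop.conf.Configuration: holds key/value
// settings but is deliberately NOT Serializable.
class FakeConfig {
    final Map<String, String> props = new HashMap<>();
    void set(String k, String v) { props.put(k, v); }
    String get(String k) { return props.get(k); }

    // Configuration exposes write(DataOutput)/readFields(DataInput); mimic that.
    void write(DataOutput out) throws IOException {
        out.writeInt(props.size());
        for (Map.Entry<String, String> e : props.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            props.put(in.readUTF(), in.readUTF());
        }
    }
}

// The wrapper pattern: keep the raw config transient and hand-roll
// writeObject/readObject using the config's own binary format.
class SerializableConfigWrapper implements Serializable {
    private transient FakeConfig config;

    SerializableConfigWrapper(FakeConfig config) { this.config = config; }

    FakeConfig get() { return config; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        config.write(out); // ObjectOutputStream implements DataOutput
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        config = new FakeConfig();
        config.readFields(in); // ObjectInputStream implements DataInput
    }
}

public class WrapperDemo {
    public static void main(String[] args) throws Exception {
        FakeConfig cfg = new FakeConfig();
        cfg.set("fs.defaultFS", "hdfs://namenode:8020");

        // Round-trip through Java serialization, as Flink would when
        // shipping a writer factory to task managers.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new SerializableConfigWrapper(cfg));
        }
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            SerializableConfigWrapper round = (SerializableConfigWrapper) ois.readObject();
            System.out.println(round.get().get("fs.defaultFS"));
        }
    }
}
```

Because the pattern is self-contained, any factory (SequenceFile, compression, a future ORC writer) can hold the wrapper instead of the raw configuration, which is the reuse the thread is proposing.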
CC - Kostas Kloudas, since he has a better understanding of `SerializableHadoopConfiguration`.

Cheers,
Sivaprasanna

On Mon, Mar 30, 2020 at 3:17 PM Chesnay Schepler <ches...@apache.org> wrote:

> I would recommend waiting until a committer has signed up to review
> your changes before preparing any PR. Otherwise the chances are high
> that you invest a lot of time but the changes never get in.
>
> On 30/03/2020 11:42, Sivaprasanna wrote:
> > Hello Till,
> >
> > I agree with keeping the scope limited and more concentrated. I can
> > file a Jira and get started with the code changes; as and when someone
> > has some bandwidth, the review can be done as well. What do you think?
> >
> > Cheers,
> > Sivaprasanna
> >
> > On Mon, Mar 30, 2020 at 3:00 PM Till Rohrmann <trohrm...@apache.org> wrote:
> >
> >> Hi Sivaprasanna,
> >>
> >> thanks for starting this discussion. In general I like the idea of
> >> removing duplication and moving common code to a shared module. As a
> >> recommendation, I would exclude the whole part about Flink's Hadoop
> >> compatibility modules, because they are legacy code and hardly used
> >> anymore. This would also have the benefit of making the scope of the
> >> proposal a bit smaller.
> >>
> >> What we now need is a committer who wants to help with this effort.
> >> It might take a bit of time, as many of the committers are quite busy.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Thu, Mar 19, 2020 at 2:15 PM Sivaprasanna <sivaprasanna...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Continuing an earlier discussion[1] about having a separate module
> >>> for Hadoop-related utility components, I went through our project
> >>> briefly and found the following components which I feel could be
> >>> moved to a separate module for reusability and a better module
> >>> structure.
> >>>
> >>> Module: flink-hadoop-fs
> >>>   Class: flink.runtime.util.HadoopUtils
> >>>   Used at: flink-runtime => HadoopModule & HadoopModuleFactory
> >>>            flink-swift-fs-hadoop => SwiftFileSystemFactory
> >>>            flink-yarn => Utils, YarnClusterDescriptor
> >>>
> >>> Module: flink-hadoop-compatibility
> >>>   Classes: api.java.hadoop.mapred.utils.HadoopUtils
> >>>            api.java.hadoop.mapreduce.utils.HadoopUtils
> >>>   Remarks: both belong to the same module but live in different
> >>>            packages (api.java.hadoop.mapred and api.java.hadoop.mapreduce).
> >>>
> >>> Module: flink-sequence-file
> >>>   Class: formats.sequencefile.SerializableHadoopConfiguration
> >>>   Remarks: currently used at formats.sequencefile.SequenceFileWriterFactory,
> >>>            but could also be used at HadoopCompressionBulkWriter, a
> >>>            potential OrcBulkWriter, and pretty much everywhere else to
> >>>            avoid NotSerializableException.
> >>>
> >>> *Proposal*
> >>> To summarise, I believe we can create a new module (flink-hadoop-utils?)
> >>> and move these reusable components into it, with an optional/provided
> >>> dependency on flink-shaded-hadoop-2.
> >>>
> >>> *Structure*
> >>> In its present form, I think we will have two classes, with the package
> >>> structure being *org.apache.flink.hadoop.[utils/serialization]*:
> >>> 1. HadoopUtils with all static methods (after combining and eliminating
> >>> the duplicate code fragments from the three HadoopUtils classes
> >>> mentioned above).
> >>> 2. The existing SerializableHadoopConfiguration, moved from
> >>> flink-sequence-file to this new module.
> >>>
> >>> *Justification*
> >>> * With this change, we would strip away flink-runtime's dependency on
> >>> flink-hadoop-fs, as I don't see any other classes from flink-hadoop-fs
> >>> being used anywhere in the flink-runtime module.
> >>> * We will have a common place for all Hadoop-related utilities, which
> >>> can be reused easily without leading to jar hell.
> >>>
> >>> In addition to this, if you are aware of any other classes that fit
> >>> this approach, please share the details here.
> >>>
> >>> *Note*
> >>> I don't have a complete understanding here, but I did see two
> >>> implementations of the following classes under two different packages,
> >>> *.mapred and *.mapreduce:
> >>> * HadoopInputFormat
> >>> * HadoopInputFormatBase
> >>> * HadoopOutputFormat
> >>> * HadoopOutputFormatBase
> >>>
> >>> Can we somehow figure out a way to have them in this new module as well?
> >>>
> >>> Thanks,
> >>> Sivaprasanna
> >>>
> >>> [1]
> >>> https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E
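One piece the quoted proposal alludes to is that the three HadoopUtils classes each reimplement the same Hadoop config-directory lookup. A merged utility in the proposed flink-hadoop-utils module could centralize it; the sketch below is purely illustrative (the class name, method name, and exact lookup order are assumptions, not Flink's actual API), using only the JDK so it stays self-contained:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical consolidated utility: resolves candidate Hadoop configuration
// directories, a job the duplicated HadoopUtils classes each do today.
public final class HadoopConfigResolver {
    private HadoopConfigResolver() {}

    // Assumed lookup order for this sketch:
    // 1. HADOOP_CONF_DIR, if set
    // 2. HADOOP_HOME/conf and HADOOP_HOME/etc/hadoop, if HADOOP_HOME is set
    static List<String> candidateConfDirs(Map<String, String> env) {
        List<String> dirs = new ArrayList<>();
        String confDir = env.get("HADOOP_CONF_DIR");
        if (confDir != null) {
            dirs.add(confDir);
        }
        String home = env.get("HADOOP_HOME");
        if (home != null) {
            dirs.add(home + File.separator + "conf");
            dirs.add(home + File.separator + "etc" + File.separator + "hadoop");
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Pass the environment as a map so callers can test without
        // mutating real process state; real code would use System.getenv().
        Map<String, String> env = Map.of("HADOOP_HOME", "/opt/hadoop");
        System.out.println(candidateConfDirs(env));
    }
}
```

Taking the environment as a parameter rather than calling `System.getenv()` directly is one way the consolidated class could stay unit-testable, which the three scattered copies make awkward today.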