Re: [DISCUSS] Unification of Hadoop related IO modules

Robert Bradshaw Fri, 07 Sep 2018 04:46:27 -0700

Agree about not impacting users. Perhaps I misread (3), but isn't it fully
backwards compatible as well?


On Fri, Sep 7, 2018 at 1:33 PM Jean-Baptiste Onofré <[email protected]> wrote:

> Hi,
>
> In order to limit the impact on existing users of the Beam 2.x series,
> I would go for (1).
>
> Regards
> JB
>
> On 06/09/2018 17:24, Alexey Romanenko wrote:
> > Hello everyone,
> >
> > I’d like to discuss the following topic (see below) with the community,
> > since the optimal solution is not clear to me.
> >
> > There is a Java IO module called “/hadoop-input-format/”, which allows
> > using MapReduce InputFormat implementations to read data from different
> > sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
> > As its name suggests, it has only the “Read” part and is missing the
> > “Write” part, so I'm working on “/hadoop-output-format/” to support
> > MapReduce OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>).
> > For this I created another module with this name. So, in the end, we will
> > have two different modules, “/hadoop-input-format/” and
> > “/hadoop-output-format/”, which looks quite strange to me since, afaik,
> > every existing Java IO that we have encapsulates the Read and Write parts
> > in one module. Additionally, we have “/hadoop-common/” and
> > “/hadoop-file-system/” as other Hadoop-related modules.
> >
> > Now I’m thinking about how best to organise all these Hadoop
> > modules. There are several options in my mind:
> >
> > 1) Add a new module “/hadoop-output-format/” and leave all Hadoop modules
> > as they are.
> > Pros: no breaking changes, no additional work
> > Cons: not logical for users to have the same IO in two different modules
> > and with different names.
> >
> > 2) Merge “/hadoop-input-format/” and “/hadoop-output-format/” into one
> > module called, say, “/hadoop-format/” or “/hadoop-mapreduce-format/”,
> > and keep the other Hadoop modules as they are.
> > Pros: InputFormat/OutputFormat live in one IO module, which is logical
> > for users
> > Cons: breaking changes for user code because of module/IO renaming
> >
> > 3) Add a new module “/hadoop-format/” (or “/hadoop-mapreduce-format/”)
> > which will include the new “write” functionality and be a proxy for the
> > old “/hadoop-input-format/”. In turn, “/hadoop-input-format/” should
> > be deprecated and eventually moved into the common “/hadoop-format/”
> > module in future releases. Keep the other Hadoop modules as they are.
> > Pros: eventually there will be only one module for the Hadoop MR formats;
> > changes are less painful for users
> > Cons: hidden difficulties in implementing this strategy; a bit
> > confusing for users
> >
> > 4) Add a new module “/hadoop/” and move all existing modules there
> > as submodules (like we have for “/io/google-cloud-platform/”), and merge
> > “/hadoop-input-format/” and “/hadoop-output-format/” into one module.
> > Pros: unification of all Hadoop-related modules
> > Cons: breaking changes for user code, additional complexity with deps
> > and testing
> >
> > 5) Your suggestion?..
> >
> > My personal preference lies between 2 and 3 (if 3 is possible).
> >
> > I’m wondering if there were similar situations in Beam before and how
> > they were finally resolved. If so, we should probably handle this in a
> > similar way.
> > Any suggestions, advice or comments would be much appreciated.
> >
> > Thanks,
> > Alexey
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
