Re: [DISCUSS] Unification of Hadoop related IO modules

Robert Bradshaw Fri, 07 Sep 2018 06:15:28 -0700

OK, good, that's what I thought. So I stand by (3), which

1) Cleans up the library for all future uses (hopefully the majority of all
users :).
2) Is fully backwards compatible for existing users, minimizing disruption,
and giving them time to migrate.
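
A minimal plain-Java sketch of the proxy arrangement in (3), to make the compatibility argument concrete. The class bodies are hypothetical stand-ins that return strings, not Beam's actual API: `LegacyHadoopInputFormatIO` plays the role of the existing entry point in "hadoop-input-format", and `HadoopFormatIO` (the name floated in this thread) plays the role of the unified entry point in "hadoop-format".

```java
public class HadoopFormatProxySketch {

    /** Hypothetical stand-in for the existing read-only entry point ("hadoop-input-format"). */
    @Deprecated
    static final class LegacyHadoopInputFormatIO {
        static String read(String inputFormatClass) {
            return "read via " + inputFormatClass;
        }
    }

    /** Hypothetical stand-in for the unified entry point in the new "hadoop-format" module. */
    static final class HadoopFormatIO {
        // Read is a thin proxy: it forwards to the legacy implementation unchanged,
        // so both entry points behave identically during the deprecation period.
        static String read(String inputFormatClass) {
            return LegacyHadoopInputFormatIO.read(inputFormatClass);
        }

        // Write is the new functionality that exists only in the unified module.
        static String write(String outputFormatClass) {
            return "write via " + outputFormatClass;
        }
    }

    public static void main(String[] args) {
        String in = "org.apache.hadoop.mapreduce.lib.db.DBInputFormat";
        // Existing user code keeps working against the deprecated entry point...
        System.out.println(LegacyHadoopInputFormatIO.read(in));
        // ...while new code gets the same read behaviour, plus write, from one place.
        System.out.println(HadoopFormatIO.read(in));
        System.out.println(HadoopFormatIO.write("org.apache.hadoop.mapreduce.lib.db.DBOutputFormat"));
    }
}
```

Because the new read path only delegates, dropping the old module later would mean moving the implementation behind `HadoopFormatIO.read` without changing what callers of the new class see.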


On Fri, Sep 7, 2018 at 2:51 PM Alexey Romanenko <[email protected]>
wrote:

> In the next release it will still be compatible because we keep the
> module “hadoop-input-format”, but we deprecate it and propose using
> it through the module “hadoop-format” and the proxy class HadoopFormatIO (or
> HadoopMapReduceFormatIO, whatever we name it), which will provide Write/Read
> functionality by using MapReduce InputFormat or OutputFormat classes.
> Then, in future releases after the next one, we can drop “hadoop-input-format”
> since it was deprecated and we gave users time to move to the new API. I think
> this is the least painful way for users but the most complicated for us if the
> final goal is to merge “hadoop-input-format” and “hadoop-output-format” together.
>
> On 7 Sep 2018, at 13:45, Robert Bradshaw <[email protected]> wrote:
>
> Agree about not impacting users. Perhaps I misread (3); isn't it fully
> backwards compatible as well?
>
> On Fri, Sep 7, 2018 at 1:33 PM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Hi,
>>
>> In order to limit the impact on existing users of the Beam 2.x series,
>> I would go for (1).
>>
>> Regards
>> JB
>>
>> On 06/09/2018 17:24, Alexey Romanenko wrote:
>> > Hello everyone,
>> >
>> > I’d like to discuss the following topic (see below) with the community
>> > since the optimal solution is not clear to me.
>> >
>> > There is a Java IO module called “/hadoop-input-format/”, which allows
>> > using MapReduce InputFormat implementations to read data from different
>> > sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
>> > As its name suggests, it has only the “Read” part and is missing “Write”,
>> > so I'm working on “/hadoop-output-format/” to support MapReduce
>> > OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>). For
>> > this I created another module with this name. So, in the end, we will
>> > have two different modules, “/hadoop-input-format/” and
>> > “/hadoop-output-format/”, and it looks quite strange to me since, afaik,
>> > every existing Java IO that we have encapsulates the Read and Write parts
>> > in one module. Additionally, we have “/hadoop-common/” and
>> > “/hadoop-file-system/” as other Hadoop-related modules.
>> >
>> > Now I’m thinking about how best to organise all these Hadoop
>> > modules. There are several options in my mind:
>> >
>> > 1) Add a new module “/hadoop-output-format/” and leave all Hadoop modules
>> > “as is”.
>> > Pros: no breaking changes, no additional work
>> > Cons: not logical for users to have the same IO in two different modules
>> > and with different names.
>> >
>> > 2) Merge “/hadoop-input-format/” and “/hadoop-output-format/” into one
>> > module called, say, “/hadoop-format/” or “/hadoop-mapreduce-format/”;
>> > keep the other Hadoop modules “as is”.
>> > Pros: InputFormat/OutputFormat live in one IO module, which is logical
>> > for users
>> > Cons: breaking changes for user code because of module/IO renaming
>> >
>> > 3) Add a new module “/hadoop-format/” (or “/hadoop-mapreduce-format/”)
>> > which will include the new “write” functionality and be a proxy for the old
>> > “/hadoop-input-format/”. In turn, “/hadoop-input-format/” should
>> > become deprecated and eventually be moved into the common “/hadoop-format/”
>> > module in future releases. Keep the other Hadoop modules “as is”.
>> > Pros: eventually there will be only one module for Hadoop MR format;
>> > changes are less painful for users
>> > Cons: hidden difficulties in implementing this strategy; a bit
>> > confusing for users
>> >
>> > 4) Add a new module “/hadoop/” and move all existing modules there
>> > as submodules (like we have for “/io/google-cloud-platform/”); merge
>> > “/hadoop-input-format/” and “/hadoop-output-format/” into one module.
>> > Pros: unification of all Hadoop-related modules
>> > Cons: breaking changes for user code, additional complexity with deps
>> > and testing
>> >
>> > 5) Your suggestion?..
>> >
>> > My personal preference lies between 2 and 3 (if 3 is possible).
>> >
>> > I’m wondering if there were similar situations in Beam before and how
>> > they were finally resolved. If so, then we should probably handle this
>> > in a similar way.
>> > Any suggestions, advice or comments would be much appreciated.
>> >
>> > Thanks,
>> > Alexey
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
