[DISCUSS] Unification of Hadoop related IO modules

Alexey Romanenko Thu, 06 Sep 2018 08:25:20 -0700

Hello everyone,

I’d like to discuss the following topic with the community, since the optimal 
solution is not clear to me.


There is a Java IO module called “hadoop-input-format”, which allows using 
MapReduce InputFormat implementations to read data from different sources (for 
example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat). As its name 
suggests, it provides only the “Read” part and lacks the “Write” part, so I’m 
working on “hadoop-output-format” to support MapReduce OutputFormat (PR 6306 
<https://github.com/apache/beam/pull/6306>). For this I created another module 
with that name. So, in the end, we will have two different modules, 
“hadoop-input-format” and “hadoop-output-format”, which looks quite strange to 
me since, afaik, every existing Java IO that we have encapsulates the Read and 
Write parts in one module. Additionally, we have “hadoop-common” and 
“hadoop-file-system” as other Hadoop-related modules.

Now I’m thinking about how best to organise all these Hadoop modules. I have 
several options in mind:

1) Add a new module “hadoop-output-format” and leave all Hadoop modules “as 
is”.
        Pros: no breaking changes, no additional work
        Cons: not logical for users to have the same IO split across two 
different modules with different names.

2) Merge “hadoop-input-format” and “hadoop-output-format” into one module 
called, say, “hadoop-format” or “hadoop-mapreduce-format”, and keep the other 
Hadoop modules “as is”.
        Pros: InputFormat/OutputFormat live in one IO module, which is 
logical for users
        Cons: breaking changes for user code because of the module/IO renaming

3) Add a new module “hadoop-format” (or “hadoop-mapreduce-format”) which will 
include the new “write” functionality and act as a proxy for the old 
“hadoop-input-format”. In turn, “hadoop-input-format” would be deprecated and 
eventually moved into the common “hadoop-format” module in a future release. 
Keep the other Hadoop modules “as is”.
        Pros: in the end there will be only one module for the Hadoop MR 
formats; the changes are less painful for users
        Cons: possible hidden difficulties in implementing this strategy; a 
bit confusing for users

4) Add a new module “hadoop” and move all existing modules there as 
submodules (like we have for “io/google-cloud-platform”), and merge 
“hadoop-input-format” and “hadoop-output-format” into one module.
        Pros: unification of all Hadoop-related modules
        Cons: breaking changes for user code, additional complexity with 
dependencies and testing

5) Your suggestion?..

My personal preference lies between options 2 and 3 (if 3 is feasible).

I’m wondering whether there have been similar situations in Beam before and 
how they were finally resolved. If so, we should probably handle this one in a 
similar way. Any suggestions, advice or comments would be much appreciated.

Thanks,
Alexey
