Re: [DISCUSS] Unification of Hadoop related IO modules

Chamikara Jayalath Wed, 12 Sep 2018 10:28:08 -0700

+1 for going with option 3.

On Wed, Sep 12, 2018 at 8:51 AM Robert Bradshaw <[email protected]> wrote:


> On Wed, Sep 12, 2018 at 5:27 PM Alexey Romanenko <[email protected]>
> wrote:
>
>> Thank you everybody for your feedback!
>>
>> I think we can conclude that the most popular option, according to the
>> discussion above, is number 3. Not sure if we need to do a separate vote
>> for that, but please let me know if we do.
>>
>> So, for now, I’d split the work into the following steps:
>> a) Create a new module “*hadoop-mapreduce-format*” which implements
>> support for MapReduce OutputFormat through a new
>> *HadoopMapreduceFormat.Write* class. For that, I just need to slightly
>> change my already created PR 6306
>> <https://github.com/apache/beam/pull/6306> (renaming the module and
>> class names).
>> b) Move all source and test classes of “hadoop-input-format” into the
>> module “hadoop-mapreduce-format” and create a new class
>> *HadoopMapreduceFormat.Read* there to support MapReduce InputFormat.
>> c) Make the old *HadoopInputFormat.Read* (in the old
>> “*hadoop-input-format*” module) deprecated and a proxy for the newly
>> created *HadoopMapreduceFormat.Read* (to keep API compatibility).
>>
>
> Sounds like a great plan.
>
>
>> These 3 steps should be performed and completed within one release cycle
>> (approx. in 2.8). For steps “b” and “c” I’d create another PR, to avoid
>> one huge commit that also includes step “a”.
>>
>
> Big +1.
>
>
>> Then, in the next release after that:
>> d) Completely remove the module “hadoop-input-format” (approx. in 2.9).
>>
>
> I don't think we'd be able to remove this until 3.0.
>

I think we can technically remove HadoopInputFormat before 3.0, since
it's marked as experimental [1], but I'd suggest keeping it deprecated for
at least two releases (3 months) before removal. Not sure if we have a
policy on this.

[1]
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L177
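
For illustration, the deprecate-then-proxy approach from step c) could look
roughly like the sketch below. This is only a minimal, self-contained sketch:
the class names follow the ones proposed in this thread, but the bodies, the
expand() method and the returned records are hypothetical stand-ins and not
Beam's actual API.

```java
import java.util.Arrays;
import java.util.List;

public class DeprecatedProxySketch {

    // Stand-in for the proposed new class (hypothetical body, not Beam code).
    static class HadoopMapreduceFormat {
        static Read read() {
            return new Read();
        }

        static class Read {
            // Placeholder for the real read logic.
            List<String> expand() {
                return Arrays.asList("record-1", "record-2");
            }
        }
    }

    /** @deprecated kept only for API compatibility; use HadoopMapreduceFormat. */
    @Deprecated
    static class HadoopInputFormat {
        static Read read() {
            return new Read();
        }

        /** @deprecated thin proxy that forwards to HadoopMapreduceFormat.Read. */
        @Deprecated
        static class Read {
            private final HadoopMapreduceFormat.Read delegate =
                HadoopMapreduceFormat.read();

            // Old entry point: no logic of its own, just delegation.
            List<String> expand() {
                return delegate.expand();
            }
        }
    }

    public static void main(String[] args) {
        // Existing user code keeps compiling and gets the new behaviour.
        System.out.println(HadoopInputFormat.read().expand());
    }
}
```

With this shape, existing callers only see a deprecation warning, and removing
the old class later would not change behaviour for anyone who has migrated.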



>>
>> Other two Hadoop modules (*common* and *file-system*) we leave as is.
>>
>> I hope that this is a correct summary of what the community decided and
>> that I can move forward.
>>
>
> Sounds good.
>
>
>> Please let me know if there are any objections to this plan or other
>> suggestions.
>>
>>
>> On 11 Sep 2018, at 16:08, Thomas Weise <[email protected]> wrote:
>>
>> I'm in favor of a combination of 2) and 3): New module
>> "hadoop-mapreduce-format" ("hadoop-format" does not sufficiently qualify
>> what it is). Turn the existing "hadoop-input-format" into a proxy for the
>> new module for backward compatibility (marked deprecated and removed in
>> the next major version).
>>
>> I don't think everything "Hadoop" should be merged; purpose and usage are
>> just too different. As an example, the Hadoop file system abstraction
>> itself has implementations for multiple other systems and is not limited
>> to HDFS.
>>
>> On Tue, Sep 11, 2018 at 8:47 AM Alexey Romanenko <
>> [email protected]> wrote:
>>
>>> Dharmendra,
>>> For now, you can’t write with Hadoop MapReduce OutputFormat. However,
>>> you can use FileIO or TextIO to write to HDFS; these IOs support
>>> different file systems.
>>>
>>> On 11 Sep 2018, at 11:11, dharmendra pratap singh <
>>> [email protected]> wrote:
>>>
>>> Hello Team,
>>> Does this mean, as of today we can read from Hadoop FS but can't write
>>> to Hadoop FS using Beam HDFS API ?
>>>
>>> Regards
>>> Dharmendra
>>>
>>> On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko <
>>> [email protected]> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I’d like to discuss the following topic (see below) with the community
>>>> since the optimal solution is not clear to me.
>>>>
>>>> There is a Java IO module, called “*hadoop-input-format*”, which allows
>>>> using MapReduce InputFormat implementations to read data from different
>>>> sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
>>>> As its name suggests, it has only a “Read” part and is missing a “Write”
>>>> part, so I'm working on “*hadoop-output-format*” to support MapReduce
>>>> OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>). For
>>>> this I created another module with this name. So, in the end, we will
>>>> have two different modules, “*hadoop-input-format*” and
>>>> “*hadoop-output-format*”, which looks quite strange to me since, afaik,
>>>> every existing Java IO that we have encapsulates the Read and Write
>>>> parts in one module. Additionally, we have “*hadoop-common*” and
>>>> “*hadoop-file-system*” as other Hadoop-related modules.
>>>>
>>>> Now I’m thinking about how best to organise all these Hadoop modules.
>>>> There are several options in my mind:
>>>>
>>>> 1) Add new module “*hadoop-output-format*” and leave all Hadoop
>>>> modules “as it is”.
>>>> Pros: no breaking changes, no additional work
>>>> Cons: not logical for users to have the same IO in two different
>>>> modules and with different names.
>>>>
>>>> 2) Merge “*hadoop-input-format*” and “*hadoop-output-format*” into one
>>>> module called, say, “*hadoop-format*” or “*hadoop-mapreduce-format*”,
>>>> keep the other Hadoop modules “as it is”.
>>>> Pros: to have InputFormat/OutputFormat in one IO module which is
>>>> logical for users
>>>> Cons: breaking changes for user code because of module/IO renaming
>>>>
>>>> 3) Add a new module “*hadoop-format*” (or “*hadoop-mapreduce-format*”)
>>>> which will include the new “write” functionality and be a proxy for the
>>>> old “*hadoop-input-format*”. In turn, “*hadoop-input-format*” should
>>>> become deprecated and finally be moved to the common “*hadoop-format*”
>>>> module in future releases. Keep the other Hadoop modules “as it is”.
>>>> Pros: in the end there will be only one module for Hadoop MR format;
>>>> changes are less painful for users
>>>> Cons: hidden difficulties in implementing this strategy; a bit
>>>> confusing for users
>>>>
>>>> 4) Add a new module “*hadoop*” and move all existing modules
>>>> there as submodules (like we have for “*io/google-cloud-platform*”),
>>>> merge “*hadoop-input-format*” and “*hadoop-output-format*” into one
>>>> module.
>>>> Pros: unification of all hadoop-related modules
>>>> Cons: breaking changes for user code, additional complexity with deps
>>>> and testing
>>>>
>>>> 5) Your suggestion?..
>>>>
>>>> My personal preference lies between 2 and 3 (if 3 is possible).
>>>>
>>>> I’m wondering whether there were similar situations in Beam before and
>>>> how they were finally resolved. If so, we should probably handle this
>>>> one in a similar way.
>>>> Any suggestions/advice/comments would be much appreciated.
>>>>
>>>> Thanks,
>>>> Alexey
>>>>
>>>
>>>
>>
