Re: [DISCUSS] Unification of Hadoop related IO modules

Alexey Romanenko Mon, 17 Dec 2018 09:03:47 -0800

Hi all,

I’d like to come back to this topic since most of the work has been done [1] 
and I wish to provide more details of current progress.


We added new module, called “hadoop-format”, which incorporates Read and Write 
parts for using Hadoop Mapreduce Format files. Old module “hadoop-input-format” 
keeps all public user API, but proxies all calls to new module, and will become 
deprecated starting from Beam 2.10. The implementation of “Read” part has moved 
into HadoopFormatIO and “Write" part was written from scratch. Unit tests are 
kept for both modules for the moment to guarantee that there is no regression. 

So, from the user perspective, everything should be as it was before, except 
that old IO becomes deprecated and the users have to migrate to new one after 
release 2.10.

What is left to do:
- Completely remove deprecated “hadoop-input-format” (at LTS or 3.0 release?..) 
[2]
- Add new “hadoop-format” ITs to run on Jenkins [3].

I also wanted to thank David Moravek and David Hrbacek who was working on 
batching/streaming support for “HadoopFormatIO.Write" and helped with review of 
other parts of this IO!
Thank you to Tim Robertson for your reviews as well!


[1] https://issues.apache.org/jira/browse/BEAM-5310 
<https://issues.apache.org/jira/browse/BEAM-5310>
[2] https://issues.apache.org/jira/browse/BEAM-6247 
<https://issues.apache.org/jira/browse/BEAM-6247>
[3] https://issues.apache.org/jira/browse/BEAM-6246 
<https://issues.apache.org/jira/browse/BEAM-6246>


> On 13 Sep 2018, at 20:26, Alexey Romanenko <[email protected]> wrote:
> 
> Robert, Chamikara,
> Yes, I agree that we need to give enough time for that. I’m fine to wait 
> until 3.0
> 
>> On 12 Sep 2018, at 19:27, Chamikara Jayalath <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> +1 for going with option 3.
>> 
>> On Wed, Sep 12, 2018 at 8:51 AM Robert Bradshaw <[email protected] 
>> <mailto:[email protected]>> wrote:
>> On Wed, Sep 12, 2018 at 5:27 PM Alexey Romanenko <[email protected] 
>> <mailto:[email protected]>> wrote:
>> Thank you everybody for your feedback!
>> 
>> I think we can conclude that the most popular option, according to 
>> discussion above, is number 3. Not sure if we need to do a separate vote for 
>> that but, please, let me know if we need.
>> 
>> So, for now, I’d split a work into the following steps:
>> a) Create new module "hadoop-mapreduce-format” which implements support for 
>> MapReduce OutputFormat through new HadoopMapreduceFormat.Write class. For 
>> that, I just need to change a bit my already created PR 6306 
>> <https://github.com/apache/beam/pull/6306> that I added recently (renaming 
>> of module and class names).
>> b) Move all source and test classes of “hadoop-input-format” into the module 
>> "hadoop-mapreduce-format” and create new class HadoopMapreduceFormat.Read 
>> there to support MapReduce InputFormat.
>> c) Make old HadoopInputFormat.Read (in old “hadoop-input-format” module) 
>> deprecated and as proxy class to newly created HadoopMapreduceFormat.Read 
>> (to keep API compatibility)
>> 
>> Sounds like a great plan. 
>>  
>> These 3 steps should be performed and completed within one release cycle 
>> (approx. in 2.8). For steps “b” and “c” I’d create another PR to avoid 
>> having a huge commit if it will include step “a” as well.
>> 
>> Big +1. 
>>  
>> Then, in next release after:
>> d) Remove completely module “hadoop-input-format”  (approx. in 2.9). 
>> 
>> I don't think we'd be able to remove this until 3.0. 
>> 
>> I think we technically we can remove HadoopInputFormat before 3.0 since it's 
>> marked as experimental [1] but I'd suggest keeping it deprecated for at 
>> least two releases (3 months) before removal. Not sure if we have a policy 
>> on this.
>> 
>> [1] 
>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L177
>>  
>> <https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L177>
>> 
>>  
>> 
>> Other two Hadoop modules (common and file-system) we leave as it is.
>> 
>> I hope that this a correct summary of what community decided and I can move 
>> forward. 
>> 
>> Sounds good. 
>>  
>> Please, let me know if there any objections against this plan or other 
>> suggestions.
>> 
>> 
>>> On 11 Sep 2018, at 16:08, Thomas Weise <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> I'm in favor of a combination of 2) and 3): New module 
>>> "hadoop-mapreduce-format" ("hadoop-format" does not sufficiently qualify 
>>> what it is). Turn existing " hadoop-input-format" into a proxy for new 
>>> module for backward compatibility (marked deprecated and removed in next 
>>> major version).
>>> 
>>> I don't think everything "Hadoop" should be merged, purpose and usage is 
>>> just too different. As an example, the Hadoop file system abstraction 
>>> itself has implementation for multiple other systems and is not limited to 
>>> HDFS.
>>> 
>>> On Tue, Sep 11, 2018 at 8:47 AM Alexey Romanenko <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> Dharmendra,
>>> For now, you can’t write with Hadoop MapReduce OutputFormat. However, you 
>>> can use FileIO or TextIO to write to HDFS, these IOs support different file 
>>> systems.
>>> 
>>>> On 11 Sep 2018, at 11:11, dharmendra pratap singh 
>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>> 
>>>> Hello Team,
>>>> Does this mean, as of today we can read from Hadoop FS but can't write to 
>>>> Hadoop FS using Beam HDFS API ?
>>>> 
>>>> Regards
>>>> Dharmendra
>>>> 
>>>> On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> Hello everyone,
>>>> 
>>>> I’d like to discuss the following topic (see below) with community since 
>>>> the optimal solution is not clear for me.
>>>> 
>>>> There is Java IO module, called “hadoop-input-format”, which allows to use 
>>>> MapReduce InputFormat implementations to read data from different sources 
>>>> (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat). According 
>>>> to its name, it has only “Read" and it's missing “Write” part, so, I'm 
>>>> working on “hadoop-output-format” to support MapReduce OutputFormat (PR 
>>>> 6306 <https://github.com/apache/beam/pull/6306>). For this I created 
>>>> another module with this name. So, in the end, we will have two different 
>>>> modules “hadoop-input-format” and “hadoop-output-format” and it looks 
>>>> quite strange for me since, afaik, every existed Java IO, that we have, 
>>>> incapsulates Read and Write parts into one module. Additionally, we have 
>>>> “hadoop-common” and “hadoop-file-system” as other hadoop-related modules. 
>>>> 
>>>> Now I’m thinking how it will be better to organise all these Hadoop 
>>>> modules better. There are several options in my mind: 
>>>> 
>>>> 1) Add new module “hadoop-output-format” and leave all Hadoop modules “as 
>>>> it is”. 
>>>>    Pros: no breaking changes, no additional work 
>>>>    Cons: not logical for users to have the same IO in two different 
>>>> modules and with different names.
>>>> 
>>>> 2) Merge “hadoop-input-format” and “hadoop-output-format” into one module 
>>>> called, say, “hadoop-format” or “hadoop-mapreduce-format”, keep the other 
>>>> Hadoop modules “as it is”.
>>>>    Pros: to have InputFormat/OutputFormat in one IO module which is 
>>>> logical for users
>>>>    Cons: breaking changes for user code because of module/IO renaming 
>>>> 
>>>> 3) Add new module “hadoop-format” (or “hadoop-mapreduce-format”) which 
>>>> will include new “write” functionality and be a proxy for old 
>>>> “hadoop-input-format”. In its turn, “hadoop-input-format” should become 
>>>> deprecated and be finally moved to common “hadoop-format” module in future 
>>>> releases. Keep the other Hadoop modules “as it is”.
>>>>    Pros: finally it will be only one module for hadoop MR format; changes 
>>>> are less painful for user
>>>>    Cons: hidden difficulties of implementation this strategy; a bit 
>>>> confusing for user 
>>>> 
>>>> 4) Add new module “hadoop” and move all already existed modules there as 
>>>> submodules (like we have for “io/google-cloud-platform”), merge 
>>>> “hadoop-input-format” and “hadoop-output-format” into one module. 
>>>>    Pros: unification of all hadoop-related modules
>>>>    Cons: breaking changes for user code, additional complexity with deps 
>>>> and testing
>>>> 
>>>> 5) Your suggestion?..
>>>> 
>>>> My personal preferences are lying between 2 and 3 (if 3 is possible). 
>>>> 
>>>> I’m wondering if there were similar situations in Beam before and how it 
>>>> was finally resolved. If yes then probably we need to do here in similar 
>>>> way.
>>>> Any suggestions/advices/comments would be very appreciated.
>>>> 
>>>> Thanks,
>>>> Alexey
>

Re: [DISCUSS] Unification of Hadoop related IO modules

Reply via email to