Re: [DISCUSS] Unification of Hadoop related IO modules

David Morávek Fri, 07 Sep 2018 12:36:00 -0700

+1 for option 3 as it should be the least painful option for the current users


D.

> On 7 Sep 2018, at 19:50, Tim <[email protected]> wrote:
> 
> Another +1 for option 3 (and a preference for the HadoopFormatIO naming).
> 
> Thanks Alexey,
> 
> Tim
> 
> 
>> On 7 Sep 2018, at 19:13, Andrew Pilloud <[email protected]> wrote:
>> 
>> +1 for option 3. That approach will keep the mapping clean if SQL supports 
>> this IO. It would be good to put the proxy in the old module and move the 
>> implementation now. That way the old module can be easily deleted when the 
>> time comes.
>> 
>> Andrew
>> 
>>> On Fri, Sep 7, 2018 at 6:15 AM Robert Bradshaw <[email protected]> wrote:
>>> OK, good, that's what I thought. So I stick by (3), which:
>>> 
>>> 1) Cleans up the library for all future uses (hopefully the majority of all 
>>> users :). 
>>> 2) Is fully backwards compatible for existing users, minimizing disruption, 
>>> and giving them time to migrate. 
>>> 
>>>> On Fri, Sep 7, 2018 at 2:51 PM Alexey Romanenko <[email protected]> wrote:
>>>> In the next release it will still be compatible, because we keep the 
>>>> “hadoop-input-format” module but deprecate it and propose using it 
>>>> through the “hadoop-format” module and the proxy class HadoopFormatIO (or 
>>>> HadoopMapReduceFormatIO, whatever we name it), which will provide 
>>>> Read/Write functionality using the MapReduce InputFormat and 
>>>> OutputFormat classes. 
>>>> Then, in releases after the next one, we can drop “hadoop-input-format”, 
>>>> since it will have been deprecated and users will have had time to move 
>>>> to the new API. I think this is the least painful way for users but the 
>>>> most complicated for us if the final goal is to merge 
>>>> “hadoop-input-format” and “hadoop-output-format” together.
>>>> 
>>>>> On 7 Sep 2018, at 13:45, Robert Bradshaw <[email protected]> wrote:
>>>>> 
>>>>> Agreed about not impacting users. Perhaps I misread (3); isn't it fully 
>>>>> backwards compatible as well? 
>>>>> 
>>>>> On Fri, Sep 7, 2018 at 1:33 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> In order to limit the impact on existing users of the Beam 2.x series,
>>>>>> I would go for (1).
>>>>>> 
>>>>>> Regards
>>>>>> JB
>>>>>> 
>>>>>> On 06/09/2018 17:24, Alexey Romanenko wrote:
>>>>>> > Hello everyone,
>>>>>> > 
>>>>>> > I’d like to discuss the following topic (see below) with the
>>>>>> > community, since the optimal solution is not clear to me.
>>>>>> > 
>>>>>> > There is a Java IO module called “hadoop-input-format”, which allows
>>>>>> > using MapReduce InputFormat implementations to read data from different
>>>>>> > sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
>>>>>> > As its name suggests, it has only the “Read” part and is missing the
>>>>>> > “Write” part, so I'm working on “hadoop-output-format” to support
>>>>>> > MapReduce OutputFormat (PR 6306
>>>>>> > <https://github.com/apache/beam/pull/6306>). For this I created another
>>>>>> > module with this name. So, in the end, we will have two different
>>>>>> > modules, “hadoop-input-format” and “hadoop-output-format”, which looks
>>>>>> > quite strange to me since, afaik, every existing Java IO that we have
>>>>>> > encapsulates its Read and Write parts in one module. Additionally, we
>>>>>> > have “hadoop-common” and “hadoop-file-system” as other Hadoop-related
>>>>>> > modules.
>>>>>> > 
>>>>>> > Now I’m thinking about how best to organise all these Hadoop
>>>>>> > modules. There are several options in my mind:
>>>>>> > 
>>>>>> > 1) Add a new module “hadoop-output-format” and leave all Hadoop modules
>>>>>> > “as is”.
>>>>>> > Pros: no breaking changes, no additional work
>>>>>> > Cons: not logical for users to have the same IO in two different
>>>>>> > modules with different names.
>>>>>> > 
>>>>>> > 2) Merge “hadoop-input-format” and “hadoop-output-format” into one
>>>>>> > module called, say, “hadoop-format” or “hadoop-mapreduce-format”;
>>>>>> > keep the other Hadoop modules “as is”.
>>>>>> > Pros: InputFormat/OutputFormat live in one IO module, which is logical
>>>>>> > for users
>>>>>> > Cons: breaking changes for user code because of module/IO renaming
>>>>>> > 
>>>>>> > 3) Add a new module “hadoop-format” (or “hadoop-mapreduce-format”)
>>>>>> > which will include the new “write” functionality and be a proxy for the
>>>>>> > old “hadoop-input-format”. In turn, “hadoop-input-format” should
>>>>>> > become deprecated and finally be moved into the common “hadoop-format”
>>>>>> > module in future releases. Keep the other Hadoop modules “as is”.
>>>>>> > Pros: eventually there will be only one module for the Hadoop MR
>>>>>> > format; changes are less painful for users
>>>>>> > Cons: hidden difficulties in implementing this strategy; a bit
>>>>>> > confusing for users
>>>>>> > 
>>>>>> > 4) Add a new module “hadoop” and move all existing modules there
>>>>>> > as submodules (like we have for “io/google-cloud-platform”), and merge
>>>>>> > “hadoop-input-format” and “hadoop-output-format” into one module.
>>>>>> > Pros: unification of all Hadoop-related modules
>>>>>> > Cons: breaking changes for user code, additional complexity with deps
>>>>>> > and testing
>>>>>> > 
>>>>>> > 5) Your suggestion?..
>>>>>> > 
>>>>>> > My personal preference lies between 2 and 3 (if 3 is possible).
>>>>>> > 
>>>>>> > I’m wondering whether there were similar situations in Beam before and
>>>>>> > how they were finally resolved. If so, we should probably handle this
>>>>>> > one in a similar way.
>>>>>> > Any suggestions/advice/comments would be much appreciated.
>>>>>> > 
>>>>>> > Thanks,
>>>>>> > Alexey
>>>>>> 
>>>>>> -- 
>>>>>> Jean-Baptiste Onofré
>>>>>> [email protected]
>>>>>> http://blog.nanthrax.net
>>>>>> Talend - http://www.talend.com
>>>>
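
For readers skimming the thread, here is a minimal sketch of the "deprecated proxy" idea behind option 3, in the direction Andrew suggested (the implementation lives in the new module, the old entry point only delegates). Everything here is an assumption for illustration: the classes are toy stand-ins that borrow the names from the thread, the Supplier/List signatures are hypothetical, and none of this is the actual Beam API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Toy stand-in for the unified IO in the new "hadoop-format" module (hypothetical API). */
final class HadoopFormatIO {
    private HadoopFormatIO() {}

    /** Read side; the Supplier stands in for a MapReduce InputFormat. */
    static <T> Read<T> read(Supplier<List<T>> source) {
        return new Read<>(source);
    }

    /** Write side; the List stands in for a MapReduce OutputFormat sink. */
    static <T> Write<T> write(List<T> sink) {
        return new Write<>(sink);
    }

    static final class Read<T> {
        private final Supplier<List<T>> source;

        Read(Supplier<List<T>> source) {
            this.source = source;
        }

        /** Returns a copy of the records produced by the source. */
        List<T> expand() {
            return new ArrayList<>(source.get());
        }
    }

    static final class Write<T> {
        private final List<T> sink;

        Write(List<T> sink) {
            this.sink = sink;
        }

        /** Appends the given records to the sink. */
        void expand(List<T> records) {
            sink.addAll(records);
        }
    }
}

/**
 * Toy stand-in for the old entry point kept in "hadoop-input-format".
 * It holds no logic of its own, so the old module can be deleted later
 * without touching the implementation.
 */
@Deprecated
final class HadoopInputFormatIO {
    private HadoopInputFormatIO() {}

    /** Deprecated alias: existing user code keeps compiling and gets a warning. */
    @Deprecated
    static <T> HadoopFormatIO.Read<T> read(Supplier<List<T>> source) {
        return HadoopFormatIO.read(source);
    }
}

class ProxySketch {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        Supplier<List<String>> source = () -> Arrays.asList("a", "b");

        // Old call site: unchanged for existing users.
        List<String> viaOld = HadoopInputFormatIO.read(source).expand();
        // New call site: same behaviour, plus the write side.
        List<String> viaNew = HadoopFormatIO.read(source).expand();

        List<String> sink = new ArrayList<>();
        HadoopFormatIO.write(sink).expand(viaNew);

        System.out.println(viaOld.equals(viaNew) && sink.equals(viaOld));
    }
}
```

In this sketch an existing user only sees a deprecation warning at compile time, and migrating means swapping the class name at the call site, which is roughly the trade-off the replies above are weighing against the breaking rename in option 2.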
