+1 for going with option 3.

On Wed, Sep 12, 2018 at 8:51 AM Robert Bradshaw wrote:
> On Wed, Sep 12, 2018 at 5:27 PM Alexey Romanenko wrote:
>
>> Thank you everybody for your feedback!
>>
>> I think we can conclude that the most popular option, according to the
>> discussion above, is number 3. Not sure if we need a separate vote
>> for that but, please, let me know if we do.
>>
>> So, for now, I’d split the work into the following steps:
>> a) Create a new module “*hadoop-mapreduce-format*” which implements
>> support for MapReduce OutputFormat through a new
>> *HadoopMapreduceFormat.Write* class. For that, I just need to slightly
>> change PR 6306 <https://github.com/apache/beam/pull/6306>, which I added
>> recently (renaming of module and class names).
>> b) Move all source and test classes of “*hadoop-input-format*” into the
>> module “*hadoop-mapreduce-format*” and create a new class
>> *HadoopMapreduceFormat.Read* there to support MapReduce InputFormat.
>> c) Deprecate the old *HadoopInputFormat.Read* (in the old
>> “*hadoop-input-format*” module) and make it a proxy class for the newly
>> created *HadoopMapreduceFormat.Read* (to keep API compatibility).
>>
>
> Sounds like a great plan.
>
>> These 3 steps should be performed and completed within one release cycle
>> (approx. in 2.8). For steps “b” and “c” I’d create another PR to avoid
>> the huge commit that would result from including step “a” as well.
>>
>
> Big +1.
>
>> Then, in the next release after:
>> d) Remove the “*hadoop-input-format*” module completely (approx. in 2.9).
>>
>
> I don't think we'd be able to remove this until 3.0.
>

I think we technically can remove HadoopInputFormat before 3.0 since it's
marked as experimental [1], but I'd suggest keeping it deprecated for at
least two releases (3 months) before removal. Not sure if we have a policy
on this.

[1] https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L177

>> The other two Hadoop modules (*common* and *file-system*) we leave as is.
>>
>> I hope that this is a correct summary of what the community decided and
>> that I can move forward.
>>
>
> Sounds good.
>
>> Please, let me know if there are any objections to this plan or other
>> suggestions.
>>
>>
>> On 11 Sep 2018, at 16:08, Thomas Weise wrote:
>>
>> I'm in favor of a combination of 2) and 3): New module
>> "hadoop-mapreduce-format" ("hadoop-format" does not sufficiently qualify
>> what it is). Turn the existing "hadoop-input-format" into a proxy for the
>> new module for backward compatibility (marked deprecated and removed in
>> the next major version).
>>
>> I don't think everything "Hadoop" should be merged; purpose and usage are
>> just too different. As an example, the Hadoop file system abstraction
>> itself has implementations for multiple other systems and is not limited
>> to HDFS.
>>
>> On Tue, Sep 11, 2018 at 8:47 AM Alexey Romanenko wrote:
>>
>>> Dharmendra,
>>> For now, you can’t write with Hadoop MapReduce OutputFormat. However,
>>> you can use FileIO or TextIO to write to HDFS; these IOs support
>>> different file systems.
>>>
>>> On 11 Sep 2018, at 11:11, dharmendra pratap singh wrote:
>>>
>>> Hello Team,
>>> Does this mean that, as of today, we can read from Hadoop FS but can't
>>> write to Hadoop FS using the Beam HDFS API?
>>>
>>> Regards
>>> Dharmendra
>>>
>>> On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I’d like to discuss the following topic (see below) with the community
>>>> since the optimal solution is not clear to me.
>>>>
>>>> There is a Java IO module, called “*hadoop-input-format*”, which allows
>>>> using MapReduce InputFormat implementations to read data from different
>>>> sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
>>>> As its name suggests, it has only “Read” and is missing the “Write” part,
>>>> so I'm working on “*hadoop-output-format*” to support MapReduce
>>>> OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>). For
>>>> this I created another module with this name. So, in the end, we will
>>>> have two different modules, “*hadoop-input-format*” and
>>>> “*hadoop-output-format*”, and it looks quite strange to me since,
>>>> afaik, every existing Java IO that we have encapsulates the Read and
>>>> Write parts in one module. Additionally, we have “*hadoop-common*” and
>>>> “*hadoop-file-system*” as other Hadoop-related modules.
>>>>
>>>> Now I’m thinking about how best to organise all these Hadoop
>>>> modules. There are several options in my mind:
>>>>
>>>> 1) Add a new module “*hadoop-output-format*” and leave all Hadoop
>>>> modules “as is”.
>>>> Pros: no breaking changes, no additional work
>>>> Cons: not logical for users to have the same IO in two different
>>>> modules and with different names.
>>>>
>>>> 2) Merge “*hadoop-input-format*” and “*hadoop-output-format*” into one
>>>> module called, say, “*hadoop-format*” or “*hadoop-mapreduce-format*”,
>>>> keep the other Hadoop modules “as is”.
>>>> Pros: InputFormat/OutputFormat live in one IO module, which is
>>>> logical for users
>>>> Cons: breaking changes for user code because of module/IO renaming
>>>>
>>>> 3) Add a new module “*hadoop-format*” (or “*hadoop-mapreduce-format*”)
>>>> which will include the new “write” functionality and be a proxy for the
>>>> old “*hadoop-input-format*”. In turn, “*hadoop-input-format*” should
>>>> become deprecated and finally be moved to the common “*hadoop-format*”
>>>> module in future releases.
>>>> Keep the other Hadoop modules “as is”.
>>>> Pros: in the end there will be only one module for the Hadoop MR format;
>>>> changes are less painful for users
>>>> Cons: hidden difficulties in implementing this strategy; a bit
>>>> confusing for users
>>>>
>>>> 4) Add a new module “*hadoop*” and move all existing modules
>>>> there as submodules (like we have for “*io/google-cloud-platform*”),
>>>> merge “*hadoop-input-format*” and “*hadoop-output-format*” into one
>>>> module.
>>>> Pros: unification of all Hadoop-related modules
>>>> Cons: breaking changes for user code, additional complexity with deps
>>>> and testing
>>>>
>>>> 5) Your suggestion?..
>>>>
>>>> My personal preference lies between 2 and 3 (if 3 is possible).
>>>>
>>>> I’m wondering if there were similar situations in Beam before and how
>>>> they were finally resolved. If so, we should probably handle this in a
>>>> similar way.
>>>> Any suggestions/advice/comments would be much appreciated.
>>>>
>>>> Thanks,
>>>> Alexey
