guilload opened a new pull request #1104:
URL: https://github.com/apache/iceberg/pull/1104


   Hello,
   
   This is another attempt at implementing a MR v1 input format (mapred) for 
Iceberg. For context, when I started working on this PR, #933 had been inactive 
for about a month. There's been new activity since then, but since I'm finished 
I thought I'd still push this branch to offer an alternative.
   
   In this PR, I've tried to address the main concerns raised in #933, mostly 
about reusing the input format, record reader, and split classes implemented 
for MR v2.
   
   I've also modified the input format test suite to be able to run against 
both MR input formats.
   
   A Hive input format can easily be built on top of this MR v1 input format. 
The `IcebergSplit` class can be wrapped into a `FileSplit` if necessary (see a 
comment from @massdosage on #933). I believe the nice test suite from #933 
could also be reused:
   
   ```java
   public class IcebergWritable extends Container<Record> {}
   
   public class IcebergInputFormat extends MapredIcebergInputFormat<IcebergWritable, Record>
           implements CombineHiveInputFormat.AvoidSplitCombination {
   
       @Override
       public boolean shouldSkipCombine(Path path, Configuration conf) {
           return true;
       }
   
   }
   ```
   
   The PR is split into multiple commits:
   
   Refactor:
   bd4c090: Move ConfigBuilder and InMemoryDataModel out of IcebergInputFormat
   ae01439: Move IcebergRecordReader and IcebergSplit out of IcebergInputFormat
   4805ead: Refactor TestIcebergInputFormat, mostly factoring out duplicate code
   
   Rename:
   547ebad: Rename TestIcebergInputFormat class: `TestIcebergInputFormat` -> 
`TestIcebergInputFormatS`
   
   Feature:
   4253ef9: Implement MR v1 (mapred) input format, wrapping the v2 classes and 
introducing a `Container` class to deal with the cumbersome MR v1 API 
(`createValue`)
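   
   To illustrate why a container helps here: in the v1 API the framework asks 
   the reader for a value object once (`createValue`) and then passes that same 
   object to every `next` call, so the reader needs something mutable to put 
   each record into. Below is a minimal, Hadoop-free sketch of that idea. 
   `V1StyleReader` and `ContainerSketch` are hypothetical names used only for 
   illustration, the iterator stands in for the wrapped v2 record reader, and 
   the `Container` class in this branch may differ in its details.
   
   ```java
   import java.util.Arrays;
   import java.util.Iterator;
   
   // Mutable holder: allocated once, then refilled with a new record on each read.
   class Container<T> {
       private T value;
   
       public T get() {
           return value;
       }
   
       public void set(T newValue) {
           this.value = newValue;
       }
   }
   
   // Hypothetical stand-in that mimics the shape of the v1 reader contract
   // (createValue + next) without depending on Hadoop classes.
   class V1StyleReader<T> {
       private final Iterator<T> records; // assumption: stands in for the v2 reader
   
       V1StyleReader(Iterator<T> records) {
           this.records = records;
       }
   
       // Called once by the caller; the returned object is reused for every record.
       Container<T> createValue() {
           return new Container<>();
       }
   
       // Fills the caller-supplied container; returns false when input is exhausted.
       boolean next(Container<T> value) {
           if (!records.hasNext()) {
               return false;
           }
           value.set(records.next());
           return true;
       }
   }
   
   class ContainerSketch {
       public static void main(String[] args) {
           V1StyleReader<String> reader =
                   new V1StyleReader<>(Arrays.asList("row-1", "row-2").iterator());
           Container<String> value = reader.createValue();
           while (reader.next(value)) {
               System.out.println(value.get());
           }
       }
   }
   ```
   
   The same container instance is handed back on every iteration, which is 
   what lets an immutable record type be served through an API that expects to 
   mutate its value in place.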
   
   @rdblue @rdsr 
   @cmathiesen @massdosage 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


