[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-14 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r379559714
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
 ##
 @@ -182,8 +183,11 @@ private String getMetadataKey(String action) {
     //read the avro blocks
     while (reader.hasNext()) {
       HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-      // TODO If we can store additional metadata in datablock, we can skip parsing records
-      // (such as startTime, endTime of records in the block)
+      if (isDataOutOfRange(blk, filter)) {
 
 Review comment:
   I added a comment above which I think addresses this as well




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-14 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r379559545
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlock.java
 ##
 @@ -121,7 +121,7 @@ public long getLogBlockLength() {
* new enums at the end.
*/
   public enum HeaderMetadataType {
-    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE
+    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE, MIN_INSTANT_TIME, MAX_INSTANT_TIME
 
 Review comment:
   It's not about them being just generic; think of it this way - we are going to start dumping MetadataTypes into that one class. Different log blocks are written in different parts of the code (archival, actual data, maybe indexes, maybe consolidated metadata), and we will have no idea which header type is used where without looking through the entire code base, so having some abstraction here will be very helpful. I agree with your point about ordinals in nested enums; let's test it out, and if it doesn't work, we should find another abstraction. Let me know if that makes sense.
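   One way to read the ordinal concern (a minimal, illustrative sketch, not code from the PR; the nested ArchivedLogHeaderMetadataType mirrors the proposal elsewhere in this thread): a nested enum starts its own ordinal sequence at 0, so any header section keyed purely by ordinal could not tell the two enums apart without also recording which enum a key came from.
   ```
   public class OrdinalOverlapSketch {
     public enum HeaderMetadataType {
       INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE;

       // Hypothetical nested enum for archived-file-only headers.
       public enum ArchivedLogHeaderMetadataType {
         MIN_INSTANT_TIME, MAX_INSTANT_TIME
       }
     }

     public static void main(String[] args) {
       // Both lines print 0: INSTANT_TIME and MIN_INSTANT_TIME share an ordinal,
       // which matters if header keys are ever serialized by ordinal.
       System.out.println(HeaderMetadataType.INSTANT_TIME.ordinal());
       System.out.println(HeaderMetadataType.ArchivedLogHeaderMetadataType.MIN_INSTANT_TIME.ordinal());
     }
   }
   ```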




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-14 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r379558732
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
 return archiveFilePath;
   }
 
+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
+      HoodieInstant minInstant = instants.get(0);
+      HoodieInstant maxInstant = instants.get(instants.size() - 1);
+      Map<HeaderMetadataType, String> metadataMap = Maps.newHashMap();
+      metadataMap.put(HeaderMetadataType.SCHEMA, wrapperSchema.toString());
+      metadataMap.put(HeaderMetadataType.MIN_INSTANT_TIME, minInstant.getTimestamp());
+      metadataMap.put(HeaderMetadataType.MAX_INSTANT_TIME, maxInstant.getTimestamp());
+      this.writer.appendBlock(new HoodieAvroDataBlock(Collections.emptyList(), metadataMap));
+    }
+  }
+
   private void writeToFile(Schema wrapperSchema, List<IndexedRecord> records) throws Exception {
 
 Review comment:
   You are right that the file is closed after archiving all the instants that qualify in that archiving run. But the next time archival kicks in, it will check whether the archive file has grown past a certain size (say 1 GB); if not, it will append the next archival blocks to the same file.
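   A minimal sketch of that reuse check (illustrative only; the threshold value and helper name are assumptions, not Hudi APIs - only the Hadoop FileSystem/FileStatus calls are real):
   ```
   import java.io.IOException;

   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   // Illustrative helper: keep appending archival blocks to the existing archive file
   // until it grows past a size threshold (say 1 GB), then roll over to a new file.
   class ArchiveFileRolloverSketch {
     private static final long MAX_ARCHIVE_FILE_SIZE_BYTES = 1024L * 1024 * 1024; // ~1 GB, assumed threshold

     static boolean shouldAppendToExistingFile(FileSystem fs, Path archiveFilePath) throws IOException {
       if (!fs.exists(archiveFilePath)) {
         return false; // no archive file yet, a new one will be created
       }
       FileStatus status = fs.getFileStatus(archiveFilePath);
       // Below the threshold: append the next archival blocks to the same file.
       return status.getLen() < MAX_ARCHIVE_FILE_SIZE_BYTES;
     }
   }
   ```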




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378431282
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlock.java
 ##
 @@ -121,7 +121,7 @@ public long getLogBlockLength() {
* new enums at the end.
*/
   public enum HeaderMetadataType {
-    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE
+    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE, MIN_INSTANT_TIME, MAX_INSTANT_TIME
 
 Review comment:
   How about we don't add MIN_INSTANT_TIME & MAX_INSTANT_TIME to HeaderMetadataType, but instead create an enum like
   ```
   public enum HeaderMetadataType {
     INSTANT_TIME,
     TARGET_INSTANT_TIME,
     SCHEMA,
     COMMAND_BLOCK_TYPE;

     public enum ArchivedLogHeaderMetadataType {
       MIN_INSTANT_TIME;
     }
   }
   ```
   It looks like we are just overloading `HeaderMetadataType` with information that pertains only to archived files, which is going to cause confusion when reading actual data blocks. WDYT?




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378427516
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
 return archiveFilePath;
   }
 
+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
+      HoodieInstant minInstant = instants.get(0);
+      HoodieInstant maxInstant = instants.get(instants.size() - 1);
+      Map<HeaderMetadataType, String> metadataMap = Maps.newHashMap();
+      metadataMap.put(HeaderMetadataType.SCHEMA, wrapperSchema.toString());
+      metadataMap.put(HeaderMetadataType.MIN_INSTANT_TIME, minInstant.getTimestamp());
+      metadataMap.put(HeaderMetadataType.MAX_INSTANT_TIME, maxInstant.getTimestamp());
+      this.writer.appendBlock(new HoodieAvroDataBlock(Collections.emptyList(), metadataMap));
+    }
+  }
+
   private void writeToFile(Schema wrapperSchema, List<IndexedRecord> records) throws Exception {
 
 Review comment:
   Move the writing of the header to this part. Basically, augment the same DataBlock that has the archived records with the metadata information you want to push here; we already write the schema, so just add more entries (like above) to the headers here. Then you will be able to read each block and filter on whether the block should be considered or not - this is more generic than adding an extra empty log block to track min/max over the entire file (which is hard anyway, since the file keeps growing).
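   A rough sketch of that suggestion (illustrative only, not the merged code; it assumes the surrounding HoodieCommitArchiveLog context - this.writer, Maps, HeaderMetadataType - and passes hypothetical min/max instant times in from the caller):
   ```
   // Sketch: put MIN/MAX_INSTANT_TIME on the same data block that carries the
   // archived records, instead of writing a separate header-only block.
   private void writeToFile(Schema wrapperSchema, List<IndexedRecord> records,
                            String minInstantTime, String maxInstantTime) throws Exception {
     if (records.isEmpty()) {
       return;
     }
     Map<HeaderMetadataType, String> header = Maps.newHashMap();
     header.put(HeaderMetadataType.SCHEMA, wrapperSchema.toString());
     header.put(HeaderMetadataType.MIN_INSTANT_TIME, minInstantTime);
     header.put(HeaderMetadataType.MAX_INSTANT_TIME, maxInstantTime);
     // Readers can then look at each block's header and skip the block when its
     // [min, max] range falls outside the requested timeline window.
     this.writer.appendBlock(new HoodieAvroDataBlock(records, header));
   }
   ```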




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378425612
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
 ##
 @@ -182,8 +183,11 @@ private String getMetadataKey(String action) {
     //read the avro blocks
     while (reader.hasNext()) {
       HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-      // TODO If we can store additional metadata in datablock, we can skip parsing records
-      // (such as startTime, endTime of records in the block)
+      if (isDataOutOfRange(blk, filter)) {
 
 Review comment:
   We only know that the data is out of range for the block, not for the whole file, right? So this should be a `continue` instead of a `break`?
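   A minimal sketch of the loop with a per-block skip (illustrative; it assumes isDataOutOfRange keeps the signature shown in the diff above):
   ```
   // Skip only the current block when its min/max instant range does not overlap
   // the filter; later blocks in the same archive file may still be in range.
   while (reader.hasNext()) {
     HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
     if (isDataOutOfRange(blk, filter)) {
       continue; // skip this block, keep scanning the rest of the file
     }
     // ... parse the records in this block as before ...
   }
   ```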




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-12 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378424618
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
 return archiveFilePath;
   }
 
+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
 
 Review comment:
   It is unclear in what order HoodieInstant.COMPARATOR sorts the hoodie instants; can we put a comment here?
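   For example, something along these lines (the wording assumes COMPARATOR orders instants by timestamp ascending, which is worth confirming against HoodieInstant):
   ```
   // Assumed ordering: COMPARATOR sorts by instant timestamp ascending, so after the
   // sort instants.get(0) is the earliest instant (MIN_INSTANT_TIME) and
   // instants.get(instants.size() - 1) is the latest (MAX_INSTANT_TIME).
   Collections.sort(instants, HoodieInstant.COMPARATOR);
   ```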

