[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r379559714

## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java

```
@@ -182,8 +183,11 @@ private String getMetadataKey(String action) {
       // read the avro blocks
       while (reader.hasNext()) {
         HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-        // TODO If we can store additional metadata in datablock, we can skip parsing records
-        // (such as startTime, endTime of records in the block)
+        if (isDataOutOfRange(blk, filter)) {
```

Review comment: I added a comment above which I think addresses this as well.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r379559545

## File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlock.java

```
@@ -121,7 +121,7 @@ public long getLogBlockLength() {
    * new enums at the end.
    */
   public enum HeaderMetadataType {
-    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE
+    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE, MIN_INSTANT_TIME, MAX_INSTANT_TIME
```

Review comment: It's not just about them being generic. Think of it this way: we are going to keep dumping MetadataTypes into that one class. Log blocks are written in different parts of the code (archival, actual data, possibly indexes, possibly consolidated metadata), so we will have no idea which header type is used where without searching the entire code base; some abstraction here will be very helpful. I agree with your point about ordinals in nested enums; let's test it out, and if it doesn't work, we should find another abstraction. Let me know if that makes sense.
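The ordinal concern raised above can be checked stand-alone: in Java, a nested enum's constants get their own ordinal sequence starting at 0, so an ordinal-based header encoding cannot distinguish the nested MIN_INSTANT_TIME from the outer INSTANT_TIME unless the two enums are namespaced some other way. A minimal sketch with hypothetical names, mirroring the nested-enum shape proposed in this thread (not Hudi's actual classes):

```java
public class HeaderOrdinalDemo {

  // Hypothetical: mirrors the nested-enum proposal from this review thread.
  public enum HeaderMetadataType {
    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE;

    public enum ArchivedLogHeaderMetadataType {
      MIN_INSTANT_TIME, MAX_INSTANT_TIME
    }
  }

  public static int outerFirstOrdinal() {
    // Ordinals in the outer enum start at 0: INSTANT_TIME is 0.
    return HeaderMetadataType.INSTANT_TIME.ordinal();
  }

  public static int nestedFirstOrdinal() {
    // The nested enum restarts its own sequence: MIN_INSTANT_TIME is also 0.
    return HeaderMetadataType.ArchivedLogHeaderMetadataType.MIN_INSTANT_TIME.ordinal();
  }
}
```

So if headers are keyed by ordinal alone, the two enums collide; the encoding would also need to record which enum a key belongs to.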
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r379558732

## File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java

```
@@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
     return archiveFilePath;
   }

+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
+      HoodieInstant minInstant = instants.get(0);
+      HoodieInstant maxInstant = instants.get(instants.size() - 1);
+      Map<HeaderMetadataType, String> metadataMap = Maps.newHashMap();
+      metadataMap.put(HeaderMetadataType.SCHEMA, wrapperSchema.toString());
+      metadataMap.put(HeaderMetadataType.MIN_INSTANT_TIME, minInstant.getTimestamp());
+      metadataMap.put(HeaderMetadataType.MAX_INSTANT_TIME, maxInstant.getTimestamp());
+      this.writer.appendBlock(new HoodieAvroDataBlock(Collections.emptyList(), metadataMap));
+    }
+  }
+
   private void writeToFile(Schema wrapperSchema, List<IndexedRecord> records) throws Exception {
```

Review comment: You are right that the file is closed after archiving all instants that qualify in that archiving pass. But the next time archival kicks in, it checks whether the archive file has grown to a certain size (say 1 GB); if not, it appends the next archival blocks to the same file.
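The append-until-threshold behavior described above can be sketched as follows. Names and the constant are hypothetical; 1 GB is only the example size from the comment, and the actual threshold in Hudi is configured on the log writer, not shown here:

```java
public class ArchiveRollover {

  // Hypothetical threshold; 1 GB is the example figure from the review comment.
  public static final long MAX_ARCHIVE_FILE_SIZE_BYTES = 1024L * 1024 * 1024;

  /**
   * Decides whether the next archival pass appends to the existing archive
   * file or rolls over to a new one once the file has grown past the threshold.
   */
  public static boolean shouldRollover(long currentFileSizeBytes) {
    return currentFileSizeBytes >= MAX_ARCHIVE_FILE_SIZE_BYTES;
  }
}
```

This is why any per-file min/max metadata is hard to maintain: the file keeps accepting new blocks until it crosses the threshold.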
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378431282

## File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlock.java

```
@@ -121,7 +121,7 @@ public long getLogBlockLength() {
    * new enums at the end.
    */
   public enum HeaderMetadataType {
-    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE
+    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE, MIN_INSTANT_TIME, MAX_INSTANT_TIME
```

Review comment: How about we don't add the MIN_INSTANT_TIME & MAX_INSTANT_TIME to the HeaderMetadataType, but instead create an enum like

```
public enum HeaderMetadataType {
  INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE;

  public enum ArchivedLogHeaderMetadataType {
    MIN_INSTANT_TIME;
  }
}
```

It looks like we are just overloading `HeaderMetadataType` with information that pertains only to archived files, which is going to cause confusion when reading actual data blocks. WDYT?
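The "add new enums at the end" comment in the diff implies headers are encoded by enum ordinal, which is why appending constants is safe but inserting them (or nesting them, which restarts ordinals at 0) needs testing. A stand-alone sketch of that compatibility property, with hypothetical enum names:

```java
public class OrdinalEncodingDemo {

  // Version 1 of the header enum, as it existed before this PR (hypothetical copy).
  public enum V1 { INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE }

  // Version 2 appends the new constants at the end, so every old ordinal is preserved.
  public enum V2 {
    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE,
    MIN_INSTANT_TIME, MAX_INSTANT_TIME
  }

  /** Decode an ordinal written by a V1 writer using the V2 enum. */
  public static V2 decode(int ordinal) {
    return V2.values()[ordinal];
  }
}
```

Because V2 only appends, every ordinal written by an old V1 writer decodes to the same logical constant under V2.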
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378427516

## File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java

```
@@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
     return archiveFilePath;
   }

+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
+      HoodieInstant minInstant = instants.get(0);
+      HoodieInstant maxInstant = instants.get(instants.size() - 1);
+      Map<HeaderMetadataType, String> metadataMap = Maps.newHashMap();
+      metadataMap.put(HeaderMetadataType.SCHEMA, wrapperSchema.toString());
+      metadataMap.put(HeaderMetadataType.MIN_INSTANT_TIME, minInstant.getTimestamp());
+      metadataMap.put(HeaderMetadataType.MAX_INSTANT_TIME, maxInstant.getTimestamp());
+      this.writer.appendBlock(new HoodieAvroDataBlock(Collections.emptyList(), metadataMap));
+    }
+  }
+
   private void writeToFile(Schema wrapperSchema, List<IndexedRecord> records) throws Exception {
```

Review comment: Move the writing of the header to this part: basically, augment the same DataBlock that holds the archived records with the metadata you want to push here. We already write the schema; just add more entries (like above) to the headers. Then you will be able to read each block and filter on whether the block should be considered or not - this is more generic than adding an extra empty log block to track min/max over the entire file (which is hard anyway, since the file keeps growing).
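The suggestion above, carrying min/max instant times in each data block's header map next to the schema, can be sketched stand-alone. String keys and a stand-in Instant class replace Hudi's HeaderMetadataType and HoodieInstant here, and ascending timestamp order is assumed:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockHeaderDemo {

  // Stand-in for HoodieInstant: only the timestamp matters for this sketch.
  public static class Instant {
    public final String timestamp;

    public Instant(String timestamp) {
      this.timestamp = timestamp;
    }
  }

  /** Build the per-block header map carrying the schema plus min/max instant times. */
  public static Map<String, String> buildHeaders(String schema, List<Instant> instants) {
    List<Instant> sorted = new ArrayList<>(instants);
    // Sort ascending by timestamp, so the min is first and the max is last.
    sorted.sort(Comparator.comparing(i -> i.timestamp));
    Map<String, String> headers = new HashMap<>();
    headers.put("SCHEMA", schema);
    headers.put("MIN_INSTANT_TIME", sorted.get(0).timestamp);
    headers.put("MAX_INSTANT_TIME", sorted.get(sorted.size() - 1).timestamp);
    return headers;
  }
}
```

A reader can then inspect these two header entries and decide per block whether to parse the records at all.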
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378425612

## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java

```
@@ -182,8 +183,11 @@ private String getMetadataKey(String action) {
       // read the avro blocks
       while (reader.hasNext()) {
         HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-        // TODO If we can store additional metadata in datablock, we can skip parsing records
-        // (such as startTime, endTime of records in the block)
+        if (isDataOutOfRange(blk, filter)) {
```

Review comment: We only know that the data is out of range for the block, not for the whole file, right? So this should be `continue` instead of `break`?
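The continue-vs-break point can be made concrete: since blocks are appended across multiple archival passes, an out-of-range block can be followed by an in-range one, so the scan must skip the block and keep going. A stand-alone sketch with a hypothetical Block class standing in for a data block's min/max headers:

```java
import java.util.List;

public class BlockFilterDemo {

  // Stand-in for a data block's MIN_INSTANT_TIME / MAX_INSTANT_TIME headers.
  public static class Block {
    public final String min;
    public final String max;

    public Block(String min, String max) {
      this.min = min;
      this.max = max;
    }
  }

  /** Count blocks whose [min, max] range overlaps [start, end]. */
  public static int countInRange(List<Block> blocks, String start, String end) {
    int matched = 0;
    for (Block blk : blocks) {
      // Block ranges are not guaranteed to be ordered across appends, so an
      // out-of-range block must not terminate the scan.
      boolean outOfRange = blk.max.compareTo(start) < 0 || blk.min.compareTo(end) > 0;
      if (outOfRange) {
        continue; // break here would also skip later blocks that may match
      }
      matched++;
    }
    return matched;
  }
}
```

With blocks ("001"-"002"), ("005"-"006"), ("003"-"004") and query range ["003", "004"], the third block matches even though the second is out of range, which `break` would miss.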
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r378424618

## File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java

```
@@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
     return archiveFilePath;
   }

+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
```

Review comment: It is unclear in what order `HoodieInstant.COMPARATOR` sorts the instants; can we add a comment here?
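The fix being requested is a documented ordering contract at the call site. A stand-alone sketch of what such a documented comparator could look like; that the ordering is ascending by timestamp is an assumption for this example, not taken from Hudi's source:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class InstantComparatorDemo {

  // Stand-in for HoodieInstant: only the timestamp matters for this sketch.
  public static class Instant {
    public final String timestamp;

    public Instant(String timestamp) {
      this.timestamp = timestamp;
    }
  }

  /**
   * Orders instants by timestamp ascending, so after sorting, index 0 holds
   * the earliest (min) instant and the last index holds the latest (max).
   * This javadoc is the kind of comment the review asks for.
   */
  public static final Comparator<Instant> COMPARATOR =
      Comparator.comparing(i -> i.timestamp);

  public static String minTimestamp(List<Instant> instants) {
    List<Instant> sorted = new ArrayList<>(instants);
    sorted.sort(COMPARATOR);
    return sorted.get(0).timestamp;
  }
}
```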