[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7021: [Minor] fix multi deser avro payload

2022-11-17 Thread GitBox


alexeykudinkin commented on code in PR #7021:
URL: https://github.com/apache/hudi/pull/7021#discussion_r1025888585


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -364,7 +364,8 @@ private void processAppendResult(AppendResult result, 
List recordL
   updateWriteStatus(stat, result);
 }
 
-if (config.isMetadataColumnStatsIndexEnabled()) {
+// TODO MetadataColumnStatsIndex for spark record
+if (config.isMetadataColumnStatsIndexEnabled() && 
recordMerger.getRecordType() == HoodieRecordType.AVRO) {

Review Comment:
   Let's create a ticket for this. We need to fix this before 0.13



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -215,18 +216,16 @@ private Option 
prepareRecord(HoodieRecord hoodieRecord) {
   // If the format can not record the operation field, nullify the DELETE 
payload manually.
   boolean nullifyPayload = 
HoodieOperation.isDelete(hoodieRecord.getOperation()) && 
!config.allowOperationMetadataField();
   
recordProperties.put(HoodiePayloadProps.PAYLOAD_IS_UPDATE_RECORD_FOR_MOR, 
String.valueOf(isUpdateRecord));
-  Option finalRecord = Option.empty();
-  if (!nullifyPayload && !hoodieRecord.isDelete(tableSchema, 
recordProperties)) {
-if (hoodieRecord.shouldIgnore(tableSchema, recordProperties)) {
-  return Option.of(hoodieRecord);
+  Option finalRecord = nullifyPayload ? Option.empty() : 
Option.of(hoodieRecord.deserialization(tableSchema, recordProperties));
+  // Check for delete
+  if (finalRecord.isPresent() && !finalRecord.get().isDelete(tableSchema, 
recordProperties)) {
+// Check for ignore ExpressionPayload
+if (finalRecord.get().shouldIgnore(tableSchema, recordProperties)) {
+  return finalRecord;

Review Comment:
   This is actually incorrect -- this will delete the record



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7021: [Minor] fix multi deser avro payload

2022-11-16 Thread GitBox


alexeykudinkin commented on code in PR #7021:
URL: https://github.com/apache/hudi/pull/7021#discussion_r1024783255


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecord.java:
##
@@ -133,23 +143,29 @@ public HoodieRecord updateMetadataValues(Schema 
recordSchema, Properties props,
   }
 });
 
-return new HoodieAvroRecord<>(getKey(), new 
RewriteAvroPayload(avroRecordPayload), getOperation());
+return new HoodieAvroRecord<>(getKey(), new 
RewriteAvroPayload(avroRecordPayload), getOperation(), this.currentLocation, 
this.newLocation);
   }
 
   @Override
   public HoodieRecord truncateRecordKey(Schema recordSchema, Properties props, 
String keyFieldName) throws IOException {
 GenericRecord avroRecordPayload = (GenericRecord) 
getData().getInsertValue(recordSchema, props).get();
 avroRecordPayload.put(keyFieldName, StringUtils.EMPTY_STRING);
-return new HoodieAvroRecord<>(getKey(), new 
RewriteAvroPayload(avroRecordPayload), getOperation());
+return new HoodieAvroRecord<>(getKey(), new 
RewriteAvroPayload(avroRecordPayload), getOperation(), this.currentLocation, 
this.newLocation);
   }
 
   @Override
   public boolean isDelete(Schema recordSchema, Properties props) throws 
IOException {
+if (!(data instanceof HoodieAvroInsertValuePayload) && !(data instanceof 
RewriteAvroPayload)) {
+  throw new HoodieException("We should deserialization before calling 
isDelete");

Review Comment:
   Why do we want to enforce this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7021: [Minor] fix multi deser avro payload

2022-10-25 Thread GitBox


alexeykudinkin commented on code in PR #7021:
URL: https://github.com/apache/hudi/pull/7021#discussion_r1004805080


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecord.java:
##
@@ -189,14 +193,37 @@ public Option> getMetadata() {
 
   @Override
   public Option toIndexedRecord(Schema recordSchema, 
Properties props) throws IOException {
-Option avroData = getData().getInsertValue(recordSchema, 
props);
+Option avroData = getCachedDeserializedRecord(recordSchema, 
props);
 if (avroData.isPresent()) {
   return Option.of(new HoodieAvroIndexedRecord(avroData.get()));
 } else {
   return Option.empty();
 }
   }
 
+  private Option getCachedDeserializedRecord(Schema 
recordSchema, Properties props) throws IOException {
+// Check schema identical
+if (this.cachedDeserializedRecord != null && 
this.cachedDeserializedRecord.isPresent()
+&& !compareSchema(cachedDeserializedRecord.get().getSchema(), 
recordSchema)) {
+  this.cachedDeserializedRecord = null;
+}
+if (this.cachedDeserializedRecord == null) {
+  this.cachedDeserializedRecord = this.data.getInsertValue(recordSchema, 
props);
+}
+return this.cachedDeserializedRecord;
+  }
+
+  private static Boolean compareSchema(Schema left, Schema right) {
+if (left == null || right == null) {
+  return false;
+}
+Pair schemaPair = Pair.of(left, right);
+if (!SCHEMA_COMPARE_MAP.containsKey(schemaPair)) {

Review Comment:
   You're right it's O(1), but i'm not talking about asymptotic complexity i'm 
talking about just the cost of looking up in the HashMap: you use `Pair` as a key and that means that we will have to do the equality check on 
the HM entry when we look it up, and in that case it would have to compare 2 
schemas to retrieve the previous result of their comparison which obviously 
doesn't make sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7021: [Minor] fix multi deser avro payload

2022-10-24 Thread GitBox


alexeykudinkin commented on code in PR #7021:
URL: https://github.com/apache/hudi/pull/7021#discussion_r1003672081


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecord.java:
##
@@ -189,14 +193,37 @@ public Option> getMetadata() {
 
   @Override
   public Option toIndexedRecord(Schema recordSchema, 
Properties props) throws IOException {
-Option avroData = getData().getInsertValue(recordSchema, 
props);
+Option avroData = getCachedDeserializedRecord(recordSchema, 
props);
 if (avroData.isPresent()) {
   return Option.of(new HoodieAvroIndexedRecord(avroData.get()));
 } else {
   return Option.empty();
 }
   }
 
+  private Option getCachedDeserializedRecord(Schema 
recordSchema, Properties props) throws IOException {
+// Check schema identical
+if (this.cachedDeserializedRecord != null && 
this.cachedDeserializedRecord.isPresent()
+&& !compareSchema(cachedDeserializedRecord.get().getSchema(), 
recordSchema)) {
+  this.cachedDeserializedRecord = null;
+}
+if (this.cachedDeserializedRecord == null) {
+  this.cachedDeserializedRecord = this.data.getInsertValue(recordSchema, 
props);
+}
+return this.cachedDeserializedRecord;
+  }
+
+  private static Boolean compareSchema(Schema left, Schema right) {
+if (left == null || right == null) {
+  return false;
+}
+Pair schemaPair = Pair.of(left, right);
+if (!SCHEMA_COMPARE_MAP.containsKey(schemaPair)) {

Review Comment:
   I don't think this setup makes sense -- complexity of comparing the 
`Pair(leftSchema, rightSchema)` is def more than just comparing 2 schemas.
   
   Instead, let's just keep `Pair` cached and compare it 
whenever we retrieve



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7021: [Minor] fix multi deser avro payload

2022-10-21 Thread GitBox


alexeykudinkin commented on code in PR #7021:
URL: https://github.com/apache/hudi/pull/7021#discussion_r1002268684


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecord.java:
##
@@ -189,14 +191,21 @@ public Option> getMetadata() {
 
   @Override
   public Option toIndexedRecord(Schema recordSchema, 
Properties props) throws IOException {
-Option avroData = getData().getInsertValue(recordSchema, 
props);
+Option avroData = getDeserData(recordSchema, props);
 if (avroData.isPresent()) {
   return Option.of(new HoodieAvroIndexedRecord(avroData.get()));
 } else {
   return Option.empty();
 }
   }
 
+  private Option getDeserData(Schema recordSchema, Properties 
props) throws IOException {
+if (deserRecord == null) {
+  this.deserRecord = this.data.getInsertValue(recordSchema, props);
+}
+return this.deserRecord;

Review Comment:
   We will also have to do following 
- Keep the schema this record has been deserialized with, and 
- Make sure schemas are identical (comparing schemas w/ equals will kill 
perf-gain so we should compare them as strings)



##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecord.java:
##
@@ -39,6 +39,8 @@
 
 public class HoodieAvroRecord extends 
HoodieRecord {
 
+  private transient Option deserRecord = null;

Review Comment:
   nit: `cachedDeserializedRecord` 



##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecord.java:
##
@@ -189,14 +191,21 @@ public Option> getMetadata() {
 
   @Override
   public Option toIndexedRecord(Schema recordSchema, 
Properties props) throws IOException {
-Option avroData = getData().getInsertValue(recordSchema, 
props);
+Option avroData = getDeserData(recordSchema, props);
 if (avroData.isPresent()) {
   return Option.of(new HoodieAvroIndexedRecord(avroData.get()));
 } else {
   return Option.empty();
 }
   }
 
+  private Option getDeserData(Schema recordSchema, Properties 
props) throws IOException {

Review Comment:
   nit: `getCachedDeserializedRecord`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org