[GitHub] [hudi] leoyy0316 commented on issue #5932: can not delete data when use spark scala code

2022-06-22 Thread GitBox


leoyy0316 commented on issue #5932:
URL: https://github.com/apache/hudi/issues/5932#issuecomment-1164023728

   Spark caching causes this issue, thanks @YannByron. But why? Can you briefly explain? Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rmahindra123 commented on issue #5893: [SUPPORT] Hudi write commit failing with PostgresDebeziumSource, SchemaRegistryProvider and PostgresDebeziumAvroPayload

2022-06-22 Thread GitBox


rmahindra123 commented on issue #5893:
URL: https://github.com/apache/hudi/issues/5893#issuecomment-1164022965

   @BalaMahesh For the Debezium pipeline, please do not configure the schema registry provider; the schema is only fetched within the source. Please refer to the spark-submit command at the end of this blog: https://hudi.apache.org/cn/blog/2022/01/14/change-data-capture-with-debezium-and-apache-hudi/. Let me know if you still face issues.
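   For reference, that spark-submit invocation has roughly the shape below (a sketch reproduced from memory of the linked post, not the exact command; the jar path, table name, and registry URL are placeholders, so verify the flag names against the blog):

   ```
   spark-submit \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer <hudi-utilities-bundle.jar> \
     --table-type MERGE_ON_READ \
     --source-class org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource \
     --payload-class org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.url=http://<registry-host>:8081/subjects/<topic>-value/versions/latest \
     --target-table <table>
   # note: no --schemaprovider-class flag; the schema is fetched inside the source
   ```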


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on issue #5932: can not delete data when use spark scala code

2022-06-22 Thread GitBox


YannByron commented on issue #5932:
URL: https://github.com/apache/hudi/issues/5932#issuecomment-1164013327

   @leoyy0316 quit the current Spark session after you finish the DML operations, to determine whether Spark caching is what leads to this issue.
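   For context, a minimal sketch of the staleness being suspected (the table name `t` and the predicate are hypothetical; whether Hudi's SQL DELETE invalidates Spark's cache in this version is exactly what the experiment above probes):

   ```
   import org.apache.spark.sql.SparkSession;

   public class CacheStalenessCheck {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder().appName("cache-check").getOrCreate();

       spark.sql("CACHE TABLE t");                       // subsequent reads of t may be served from the cache
       spark.sql("DELETE FROM t WHERE id = 1");          // removes the row in storage
       spark.sql("SELECT * FROM t WHERE id = 1").show(); // may still show the row if served from the stale cache

       spark.catalog().refreshTable("t");                // invalidates cached data and metadata for t
       spark.sql("SELECT * FROM t WHERE id = 1").show(); // forces a re-read from storage
     }
   }
   ```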


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904628542


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieInternalRowUtils.scala:
##
@@ -0,0 +1,300 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import java.nio.charset.StandardCharsets
+import java.util
+import java.util.concurrent.ConcurrentHashMap
+import org.apache.avro.Schema
+import org.apache.hudi.AvroConversionUtils
+import org.apache.hudi.avro.HoodieAvroUtils.{createFullName, fromJavaDate, toJavaDate}
+import org.apache.hudi.common.model.HoodieRecord.HoodieMetadataField
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, JoinedRow, MutableProjection, Projection}
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, ArrayData, GenericArrayData, MapData}
+import org.apache.spark.sql.hudi.ColumnStatsExpressionUtils.AllowedTransformationExpression.exprUtils.generateMutableProjection
+import org.apache.spark.sql.types._
+import scala.collection.mutable
+
+
+object HoodieInternalRowUtils {
+
+  val projectionMap = new ConcurrentHashMap[(StructType, StructType), MutableProjection]
+  val schemaMap = new ConcurrentHashMap[Schema, StructType]
+  val SchemaPosMap = new ConcurrentHashMap[StructType, Map[String, (StructField, Int)]]
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#stitchRecords(org.apache.avro.generic.GenericRecord, org.apache.avro.generic.GenericRecord, org.apache.avro.Schema)
+   */
+  def stitchRecords(left: InternalRow, leftSchema: StructType, right: InternalRow, rightSchema: StructType, stitchedSchema: StructType): InternalRow = {
+    val mergeSchema = StructType(leftSchema.fields ++ rightSchema.fields)
+    val row = new JoinedRow(left, right)
+    val projection = getCacheProjection(mergeSchema, stitchedSchema)
+    projection(row)
+  }
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#rewriteRecord(org.apache.avro.generic.GenericRecord, org.apache.avro.Schema)
+   */
+  def rewriteRecord(oldRecord: InternalRow, oldSchema: StructType, newSchema: StructType): InternalRow = {
+    val newRow = new GenericInternalRow(Array.fill(newSchema.fields.length)(null).asInstanceOf[Array[Any]])
+
+    val oldFieldMap = getCacheSchemaPosMap(oldSchema)
+    for ((field, pos) <- newSchema.fields.zipWithIndex) {
+      var oldValue: AnyRef = null
+      if (oldFieldMap.contains(field.name)) {
+        val (oldField, oldPos) = oldFieldMap(field.name)
+        oldValue = oldRecord.get(oldPos, oldField.dataType)
+      }
+      if (oldValue != null) {
+        field.dataType match {
+          case structType: StructType =>
+            val oldField = oldFieldMap(field.name)._1.asInstanceOf[StructType]
+            rewriteRecord(oldValue.asInstanceOf[InternalRow], oldField, structType)
+          case decimalType: DecimalType =>
+            val oldField = oldFieldMap(field.name)._1.asInstanceOf[DecimalType]
+            if (decimalType.scale != oldField.scale || decimalType.precision != oldField.precision) {
+              newRow.update(pos, Decimal.fromDecimal(oldValue.asInstanceOf[Decimal].toBigDecimal.setScale(newSchema.asInstanceOf[DecimalType].scale)))
+            } else {
+              newRow.update(pos, oldValue)
+            }
+          case _ =>
+            newRow.update(pos, oldValue)
+        }
+      } else {
+        // TODO default value in newSchema
+      }
+    }
+
+    newRow
+  }
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#rewriteRecordWithNewSchema(org.apache.avro.generic.IndexedRecord, org.apache.avro.Schema, java.util.Map)
+   */
+  def rewriteRecordWithNewSchema(oldRecord: InternalRow, oldSchema: StructType, newSchema: StructType, renameCols: util.Map[String, String]): InternalRow = {
+    rewriteRecordWithNewSchema(oldRecord, oldSchema, newSchema, renameCols, new util.LinkedList[String]).asInstanceOf[InternalRow]
+  }
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#rewriteRecordWithNewSchema(java.lang.Object, org.apach

[GitHub] [hudi] hudi-bot commented on pull request #5890: [HUDI-4273] Support inline schedule clustering for Flink stream

2022-06-22 Thread GitBox


hudi-bot commented on PR #5890:
URL: https://github.com/apache/hudi/pull/5890#issuecomment-1164000382

   
   ## CI report:
   
   * 08e1fa6f7820b82180d3c0352c1f92f2b4fe2c6a UNKNOWN
   * 7ae535a205aa373c1967b6dcc752ec4ee67a551c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9463)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5907: [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeseria…

2022-06-22 Thread GitBox


hudi-bot commented on PR #5907:
URL: https://github.com/apache/hudi/pull/5907#issuecomment-1163944480

   
   ## CI report:
   
   * 0738dc06bdabbd3838d819d921a1122729e36c7f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9385)
 
   * 38a841fd028140021c4f3d4e77aedf1ed749a5e8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9464)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904529468


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieInternalRowUtils.scala:
##
@@ -0,0 +1,300 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import java.nio.charset.StandardCharsets
+import java.util
+import java.util.concurrent.ConcurrentHashMap
+import org.apache.avro.Schema
+import org.apache.hudi.AvroConversionUtils
+import org.apache.hudi.avro.HoodieAvroUtils.{createFullName, fromJavaDate, toJavaDate}
+import org.apache.hudi.common.model.HoodieRecord.HoodieMetadataField
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, JoinedRow, MutableProjection, Projection}
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, ArrayData, GenericArrayData, MapData}
+import org.apache.spark.sql.hudi.ColumnStatsExpressionUtils.AllowedTransformationExpression.exprUtils.generateMutableProjection
+import org.apache.spark.sql.types._
+import scala.collection.mutable
+
+
+object HoodieInternalRowUtils {
+
+  val projectionMap = new ConcurrentHashMap[(StructType, StructType), MutableProjection]
+  val schemaMap = new ConcurrentHashMap[Schema, StructType]
+  val SchemaPosMap = new ConcurrentHashMap[StructType, Map[String, (StructField, Int)]]
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#stitchRecords(org.apache.avro.generic.GenericRecord, org.apache.avro.generic.GenericRecord, org.apache.avro.Schema)
+   */
+  def stitchRecords(left: InternalRow, leftSchema: StructType, right: InternalRow, rightSchema: StructType, stitchedSchema: StructType): InternalRow = {
+    val mergeSchema = StructType(leftSchema.fields ++ rightSchema.fields)
+    val row = new JoinedRow(left, right)
+    val projection = getCacheProjection(mergeSchema, stitchedSchema)
+    projection(row)
+  }
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#rewriteRecord(org.apache.avro.generic.GenericRecord, org.apache.avro.Schema)
+   */
+  def rewriteRecord(oldRecord: InternalRow, oldSchema: StructType, newSchema: StructType): InternalRow = {
+    val newRow = new GenericInternalRow(Array.fill(newSchema.fields.length)(null).asInstanceOf[Array[Any]])
+
+    val oldFieldMap = getCacheSchemaPosMap(oldSchema)
+    for ((field, pos) <- newSchema.fields.zipWithIndex) {
+      var oldValue: AnyRef = null
+      if (oldFieldMap.contains(field.name)) {
+        val (oldField, oldPos) = oldFieldMap(field.name)
+        oldValue = oldRecord.get(oldPos, oldField.dataType)
+      }
+      if (oldValue != null) {
+        field.dataType match {
+          case structType: StructType =>
+            val oldField = oldFieldMap(field.name)._1.asInstanceOf[StructType]
+            rewriteRecord(oldValue.asInstanceOf[InternalRow], oldField, structType)
+          case decimalType: DecimalType =>
+            val oldField = oldFieldMap(field.name)._1.asInstanceOf[DecimalType]
+            if (decimalType.scale != oldField.scale || decimalType.precision != oldField.precision) {
+              newRow.update(pos, Decimal.fromDecimal(oldValue.asInstanceOf[Decimal].toBigDecimal.setScale(newSchema.asInstanceOf[DecimalType].scale)))
+            } else {
+              newRow.update(pos, oldValue)
+            }
+          case _ =>
+            newRow.update(pos, oldValue)
+        }
+      } else {
+        // TODO default value in newSchema
+      }
+    }
+
+    newRow
+  }
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#rewriteRecordWithNewSchema(org.apache.avro.generic.IndexedRecord, org.apache.avro.Schema, java.util.Map)
+   */
+  def rewriteRecordWithNewSchema(oldRecord: InternalRow, oldSchema: StructType, newSchema: StructType, renameCols: util.Map[String, String]): InternalRow = {
+    rewriteRecordWithNewSchema(oldRecord, oldSchema, newSchema, renameCols, new util.LinkedList[String]).asInstanceOf[InternalRow]
+  }
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#rewriteRecordWithNewSchema(java.lang.Object, org.apach

[GitHub] [hudi] hudi-bot commented on pull request #5907: [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeseria…

2022-06-22 Thread GitBox


hudi-bot commented on PR #5907:
URL: https://github.com/apache/hudi/pull/5907#issuecomment-1163941979

   
   ## CI report:
   
   * 0738dc06bdabbd3838d819d921a1122729e36c7f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9385)
 
   * 38a841fd028140021c4f3d4e77aedf1ed749a5e8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wzx140 commented on pull request #5907: [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeseria…

2022-06-22 Thread GitBox


wzx140 commented on PR #5907:
URL: https://github.com/apache/hudi/pull/5907#issuecomment-1163940471

   @xushiyan I added a unit test for it. The problem is that the position in the ByteBuffer is not reset to zero.
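   A minimal sketch of the underlying behavior (plain java.nio semantics, independent of Hudi): a relative ByteBuffer#get advances the position, so a second consumer sees an empty buffer unless the position is rewound.

   ```
   import java.nio.ByteBuffer;

   public class RewindDemo {
     public static void main(String[] args) {
       ByteBuffer buf = ByteBuffer.wrap(new byte[] {1, 2, 3});

       byte[] first = new byte[3];
       buf.get(first);                      // relative get: advances position to 3 (the limit)

       System.out.println(buf.remaining()); // 0 -> a second read would see no bytes

       buf.rewind();                        // reset position to 0, keep the limit
       byte[] second = new byte[3];
       buf.get(second);                     // reads the same 3 bytes again
       System.out.println(second[2]);       // 3
     }
   }
   ```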


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904523462


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieInternalRowUtils.scala:
##
@@ -0,0 +1,300 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import java.nio.charset.StandardCharsets
+import java.util
+import java.util.concurrent.ConcurrentHashMap
+import org.apache.avro.Schema
+import org.apache.hudi.AvroConversionUtils
+import org.apache.hudi.avro.HoodieAvroUtils.{createFullName, fromJavaDate, toJavaDate}
+import org.apache.hudi.common.model.HoodieRecord.HoodieMetadataField
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, JoinedRow, MutableProjection, Projection}
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, ArrayData, GenericArrayData, MapData}
+import org.apache.spark.sql.hudi.ColumnStatsExpressionUtils.AllowedTransformationExpression.exprUtils.generateMutableProjection
+import org.apache.spark.sql.types._
+import scala.collection.mutable
+
+
+object HoodieInternalRowUtils {
+
+  val projectionMap = new ConcurrentHashMap[(StructType, StructType), MutableProjection]
+  val schemaMap = new ConcurrentHashMap[Schema, StructType]
+  val SchemaPosMap = new ConcurrentHashMap[StructType, Map[String, (StructField, Int)]]
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#stitchRecords(org.apache.avro.generic.GenericRecord, org.apache.avro.generic.GenericRecord, org.apache.avro.Schema)
+   */
+  def stitchRecords(left: InternalRow, leftSchema: StructType, right: InternalRow, rightSchema: StructType, stitchedSchema: StructType): InternalRow = {
+    val mergeSchema = StructType(leftSchema.fields ++ rightSchema.fields)
+    val row = new JoinedRow(left, right)
+    val projection = getCacheProjection(mergeSchema, stitchedSchema)
Review Comment:
   done
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5890: [HUDI-4273] Support inline schedule clustering for Flink stream

2022-06-22 Thread GitBox


hudi-bot commented on PR #5890:
URL: https://github.com/apache/hudi/pull/5890#issuecomment-1163933113

   
   ## CI report:
   
   * 08e1fa6f7820b82180d3c0352c1f92f2b4fe2c6a UNKNOWN
   * 8ce4693d9abc58cf40530dabdb32f8b2f2526865 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9462)
 
   * 7ae535a205aa373c1967b6dcc752ec4ee67a551c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9463)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5890: [HUDI-4273] Support inline schedule clustering for Flink stream

2022-06-22 Thread GitBox


hudi-bot commented on PR #5890:
URL: https://github.com/apache/hudi/pull/5890#issuecomment-1163930584

   
   ## CI report:
   
   * 08e1fa6f7820b82180d3c0352c1f92f2b4fe2c6a UNKNOWN
   * 8ce4693d9abc58cf40530dabdb32f8b2f2526865 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9462)
 
   * 7ae535a205aa373c1967b6dcc752ec4ee67a551c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wzx140 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wzx140 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904515630


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieSparkRecordMerge.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi;
+
+import org.apache.avro.Schema;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieMerge;
+import org.apache.hudi.common.util.Option;
+
+import java.io.IOException;
+import java.util.Properties;
+
+public class HoodieSparkRecordMerge implements HoodieMerge {
+
+  @Override
+  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
+    if (older.getData() == null) {
+      // use natural order for delete record
+      return older;
+    }
+    if (older.getOrderingValue().compareTo(newer.getOrderingValue()) > 0) {
+      return older;
+    } else {
+      return newer;
+    }
+  }
+
+  @Override
+  public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+    return Option.of(newer);
+  }

Review Comment:
   Yes, it is finished in step 3.
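   For readers following along, a standalone mirror of the precedence rule in the `preCombine` above (not Hudi API, just the decision logic): a delete record (null data) on the older side wins by natural order; otherwise the larger ordering value wins, with ties going to the newer record.

   ```
   public class PreCombineRule {
     // Mirrors the rule in HoodieSparkRecordMerge.preCombine for two ordering values.
     static <T extends Comparable<T>> String pick(T olderOrdering, T newerOrdering, boolean olderIsDelete) {
       if (olderIsDelete) {
         return "older"; // natural order for delete records
       }
       return olderOrdering.compareTo(newerOrdering) > 0 ? "older" : "newer";
     }

     public static void main(String[] args) {
       System.out.println(pick(5, 3, false)); // older (larger ordering value wins)
       System.out.println(pick(3, 3, false)); // newer (ties go to the newer record)
     }
   }
   ```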



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wzx140 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wzx140 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904514756


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieInternalRowUtils.scala:
##
@@ -0,0 +1,300 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import java.nio.charset.StandardCharsets
+import java.util
+import java.util.concurrent.ConcurrentHashMap
+import org.apache.avro.Schema
+import org.apache.hudi.AvroConversionUtils
+import org.apache.hudi.avro.HoodieAvroUtils.{createFullName, fromJavaDate, toJavaDate}
+import org.apache.hudi.common.model.HoodieRecord.HoodieMetadataField
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, JoinedRow, MutableProjection, Projection}
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, ArrayData, GenericArrayData, MapData}
+import org.apache.spark.sql.hudi.ColumnStatsExpressionUtils.AllowedTransformationExpression.exprUtils.generateMutableProjection
+import org.apache.spark.sql.types._
+import scala.collection.mutable
+
+
+object HoodieInternalRowUtils {
+
+  val projectionMap = new ConcurrentHashMap[(StructType, StructType), MutableProjection]
+  val schemaMap = new ConcurrentHashMap[Schema, StructType]
+  val SchemaPosMap = new ConcurrentHashMap[StructType, Map[String, (StructField, Int)]]
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#stitchRecords(org.apache.avro.generic.GenericRecord, org.apache.avro.generic.GenericRecord, org.apache.avro.Schema)
+   */
+  def stitchRecords(left: InternalRow, leftSchema: StructType, right: InternalRow, rightSchema: StructType, stitchedSchema: StructType): InternalRow = {
+    val mergeSchema = StructType(leftSchema.fields ++ rightSchema.fields)
+    val row = new JoinedRow(left, right)
+    val projection = getCacheProjection(mergeSchema, stitchedSchema)
+    projection(row)
+  }
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#rewriteRecord(org.apache.avro.generic.GenericRecord, org.apache.avro.Schema)
+   */
+  def rewriteRecord(oldRecord: InternalRow, oldSchema: StructType, newSchema: StructType): InternalRow = {
+    val newRow = new GenericInternalRow(Array.fill(newSchema.fields.length)(null).asInstanceOf[Array[Any]])
+
+    val oldFieldMap = getCacheSchemaPosMap(oldSchema)
+    for ((field, pos) <- newSchema.fields.zipWithIndex) {
+      var oldValue: AnyRef = null
+      if (oldFieldMap.contains(field.name)) {
+        val (oldField, oldPos) = oldFieldMap(field.name)
+        oldValue = oldRecord.get(oldPos, oldField.dataType)
+      }
+      if (oldValue != null) {
+        field.dataType match {
+          case structType: StructType =>
+            val oldField = oldFieldMap(field.name)._1.asInstanceOf[StructType]
+            rewriteRecord(oldValue.asInstanceOf[InternalRow], oldField, structType)
+          case decimalType: DecimalType =>
+            val oldField = oldFieldMap(field.name)._1.asInstanceOf[DecimalType]
+            if (decimalType.scale != oldField.scale || decimalType.precision != oldField.precision) {
+              newRow.update(pos, Decimal.fromDecimal(oldValue.asInstanceOf[Decimal].toBigDecimal.setScale(newSchema.asInstanceOf[DecimalType].scale)))
+            } else {
+              newRow.update(pos, oldValue)
+            }
+          case _ =>
+            newRow.update(pos, oldValue)
+        }
+      } else {
+        // TODO default value in newSchema
+      }
+    }
+
+    newRow
+  }
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#rewriteRecordWithNewSchema(org.apache.avro.generic.IndexedRecord, org.apache.avro.Schema, java.util.Map)
+   */
+  def rewriteRecordWithNewSchema(oldRecord: InternalRow, oldSchema: StructType, newSchema: StructType, renameCols: util.Map[String, String]): InternalRow = {
+    rewriteRecordWithNewSchema(oldRecord, oldSchema, newSchema, renameCols, new util.LinkedList[String]).asInstanceOf[InternalRow]
+  }
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#rewriteRecordWithNewSchema(java.lang.Object, org.apache.a

[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904511545


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieInternalRowUtils.scala:
##
@@ -0,0 +1,300 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import java.nio.charset.StandardCharsets
+import java.util
+import java.util.concurrent.ConcurrentHashMap
+import org.apache.avro.Schema
+import org.apache.hudi.AvroConversionUtils
+import org.apache.hudi.avro.HoodieAvroUtils.{createFullName, fromJavaDate, toJavaDate}
+import org.apache.hudi.common.model.HoodieRecord.HoodieMetadataField
+import org.apache.hudi.exception.HoodieException
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, JoinedRow, MutableProjection, Projection}
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, ArrayData, GenericArrayData, MapData}
+import org.apache.spark.sql.hudi.ColumnStatsExpressionUtils.AllowedTransformationExpression.exprUtils.generateMutableProjection
+import org.apache.spark.sql.types._
+import scala.collection.mutable
+
+
+object HoodieInternalRowUtils {
+
+  val projectionMap = new ConcurrentHashMap[(StructType, StructType), MutableProjection]
+  val schemaMap = new ConcurrentHashMap[Schema, StructType]
+  val SchemaPosMap = new ConcurrentHashMap[StructType, Map[String, (StructField, Int)]]
+
+  /**
+   * @see org.apache.hudi.avro.HoodieAvroUtils#stitchRecords(org.apache.avro.generic.GenericRecord, org.apache.avro.generic.GenericRecord, org.apache.avro.Schema)
Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] LinMingQiang commented on issue #5934: When reading the mor table with `QUERY_TYPE_SNAPSHOT`,Unable to correctly sort and de duplicate data by `PRECOMBINE_FIELD`.

2022-06-22 Thread GitBox


LinMingQiang commented on issue #5934:
URL: https://github.com/apache/hudi/issues/5934#issuecomment-1163918227

   By default Flink SQL uses EventTimeAvroPayload. `MergeIterator.mergeRowWithLog` calls `record.getData().combineAndGetUpdateValue(historyAvroRecord, tableSchema)` instead of `record.getData().combineAndGetUpdateValue(historyAvroRecord, tableSchema, payloadConf)`, so the final call is `OverwriteWithLatestAvroPayload.combineAndGetUpdateValue(IndexedRecord, Schema)`.
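   For reference, the two overloads in question look roughly like this (a paraphrased sketch of the relevant parts of HoodieRecordPayload, not the full interface):

   ```
   import java.io.IOException;
   import java.util.Properties;

   import org.apache.avro.Schema;
   import org.apache.avro.generic.IndexedRecord;
   import org.apache.hudi.common.util.Option;

   public interface PayloadOverloadsSketch {
     Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException;

     // The three-argument variant is a default method that forwards to the two-argument
     // one unless a payload such as EventTimeAvroPayload overrides it; invoking the
     // two-argument overload directly bypasses the properties that carry the
     // event-time/ordering field configuration.
     default Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, Properties properties) throws IOException {
       return combineAndGetUpdateValue(currentValue, schema);
     }
   }
   ```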
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904506695


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java:
##
@@ -753,7 +761,9 @@ private Option mergeRowWithLog(
       String curKey) throws IOException {
     final HoodieAvroRecord record = (HoodieAvroRecord) scanner.getRecords().get(curKey);
     GenericRecord historyAvroRecord = (GenericRecord) rowDataToAvroConverter.convert(tableSchema, curRow);
-    return record.getData().combineAndGetUpdateValue(historyAvroRecord, tableSchema);
+    // TODO IndexedRecord to HoodieRecord

Review Comment:
   The `TODO` was there because `HoodieFlinkRecord` had not been implemented yet; currently the Avro record is still used.
   I will remove this `TODO` first and wait for the later implementation of `HoodieFlinkRecord`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904505296


##
hudi-common/src/main/java/org/apache/hudi/common/util/ReflectionUtils.java:
##
@@ -61,6 +63,10 @@ public static Class getClass(String clazzName) {
     return CLAZZ_CACHE.get(clazzName);
   }
 
+  private static Object getInstance(String clazzName) {
+    return INSTANCE_CACHE.get(clazzName);
+  }

Review Comment:
   method removed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904504762


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecord.java:
##
@@ -169,15 +180,21 @@ public HoodieOperation getOperation() {
     return operation;
   }
 
+  public Comparable getOrderingValue() {
+    if (null == orderingVal) {
+      // default natural order is 0
+      return 0;
+    }
+    return orderingVal;

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904502980


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerge.java:
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.metadata.HoodieMetadataPayload;
+
+import java.io.IOException;
+import java.util.Properties;
+
+import static org.apache.hudi.TypeUtils.unsafeCast;
+
+public class HoodieAvroRecordMerge implements HoodieMerge {
+  @Override
+  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
+    HoodieRecordPayload picked = unsafeCast(((HoodieAvroRecord) newer).getData().preCombine(((HoodieAvroRecord) older).getData()));
+    if (picked instanceof HoodieMetadataPayload) {
+      // NOTE: HoodieMetadataPayload return a new payload
+      return new HoodieAvroRecord(newer.getKey(), ((HoodieMetadataPayload) picked), newer.getOperation());
+    }
+    return picked.equals(((HoodieAvroRecord) newer).getData()) ? newer : older;
+  }
+
+  @Override
+  public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+    Option<IndexedRecord> previousRecordAvroPayload;
+    if (older instanceof HoodieAvroIndexedRecord) {
+      previousRecordAvroPayload = Option.of(((HoodieAvroIndexedRecord) older).getData());

Review Comment:
   `getData()` does not return null for now, but I agree with using `ofNullable()`.
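   Hudi's `Option` mirrors `java.util.Optional` here; a tiny demo of the distinction being agreed on, using the JDK type so it runs standalone:

   ```
   import java.util.Optional;

   public class OfNullableDemo {
     public static void main(String[] args) {
       String data = null;                      // stand-in for older.getData()
       // Optional.of(data) would throw a NullPointerException on null;
       // ofNullable maps null to an empty container instead.
       Optional<String> payload = Optional.ofNullable(data);
       System.out.println(payload.isPresent()); // false
     }
   }
   ```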



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904502478


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerge.java:
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.metadata.HoodieMetadataPayload;
+
+import java.io.IOException;
+import java.util.Properties;
+
+import static org.apache.hudi.TypeUtils.unsafeCast;
+
+public class HoodieAvroRecordMerge implements HoodieMerge {
+  @Override
+  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
+    HoodieRecordPayload picked = unsafeCast(((HoodieAvroRecord) newer).getData().preCombine(((HoodieAvroRecord) older).getData()));
+    if (picked instanceof HoodieMetadataPayload) {
+      // NOTE: HoodieMetadataPayload return a new payload
+      return new HoodieAvroRecord(newer.getKey(), ((HoodieMetadataPayload) picked), newer.getOperation());
+    }
+    return picked.equals(((HoodieAvroRecord) newer).getData()) ? newer : older;
+  }
+
+  @Override
+  public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+    Option<IndexedRecord> previousRecordAvroPayload;
+    if (older instanceof HoodieAvroIndexedRecord) {
+      previousRecordAvroPayload = Option.of(((HoodieAvroIndexedRecord) older).getData());

Review Comment:
   `getData()` does not return null.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904502220


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerge.java:
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.metadata.HoodieMetadataPayload;
+
+import java.io.IOException;
+import java.util.Properties;
+
+import static org.apache.hudi.TypeUtils.unsafeCast;
+
+public class HoodieAvroRecordMerge implements HoodieMerge {
+  @Override
+  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
+    HoodieRecordPayload picked = unsafeCast(((HoodieAvroRecord) newer).getData().preCombine(((HoodieAvroRecord) older).getData()));
+    if (picked instanceof HoodieMetadataPayload) {
+      // NOTE: HoodieMetadataPayload return a new payload
+      return new HoodieAvroRecord(newer.getKey(), ((HoodieMetadataPayload) picked), newer.getOperation());
+    }
+    return picked.equals(((HoodieAvroRecord) newer).getData()) ? newer : older;
+  }
+
+  @Override
+  public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+    Option<IndexedRecord> previousRecordAvroPayload;
+    if (older instanceof HoodieAvroIndexedRecord) {
+      previousRecordAvroPayload = Option.of(((HoodieAvroIndexedRecord) older).getData());
+    } else {
+      if (null == props) {
+        previousRecordAvroPayload = ((HoodieRecordPayload) older.getData()).getInsertValue(schema);

Review Comment:
   Nice advice



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amit-ranjan-de commented on issue #5916: [SUPPORT] `show fsview latest` throwing IllegalStateException...pending compactions for merge_on_read table

2022-06-22 Thread GitBox


amit-ranjan-de commented on issue #5916:
URL: https://github.com/apache/hudi/issues/5916#issuecomment-1163906822

   Sure! Sharing it below
   
   ```
   aws s3 ls s3:///final_storage/hudi//data/.hoodie | sort -n
   ```
   
   ```
  PRE archived/
  PRE .aux/
  PRE .temp/
   2021-05-25 09:47:45  0 archived_$folder$
   2021-05-25 09:47:45  0 .aux_$folder$
   2021-05-25 09:47:45  0 .temp_$folder$
   2021-06-16 09:30:28  0 20210616092918.rollback.inflight
   2021-06-16 09:30:28 169110 20210616092918.rollback
   2021-07-06 09:12:37  0 20210706091200.rollback.inflight
   2021-07-06 09:12:37  60609 20210706091200.rollback
   2021-07-15 04:31:28  0 20210715043025.rollback.inflight
   2021-07-15 04:31:28 188963 20210715043025.rollback
   2021-08-07 00:20:48  0 20210807002047.rollback.inflight
   2021-08-07 00:20:48   1478 20210807002047.rollback
   2021-08-07 00:22:21  0 20210807002219.rollback.inflight
   2021-08-07 00:22:21   1478 20210807002219.rollback
   2021-08-07 00:26:42  0 20210807002640.rollback.inflight
   2021-08-07 00:26:42   1478 20210807002640.rollback
   2021-08-07 00:29:46  0 20210807002915.rollback.inflight
   2021-08-07 00:29:46  43216 20210807002915.rollback
   2021-08-19 02:35:09  0 20210819023345.rollback.inflight
   2021-08-19 02:35:09 217161 20210819023345.rollback
   2021-08-19 22:57:50  0 20210819225712.rollback.inflight
   2021-08-19 22:57:50  55866 20210819225712.rollback
   2021-08-20 00:02:26  0 2021082225.rollback.inflight
   2021-08-20 00:02:26   1478 2021082225.rollback
   2022-05-19 09:26:05345 hoodie.properties
   2022-06-20 23:34:25 416096 20220620232152.clean.requested
   2022-06-20 23:35:19 416096 20220620232152.clean.inflight
   2022-06-20 23:36:38 399176 20220620232152.clean
   2022-06-21 01:01:35 411099 20220621004846.clean.requested
   2022-06-21 01:02:42 411099 20220621004846.clean.inflight
   2022-06-21 01:04:01 394012 20220621004846.clean
   2022-06-21 02:35:30 409841 2022062101.clean.requested
   2022-06-21 02:36:53 409841 2022062101.clean.inflight
   2022-06-21 02:39:00 391956 2022062101.clean
   2022-06-21 04:39:02 415768 20220621042506.clean.requested
   2022-06-21 04:40:13 415768 20220621042506.clean.inflight
   2022-06-21 04:49:30 396450 20220621042506.clean
   2022-06-21 13:25:59  0 20220621132503.rollback.inflight
   2022-06-21 14:34:51 783199 20220621142459.clean.inflight
   2022-06-21 14:34:51 783199 20220621142459.clean.requested
   2022-06-21 14:35:05 699602 20220621142459.clean
   2022-06-21 15:30:39 429737 20220621152121.clean.inflight
   2022-06-21 15:30:39 429737 20220621152121.clean.requested
   2022-06-21 15:30:48 408748 20220621152121.clean
   2022-06-21 16:27:31 413866 20220621161828.clean.inflight
   2022-06-21 16:27:31 413866 20220621161828.clean.requested
   2022-06-21 16:27:40 394743 20220621161828.clean
   2022-06-21 17:23:45 411846 20220621171414.clean.inflight
   2022-06-21 17:23:45 411846 20220621171414.clean.requested
   2022-06-21 17:23:54 393003 20220621171414.clean
   2022-06-21 18:20:41 414588 20220621181027.clean.inflight
   2022-06-21 18:20:41 414588 20220621181027.clean.requested
   2022-06-21 18:20:49 395266 20220621181027.clean
   2022-06-21 19:18:16 413508 20220621190757.clean.inflight
   2022-06-21 19:18:16 413508 20220621190757.clean.requested
   2022-06-21 19:18:24 394486 20220621190757.clean
   2022-06-21 20:17:14 406109 20220621200554.clean.inflight
   2022-06-21 20:17:14 406109 20220621200554.clean.requested
   2022-06-21 20:17:23 387969 20220621200554.clean
   2022-06-21 21:16:47 408635 20220621210654.clean.requested
   2022-06-21 21:17:27 408635 20220621210654.clean.inflight
   2022-06-21 21:18:10 390031 20220621210654.clean
   2022-06-21 22:18:54 411488 20220621220740.clean.requested
   2022-06-21 22:19:14 411488 20220621220740.clean.inflight
   2022-06-21 22:19:33 392057 20220621220740.clean
   2022-06-21 23:22:34 407938 20220621231206.clean.requested
   2022-06-21 23:23:56 407938 20220621231206.clean.inflight
   2022-06-21 23:24:37 389097 20220621231206.clean
   2022-06-22 00:31:17 416307 20220622002044.clean.requested
   2022-06-22 00:32:31 416307 20220622002044.clean.inflight
   2022-06-22 00:33:29 395945 20220622002044.clean
   2022-06-22 01:42:12 420558 20220622012937.clean.requested
   2022-06-22 01:42:52 420558 20220622012937.clean.inflight
   2022-06-22 01:43:10 400307 20220622012937.clean
   2022-06-22 02:53:02 413953 20220622023950.clean.requested
   2022-06-22 02:54:16 413953 20220622023950.clean.inflight
   2022-06-22 02:54:34 394457 20220622023950.c

[jira] [Created] (HUDI-4305) Bulk_insert failed with flink-1.13

2022-06-22 Thread shuai.xu (Jira)
shuai.xu created HUDI-4305:
--

 Summary: Bulk_insert failed with flink-1.13 
 Key: HUDI-4305
 URL: https://issues.apache.org/jira/browse/HUDI-4305
 Project: Apache Hudi
  Issue Type: Bug
  Components: flink-sql
Affects Versions: 0.11.1
Reporter: shuai.xu
 Attachments: image-2022-06-23-12-06-11-154.png, 
image-2022-06-23-12-07-50-241.png

I wrote a flink-1.13 SQL job to write logs to Hudi. When I configure 'write.operation' = 'bulk_insert' to use bulk insert, the job fails to run with an exception  !image-2022-06-23-12-06-11-154.png!

I checked the compiled jar hudi-flink1.13-bundle_2.12-0.11.0.jar, and in SortOperator.class it shows  !image-2022-06-23-12-07-50-241.png!

This seems to use a class from flink-1.14.
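If so, rebuilding the Flink bundle against the 1.13 profile should be a workaround; a sketch of the build invocation (the property switches are an assumption, verify against the Hudi README for the 0.11 branch):

```
mvn clean package -DskipTests -Dflink1.13 -Dscala-2.12
```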



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904499095


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerge.java:
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.metadata.HoodieMetadataPayload;
+
+import java.io.IOException;
+import java.util.Properties;
+
+import static org.apache.hudi.TypeUtils.unsafeCast;
+
+public class HoodieAvroRecordMerge implements HoodieMerge {
+  @Override
+  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
+    HoodieRecordPayload picked = unsafeCast(((HoodieAvroRecord) newer).getData().preCombine(((HoodieAvroRecord) older).getData()));
+    if (picked instanceof HoodieMetadataPayload) {
+      // NOTE: HoodieMetadataPayload return a new payload
+      return new HoodieAvroRecord(newer.getKey(), ((HoodieMetadataPayload) picked), newer.getOperation());

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wulei0302 commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


wulei0302 commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r904499018


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieWriteHelper.java:
##
@@ -58,10 +58,8 @@ public HoodieData<HoodieRecord<T>> deduplicateRecords(
       return Pair.of(key, record);
     }).reduceByKey((rec1, rec2) -> {
       @SuppressWarnings("unchecked")
-      HoodieRecord reducedRec = rec2.preCombine(rec1);
-      HoodieKey reducedKey = rec1.getData().equals(reducedRec) ? rec1.getKey() : rec2.getKey();
-
-      return (HoodieRecord) reducedRec.newInstance(reducedKey);
+      HoodieRecord reducedRecord = hoodieMerge.preCombine(rec1, rec2);
+      return reducedRecord.newInstance();

Review Comment:
   Sorry, this should now directly return reducedRecord.
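   i.e., a sketch of the corrected reduce function (shape only; the surrounding reduceByKey call is as in the hunk above):

   ```
   }).reduceByKey((rec1, rec2) -> {
     @SuppressWarnings("unchecked")
     HoodieRecord reducedRecord = hoodieMerge.preCombine(rec1, rec2);
     // preCombine already returns whichever record wins, key included,
     // so no newInstance()/key re-wrapping is needed here
     return reducedRecord;
   ```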



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5890: [HUDI-4273] Support inline schedule clustering for Flink stream

2022-06-22 Thread GitBox


hudi-bot commented on PR #5890:
URL: https://github.com/apache/hudi/pull/5890#issuecomment-1163900725

   
   ## CI report:
   
   * 08e1fa6f7820b82180d3c0352c1f92f2b4fe2c6a UNKNOWN
   * f848aee6edc047f633744d272a88f079bcf23adf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9458)
 
   * 8ce4693d9abc58cf40530dabdb32f8b2f2526865 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9462)
 
   * 7ae535a205aa373c1967b6dcc752ec4ee67a551c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5938: Why Hudi publish data size much more than the input file size when publish to hive

2022-06-22 Thread GitBox


nsivabalan commented on issue #5938:
URL: https://github.com/apache/hudi/issues/5938#issuecomment-1163897410

   The "Getting small files from partitions" stage refers to reading existing data from Hudi to fetch the list of small file groups. So this could reflect your Hudi table size rather than your incoming data size. Is your existing Hudi table 2.6 TB?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5934: When reading the mor table with `QUERY_TYPE_SNAPSHOT`,Unable to correctly sort and de duplicate data by `PRECOMBINE_FIELD`.

2022-06-22 Thread GitBox


nsivabalan commented on issue #5934:
URL: https://github.com/apache/hudi/issues/5934#issuecomment-1163895715

   By default Hudi uses OverwriteWithLatestAvroPayload, which does not honor the precombine field in all code paths, specifically when records in the base file and records in the log files are merged together. You can try using DefaultHoodieRecordPayload to achieve this; see the config sketch after the links below.
   
https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepayloadclass
   https://hudi.apache.org/docs/configurations/#writepayloadclass
   https://hudi.apache.org/docs/configurations/#hoodiecompactionpayloadclass
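   For example, with the Spark datasource writer this would be roughly (an illustrative sketch; `ts` is a placeholder for your precombine field):

   ```
   hoodie.datasource.write.payload.class=org.apache.hudi.common.model.DefaultHoodieRecordPayload
   hoodie.datasource.write.precombine.field=ts
   ```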
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #5890: [HUDI-4273] Support inline schedule clustering for Flink stream

2022-06-22 Thread GitBox


danny0405 commented on code in PR #5890:
URL: https://github.com/apache/hudi/pull/5890#discussion_r904487370


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringPlanSourceFunction.java:
##
@@ -75,7 +74,7 @@ public void open(Configuration parameters) throws Exception {
   public void run(SourceContext<ClusteringPlanEvent> sourceContext) throws Exception {
 for (HoodieClusteringGroup clusteringGroup : 
clusteringPlan.getInputGroups()) {
   LOG.info("ClusteringPlanSourceFunction cluster " + clusteringGroup + " 
files");
-  sourceContext.collect(new 
ClusteringPlanEvent(this.instant.getTimestamp(), 
ClusteringGroupInfo.create(clusteringGroup), 
clusteringPlan.getStrategy().getStrategyParams()));

Review Comment:
   Execute clustering plan for instant {} as {} file slices ?



##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringPlanOperator.java:
##
@@ -0,0 +1,139 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink.clustering;
+
+import org.apache.hudi.avro.model.HoodieClusteringGroup;
+import org.apache.hudi.avro.model.HoodieClusteringPlan;
+import org.apache.hudi.common.model.ClusteringGroupInfo;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.table.HoodieFlinkTable;
+import org.apache.hudi.util.ClusteringUtil;
+import org.apache.hudi.util.FlinkTables;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.annotation.VisibleForTesting;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
+import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
+import org.apache.flink.streaming.api.operators.Output;
+import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;
+
+/**
+ * Operator that generates the clustering plan with pluggable strategies on 
finished checkpoints.
+ *
+ * It should be singleton to avoid conflicts.
+ */
+public class ClusteringPlanOperator extends AbstractStreamOperator<ClusteringPlanEvent>
+implements OneInputStreamOperator<Object, ClusteringPlanEvent> {
+
+  /**
+   * Config options.
+   */
+  private final Configuration conf;
+
+  /**
+   * Meta Client.
+   */
+  @SuppressWarnings("rawtypes")
+  private transient HoodieFlinkTable table;
+
+  public ClusteringPlanOperator(Configuration conf) {
+this.conf = conf;
+  }
+
+  @Override
+  public void open() throws Exception {
+super.open();
+this.table = FlinkTables.createTable(conf, getRuntimeContext());
+// when starting up, rolls back all the inflight clustering instants if 
there exists,
+// these instants are in priority for scheduling task because the 
clustering instants are
+// scheduled from earliest(FIFO sequence).
+ClusteringUtil.rollbackClustering(table, 
StreamerUtil.createWriteClient(conf, getRuntimeContext()));
+  }
+
+  @Override
+  public void processElement(StreamRecord<Object> streamRecord) {
+// no operation
+  }
+
+  @Override
+  public void notifyCheckpointComplete(long checkpointId) {
+try {
+  table.getMetaClient().reloadActiveTimeline();
+  scheduleClustering(table, checkpointId);
+} catch (Throwable throwable) {
+  // make it fail-safe
+  LOG.error("Error while scheduling clustering plan for checkpoint: " + 
checkpointId, throwable);
+}
+  }
+
+  private void scheduleClustering(HoodieFlinkTable<?> table, long checkpointId) {
+// the first instant takes the highest priority.
+Option<HoodieInstant> firstRequested = Option.fromJavaOptional(
+
ClusteringUtils.getPendingClusteringInstantTimes(table.getMetaClient()).stream()
+.filter(instant -> instant.getState() == 
HoodieInstant.State.REQUESTED).findFirst());
+if (!firstRequested.isPresent()) {
+  // do nothing.
+  LOG.info("No clustering plan for checkpoint " + checkpointId);
+  return;
+}
+
+String clusteringInstantTime = firstRequested.get().getTimestamp();
+
+// generate clust

[GitHub] [hudi] nsivabalan commented on issue #5291: [SUPPORT] How to use hudi-defaults.conf with Glue

2022-06-22 Thread GitBox


nsivabalan commented on issue #5291:
URL: https://github.com/apache/hudi/issues/5291#issuecomment-1163872711

   @umehrot2 @zhedoubushishi : we may need to support this for "cluster" mode 
as well. As of now, some code change is required, which is not easy to maintain. 
Can we come up with a plan to fix this for 0.12? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] leoyy0316 commented on issue #5932: can not delete data when use spark scala code

2022-06-22 Thread GitBox


leoyy0316 commented on issue #5932:
URL: https://github.com/apache/hudi/issues/5932#issuecomment-1163872024

   @nsivabalan @YannByron 
   I have tried another way:
   
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.{HoodieStorageConfig, HoodieWriteConfig}
   import org.apache.spark.sql.{SaveMode, SparkSession}
   import org.apache.spark.sql.functions.lit
   import org.apache.spark.sql.types.{BooleanType, StructField, StructType}
   
   val df = spark.sql("select *, concat(order_no,order_type) as id, 0 as ts from ods_us.ods_cis_dbo_order_header limit 5")
   
   val df1 = df.withColumn("_hoodie_is_deleted", lit(false).cast(BooleanType))
   
   df1.write.format("org.apache.hudi")
     .option(DataSourceWriteOptions.RECORDKEY_FIELD.key(), "id")
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD.key(), "ts")
     .option(DataSourceWriteOptions.HIVE_DATABASE.key(), "temp_db")
     .option(DataSourceWriteOptions.HIVE_TABLE.key(), "scala_ods_cis_dbo_order_header_leo_1")
     .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED.key(), "true")
     .option(DataSourceWriteOptions.HIVE_URL.key(), "jdbc:hive2://xxx:xxx")
     .option(DataSourceWriteOptions.HIVE_USER.key(), "xxx")
     .option(DataSourceWriteOptions.HIVE_PASS.key(), "xxx")
     .option(DataSourceWriteOptions.SQL_INSERT_MODE.key(), "non-strict")
     .option(HoodieWriteConfig.TBL_NAME.key(), "scala_ods_cis_dbo_order_header_leo_1")
     .option("hoodie.bulkinsert.shuffle.parallelism", 2)
     .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
     .mode(SaveMode.Append)
     .save("/user/xxx/testhudi/scala_ods_cis_dbo_order_header_leo_1")
   
   val df_delete = spark.sql("select * from temp_db.scala_ods_cis_dbo_order_header_leo_1 limit 1")
   
   val dff = df_delete.withColumn("_hoodie_is_deleted", lit(true).cast(BooleanType))
   
   dff.write.format("org.apache.hudi")
     .option(DataSourceWriteOptions.RECORDKEY_FIELD.key(), "id")
     .option(DataSourceWriteOptions.HIVE_DATABASE.key(), "temp_db")
     .option(DataSourceWriteOptions.HIVE_TABLE.key(), "scala_ods_cis_dbo_order_header_leo_1")
     .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED.key(), "true")
     .option(DataSourceWriteOptions.HIVE_URL.key(), "jdbc:hive2://xxx:xxx")
     .option(DataSourceWriteOptions.HIVE_USER.key(), "xxx")
     .option(DataSourceWriteOptions.HIVE_PASS.key(), "xxx")
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD.key(), "ts")
     .option("hoodie.insert.shuffle.parallelism", "2")
     .option("hoodie.upsert.shuffle.parallelism", "2")
     .option("hoodie.bulkinsert.shuffle.parallelism", "2")
     .option("hoodie.delete.shuffle.parallelism", "2")
     .mode(SaveMode.Append)
     .save("/user/xxx/testhudi/scala_ods_cis_dbo_order_header_leo_1")


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5890: [HUDI-4273] Support inline schedule clustering for Flink stream

2022-06-22 Thread GitBox


hudi-bot commented on PR #5890:
URL: https://github.com/apache/hudi/pull/5890#issuecomment-1163870928

   
   ## CI report:
   
   * 08e1fa6f7820b82180d3c0352c1f92f2b4fe2c6a UNKNOWN
   * f848aee6edc047f633744d272a88f079bcf23adf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9458)
 
   * 8ce4693d9abc58cf40530dabdb32f8b2f2526865 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9462)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5890: [HUDI-4273] Support inline schedule clustering for Flink stream

2022-06-22 Thread GitBox


hudi-bot commented on PR #5890:
URL: https://github.com/apache/hudi/pull/5890#issuecomment-1163868853

   
   ## CI report:
   
   * 08e1fa6f7820b82180d3c0352c1f92f2b4fe2c6a UNKNOWN
   * f848aee6edc047f633744d272a88f079bcf23adf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9458)
 
   * 8ce4693d9abc58cf40530dabdb32f8b2f2526865 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] leoyy0316 commented on issue #5932: can not delete data when use spark scala code

2022-06-22 Thread GitBox


leoyy0316 commented on issue #5932:
URL: https://github.com/apache/hudi/issues/5932#issuecomment-1163866883

   @nsivabalan @YannByron 
   For `val df = spark.sql("select * from temp_db.hudi_mor_tbl_ts_delete limit 1")`, 
the df did show the full record, and I set `.option("hoodie.datasource.write.operation", 
WriteOperationType.DELETE.value())` in the Scala code when deleting; I copied this 
code from the Hudi example HoodieDataSourceExample.
   
   I use Spark 3.0.1 with some configs:
   "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
   "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
   
   I found a phenomenon: when I query from Hive, the data is successfully deleted, 
but when I query from Spark, the data is not deleted. A cache-refresh sketch follows below.
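   
   A minimal sketch of ruling out stale Spark-side caching before re-querying 
(`refreshTable` is standard Spark; the table name matches the example above):
   ```
   // drop any cached plans/files for the table, then query again
   spark.catalog.refreshTable("temp_db.hudi_mor_tbl_ts_delete")
   spark.sql("select * from temp_db.hudi_mor_tbl_ts_delete").show(false)
   ```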
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5927: [HUDI-4292] Update the RFC-46 doc because the Record Merge API is changed from CombineEngine to HoodieMerge

2022-06-22 Thread GitBox


hudi-bot commented on PR #5927:
URL: https://github.com/apache/hudi/pull/5927#issuecomment-1163866749

   
   ## CI report:
   
   * 1784fe48a0c573597b9c4aa8a9b352f4379a7554 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9461)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #5917: [HUDI-4279] Strength the remote fs view lagging check when latest com…

2022-06-22 Thread GitBox


danny0405 commented on code in PR #5917:
URL: https://github.com/apache/hudi/pull/5917#discussion_r904468953


##
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:
##
@@ -138,14 +137,31 @@ private boolean isLocalViewBehind(Context ctx) {
 String localTimelineHash = localTimeline.getTimelineHash();
 // refresh if timeline hash mismatches and if local's last known instant < 
client's last known instant (if config is enabled)
 if (!localTimelineHash.equals(timelineHashFromClient)
-&& (!timelineServiceConfig.refreshTimelineBasedOnLatestCommit || 
HoodieTimeline.compareTimestamps(localLastKnownInstant, 
HoodieTimeline.LESSER_THAN, lastKnownInstantFromClient))) {
+&& (!timelineServiceConfig.refreshTimelineBasedOnLatestCommit
+|| localTimelineBehind(localTimeline, lastKnownInstantFromClient, 
numInstantsFromClient))) {
   return true;
 }
 
 // As a safety check, even if hash is same, ensure instant is present
 return 
!localTimeline.containsOrBeforeTimelineStarts(lastKnownInstantFromClient);
   }
 
+  private static boolean localTimelineBehind(HoodieTimeline localTimeline, 
String lastKnownInstantFromClient, String numInstantsFromClient) {
+String localLastKnownInstant = localTimeline.lastInstant().isPresent() ? 
localTimeline.lastInstant().get().getTimestamp()
+: HoodieTimeline.INVALID_INSTANT_TS;
+// Why comparing the num commits ?
+// Assumes there are 4 commits on the timeline:
+// timestamp(action): ts_0(commit), ts_1(commit), ts_2(clean), ts_3(commit)
+// when ts_1 is in INFLIGHT state, ts_2 clean action is already finished,
+// after ts_1 triggers #sync, the local timeline is refreshed as [ts_0, 
ts_2],
+// when ts_1 switches state from INFLIGHT to COMPLETED, no #sync triggers.
+// at ts_3, when the fs view snapshot is requested, the ts_3 client 
timeline should be [ts_0, ts_1, ts_2],
+// if we only compare the latest commit, the local timeline is NOT behind, 
but the fs view is not complete
+// because ts_1 is lost.
+return HoodieTimeline.compareTimestamps(localLastKnownInstant, 
HoodieTimeline.LESSER_THAN, lastKnownInstantFromClient)
+|| localTimeline.countInstants() < 
Integer.parseInt(numInstantsFromClient);

Review Comment:
   We can, I guess, but keeping
   
   `HoodieTimeline.compareTimestamps(localLastKnownInstant, 
HoodieTimeline.LESSER_THAN, lastKnownInstantFromClient)`
   
   as a fast check is fine, I think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-4300) Add sync clean and archive for compaction service in Spark Env

2022-06-22 Thread aidendong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

aidendong reassigned HUDI-4300:
---

Assignee: aidendong

> Add sync clean and archive for compaction service in Spark Env
> --
>
> Key: HUDI-4300
> URL: https://issues.apache.org/jira/browse/HUDI-4300
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, spark
>Reporter: aidendong
>Assignee: aidendong
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current situation is to provide asynchronous clean and archive in 
> compaction.
>  
> {code:java}
> // SparkRDDWriteClient.java
> @Override
> protected HoodieWriteMetadata<JavaRDD<WriteStatus>> compact(String 
> compactionInstantTime, boolean shouldComplete) {
>   HoodieSparkTable<T> table = HoodieSparkTable.create(config, context);
>   preWrite(compactionInstantTime, WriteOperationType.COMPACT, 
> table.getMetaClient());
>  
> } {code}
> The asynchronous archive will acquire a distributed lock when 
> hoodie.write.concurrency.mode=OPTIMISTIC_CONCURRENCY_CONTROL.
> *Archive may be locked for a long time* 
> for example in a Spark env, with offline scheduleAndCompaction and 
> hoodie.write.concurrency.mode=OPTIMISTIC_CONCURRENCY_CONTROL.
> Maybe all tasks work on compaction, and the archive function does 
> not have enough resources to make progress once it gets the lock.
> I think we can provide synchronous clean and archive for users to choose.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] xiarixiaoyao commented on issue #5915: [SUPPORT] Schema Evolution - Changing field type from Date to String is not supported

2022-06-22 Thread GitBox


xiarixiaoyao commented on issue #5915:
URL: https://github.com/apache/hudi/issues/5915#issuecomment-1163853370

   @Reimus  only full schema evolution supports changing the date type to string.
   
   When you use this feature, you should execute `set 
hoodie.schema.on.read.enable=true` before executing the alter command; a sketch 
follows the link below. 
   https://hudi.apache.org/docs/schema_evolution
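   
   A minimal Scala sketch of that ordering (the table and column names are 
hypothetical; assumes a SparkSession with the Hudi extensions enabled):
   ```
   spark.sql("set hoodie.schema.on.read.enable=true")                      // must come first
   spark.sql("alter table hudi_tbl alter column event_date type string")  // hypothetical table/column
   ```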


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on issue #5932: can not delete data when use spark scala code

2022-06-22 Thread GitBox


YannByron commented on issue #5932:
URL: https://github.com/apache/hudi/issues/5932#issuecomment-1163848379

   ```
   val spark = SparkSession
   .builder
   .appName("delete_init_leo")
   .enableHiveSupport()
   .getOrCreate()
   ```
   you also need to provide the `spark.serializer`, `spark.sql.extensions`, and 
`spark.sql.catalog.spark_catalog` configs, which you probably already provide when 
using spark-sql. A sketch with those configs follows below.
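   
   A minimal sketch of the same builder with those configs (the HoodieCatalog 
entry assumes Hudi 0.11+ on Spark 3.2; on older versions the `spark_catalog` 
config can be omitted):
   ```
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession
     .builder
     .appName("delete_init_leo")
     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
     .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
     .enableHiveSupport()
     .getOrCreate()
   ```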


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Reimus commented on issue #5915: [SUPPORT] Schema Evolution - Changing field type from Date to String is not supported

2022-06-22 Thread GitBox


Reimus commented on issue #5915:
URL: https://github.com/apache/hudi/issues/5915#issuecomment-1163824902

   @minihippo 
   No, just a regular field.
   It was included in the data skipping index though.
   
   Here are the creation / update spark.write commands I'm using to add/save data:
   ```
ds.write
 .format("hudi")
 .mode(SaveMode.Append)
 .option(DataSourceWriteOptions.PRECOMBINE_FIELD.key, "ts")
 .option(DataSourceWriteOptions.RECORDKEY_FIELD.key, "id")
 .option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key, "ym")
 .option(DataSourceWriteOptions.OPERATION.key, 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
 .option(HoodieWriteConfig.TBL_NAME.key, tableName)
 .option(DataSourceWriteOptions.RECONCILE_SCHEMA.key, "true")
 .option(DataSourceWriteOptions.TABLE_TYPE.key, 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
 .option(DataSourceWriteOptions.OPERATION.key, 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
 .option(HoodieTableConfig.TIMELINE_TIMEZONE.key, 
HoodieTimelineTimeZone.UTC.name)
 .option(HoodieWriteConfig.SCHEMA_EVOLUTION_ENABLE.key, "true")
 .option(HoodieWriteConfig.AVRO_SCHEMA_VALIDATE_ENABLE.key, "true")
 .option(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key, 
WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.name())
 .option(HoodieIndexConfig.BLOOM_FILTER_TYPE.key, 
BloomFilterTypeCode.DYNAMIC_V0.name)
 .option(HoodieIndexConfig.BLOOM_FILTER_NUM_ENTRIES_VALUE.key, 
String.valueOf(10))
 .option(HoodieIndexConfig.BLOOM_INDEX_USE_METADATA.key, "true")
 .option(HoodieIndexConfig.BLOOM_INDEX_FILTER_DYNAMIC_MAX_ENTRIES.key, 
String.valueOf(100))
 .option(HoodieLockConfig.HIVE_DATABASE_NAME.key, databaseName)
 .option(HoodieLockConfig.HIVE_TABLE_NAME.key, tableName)
 .option(HoodieLockConfig.HIVE_METASTORE_URI.key, 
env.spark.hiveMetastore)
 .option(HoodieLockConfig.LOCK_PROVIDER_CLASS_NAME.key, 
classOf[org.apache.hudi.hive.HiveMetastoreBasedLockProvider].getName)
 .option(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key, 
String.valueOf(256 * 1024 * 1024))
 .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE.key, String.valueOf(256 
* 1024 * 1024))
 .option(HoodieCompactionConfig.AUTO_CLEAN.key, "true")
 .option(HoodieCompactionConfig.FAILED_WRITES_CLEANER_POLICY.key, 
HoodieFailedWritesCleaningPolicy.LAZY.name)
   
 .option(HoodieCompactionConfig.CLEANER_POLICY.key, 
HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS.name())
 .option(HoodieCompactionConfig.CLEANER_HOURS_RETAINED.key, 
String.valueOf(24))
 .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key, 
String.valueOf(104857600))
   
 .option(HoodieMetadataConfig.COLUMN_STATS_INDEX_FOR_COLUMNS.key, 
"ym,ymd,date,ts,lvl1.ymd,lvl1.lvl2.date")
 .option(HoodieMetadataConfig.BLOOM_FILTER_INDEX_FOR_COLUMNS.key, 
"id,col1,col2")
 .option(HoodieMetadataConfig.POPULATE_META_FIELDS.key, "true")
 .option(HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key, 
"true")
 .option(HoodieMetadataConfig.ENABLE_METADATA_INDEX_BLOOM_FILTER.key, 
"true")
 .option(HoodieMetadataConfig.ENABLE.key, "true")
 .save("/tmp/hudi")
   ```
   
   Partition field is `ym` field (string) - ymd is a regular date field.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YuweiXiao commented on issue #5770: [SUPPORT] hoodie.parquet.max.file.size Property is Being Ignored

2022-06-22 Thread GitBox


YuweiXiao commented on issue #5770:
URL: https://github.com/apache/hudi/issues/5770#issuecomment-1163824258

   Clustering is a table service that re-organizes the data files' layout; maybe 
it is not relevant in your case.
   
   About the `average record size`: Hudi uses this config 
(`hoodie.copyonwrite.record.size.estimate`, default 1 KB) to estimate the total 
file size in your initial write (when there are no available commits to compute 
the average size). A sketch follows below.
   
   The code for file size management is in `UpsertPartitioner::assignInserts`. 
Maybe you could also check the logs and see if it takes effect.
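   
   A minimal sketch of overriding that estimate on the first write (the 2 KB 
value and the base path are hypothetical):
   ```
   df.write.format("hudi")
     .option("hoodie.parquet.max.file.size", (256L * 1024 * 1024).toString)  // target max base file size
     .option("hoodie.copyonwrite.record.size.estimate", "2048")              // hypothetical avg record size in bytes
     .mode(org.apache.spark.sql.SaveMode.Append)
     .save("/tmp/hudi_tbl")                                                  // hypothetical base path
   ```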


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5927: [HUDI-4292] Update the RFC-46 doc because the Record Merge API is changed from CombineEngine to HoodieMerge

2022-06-22 Thread GitBox


hudi-bot commented on PR #5927:
URL: https://github.com/apache/hudi/pull/5927#issuecomment-1163809615

   
   ## CI report:
   
   * 8fadf110223c07eb561aa8f80d6cd45bd5e8bacc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9460)
 
   * 1784fe48a0c573597b9c4aa8a9b352f4379a7554 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9461)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5927: [HUDI-4292] Update the RFC-46 doc because the Record Merge API is changed from CombineEngine to HoodieMerge

2022-06-22 Thread GitBox


hudi-bot commented on PR #5927:
URL: https://github.com/apache/hudi/pull/5927#issuecomment-1163807741

   
   ## CI report:
   
   * 8fadf110223c07eb561aa8f80d6cd45bd5e8bacc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9460)
 
   * 1784fe48a0c573597b9c4aa8a9b352f4379a7554 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wulei0302 commented on pull request #5927: [HUDI-4292] Update the RFC-46 doc because the Record Merge API is changed from CombineEngine to HoodieMerge

2022-06-22 Thread GitBox


wulei0302 commented on PR #5927:
URL: https://github.com/apache/hudi/pull/5927#issuecomment-1163806308

   @hudi-bot  run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] bhasudha opened a new pull request, #5944: [DOCS] Fix trino jar bundle in query engine setup page

2022-06-22 Thread GitBox


bhasudha opened a new pull request, #5944:
URL: https://github.com/apache/hudi/pull/5944

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] LinMingQiang closed issue #5886: Throw `NoSuchElementException: FileID xx of partition path xx does not exist.` when execute `HoodieMergeHandle.getLatestBaseFile` but FileID is exist

2022-06-22 Thread GitBox


LinMingQiang closed issue #5886: Throw `NoSuchElementException:  FileID xx of 
partition path xx does not exist.` when execute 
`HoodieMergeHandle.getLatestBaseFile` but FileID is exist in path.
URL: https://github.com/apache/hudi/issues/5886


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5548: [SUPPORT] Hudi global configuration on EMR

2022-06-22 Thread GitBox


nsivabalan commented on issue #5548:
URL: https://github.com/apache/hudi/issues/5548#issuecomment-1163798341

   @umehrot2 @zhedoubushishi : Can you follow up on this issue please. 
   CC @yihua 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4186) Support Hudi with Spark 3.3

2022-06-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4186:
-
Labels: pull-request-available release-blocker  (was: release-blocker)

> Support Hudi with Spark 3.3
> ---
>
> Key: HUDI-4186
> URL: https://issues.apache.org/jira/browse/HUDI-4186
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: spark
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Critical
>  Labels: pull-request-available, release-blocker
> Fix For: 0.12.0
>
>
> Spark 3.3 voting is currently in progress and should likely go through soon: 
> https://github.com/apache/spark/tree/v3.3.0-rc4
> We should support it for our next major release, 0.12.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] CTTY opened a new pull request, #5943: [HUDI-4186] [WIP] Support Hudi with Spark 3.3.0

2022-06-22 Thread GitBox


CTTY opened a new pull request, #5943:
URL: https://github.com/apache/hudi/pull/5943

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   Support Spark 3.3.0
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5932: can not delete data when use spark scala code

2022-06-22 Thread GitBox


nsivabalan commented on issue #5932:
URL: https://github.com/apache/hudi/issues/5932#issuecomment-1163662167

   did you verify that 
   ```
   val df = spark.sql("select * from temp_db.hudi_mor_tbl_ts_delete limit 1")
   ```
   above df was valid and did show the full record? Because we do have unit tests 
around this, I am not sure how this is failing. Are you sure you are setting the 
operation type correctly to "delete"? 
   
   In fact, our quick start goes through a similar example. 
   CC @minihippo @YannByron : Is there anything specific to the spark-sql layer 
that could impact how deletes are done? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5916: [SUPPORT] `show fsview latest` throwing IllegalStateException...pending compactions for merge_on_read table

2022-06-22 Thread GitBox


nsivabalan commented on issue #5916:
URL: https://github.com/apache/hudi/issues/5916#issuecomment-1163656421

   I guess @minihippo is asking you to list the ".hoodie" folder and post the 
output here; ensure the result is sorted by file modification time, e.g. with the sketch below. 
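   
   A minimal spark-shell sketch that produces such a listing (the base path is 
hypothetical):
   ```
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
   fs.listStatus(new Path("/path/to/table/.hoodie"))  // hypothetical base path
     .sortBy(_.getModificationTime)                   // sort by file mod time
     .foreach(s => println(s"${s.getModificationTime}  ${s.getPath.getName}"))
   ```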
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5904: [SUPPORT] Spark saveAsTable() not working as expected

2022-06-22 Thread GitBox


nsivabalan commented on issue #5904:
URL: https://github.com/apache/hudi/issues/5904#issuecomment-1163652199

   @YannByron @minihippo: Can either of you folks follow up on this issue 
please. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5899: [SUPPORT] MERGE INTO with UPDATE */ INESRT * - new incoming columns dropped, automatic schema evolution feature

2022-06-22 Thread GitBox


nsivabalan commented on issue #5899:
URL: https://github.com/apache/hudi/issues/5899#issuecomment-1163647932

   @xiarixiaoyao : Can you look into this issue? It looks like it's related to 
schema evolution. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5893: [SUPPORT] Hudi write commit failing with PostgresDebeziumSource, SchemaRegistryProvider and PostgresDebeziumAvroPayload

2022-06-22 Thread GitBox


nsivabalan commented on issue #5893:
URL: https://github.com/apache/hudi/issues/5893#issuecomment-1163645638

   @BalaMahesh : gentle ping. 
   @rmahindra123 : gentle ping. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5886: Throw `NoSuchElementException: FileID xx of partition path xx does not exist.` when execute `HoodieMergeHandle.getLatestBaseFile` but FileID is ex

2022-06-22 Thread GitBox


nsivabalan commented on issue #5886:
URL: https://github.com/apache/hudi/issues/5886#issuecomment-1163643955

   @LinMingQiang : So, with the linked patch, I assume things are good? Can we 
close the GitHub issue? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #5917: [HUDI-4279] Strength the remote fs view lagging check when latest com…

2022-06-22 Thread GitBox


nsivabalan commented on code in PR #5917:
URL: https://github.com/apache/hudi/pull/5917#discussion_r904268325


##
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:
##
@@ -138,14 +137,31 @@ private boolean isLocalViewBehind(Context ctx) {
 String localTimelineHash = localTimeline.getTimelineHash();
 // refresh if timeline hash mismatches and if local's last known instant < 
client's last known instant (if config is enabled)
 if (!localTimelineHash.equals(timelineHashFromClient)
-&& (!timelineServiceConfig.refreshTimelineBasedOnLatestCommit || 
HoodieTimeline.compareTimestamps(localLastKnownInstant, 
HoodieTimeline.LESSER_THAN, lastKnownInstantFromClient))) {
+&& (!timelineServiceConfig.refreshTimelineBasedOnLatestCommit
+|| localTimelineBehind(localTimeline, lastKnownInstantFromClient, 
numInstantsFromClient))) {
   return true;
 }
 
 // As a safety check, even if hash is same, ensure instant is present
 return 
!localTimeline.containsOrBeforeTimelineStarts(lastKnownInstantFromClient);
   }
 
+  private static boolean localTimelineBehind(HoodieTimeline localTimeline, 
String lastKnownInstantFromClient, String numInstantsFromClient) {
+String localLastKnownInstant = localTimeline.lastInstant().isPresent() ? 
localTimeline.lastInstant().get().getTimestamp()
+: HoodieTimeline.INVALID_INSTANT_TS;
+// Why comparing the num commits ?
+// Assumes there are 4 commits on the timeline:
+// timestamp(action): ts_0(commit), ts_1(commit), ts_2(clean), ts_3(commit)
+// when ts_1 is in INFLIGHT state, ts_2 clean action is already finished,
+// after ts_1 triggers #sync, the local timeline is refreshed as [ts_0, 
ts_2],
+// when ts_1 switches state from INFLIGHT to COMPLETED, no #sync triggers.
+// at ts_3, when the fs view snapshot is requested, the ts_3 client 
timeline should be [ts_0, ts_1, ts_2],
+// if we only compare the latest commit, the local timeline is NOT behind, 
but the fs view is not complete
+// because ts_1 is lost.
+return HoodieTimeline.compareTimestamps(localLastKnownInstant, 
HoodieTimeline.LESSER_THAN, lastKnownInstantFromClient)
+|| localTimeline.countInstants() < 
Integer.parseInt(numInstantsFromClient);

Review Comment:
   Should we check just "<", or "!=" to catch any mismatch? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4304) DELETE_PARTITION doesn't raise an exception although the target partition doesn't exist

2022-06-22 Thread Gatsby Lee (Jira)
Gatsby Lee created HUDI-4304:


 Summary: DELETE_PARTITION doesn't raise an exception although the 
target partition doesn't exist
 Key: HUDI-4304
 URL: https://issues.apache.org/jira/browse/HUDI-4304
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Gatsby Lee


My stack is AWS Glue + AWS Glue Catalog + S3.
When I use DELETE_PARTITION, I see different behavior between 0.10.1 and 
0.11.0.
In 0.10.1, when I ran DELETE_PARTITION for a non-existing partition, it 
failed (an exception was raised) because the partition doesn't exist.



[~shivnarayan] confirmed that the expected behavior is raising an exception when 
DELETE_PARTITION tries to delete a non-existing partition.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4303) Partition pruning fails for non-string partition field in Spark

2022-06-22 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-4303:
---

 Summary: Partition pruning fails for non-string partition field in 
Spark 
 Key: HUDI-4303
 URL: https://issues.apache.org/jira/browse/HUDI-4303
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark
Affects Versions: 0.11.1
Reporter: Ethan Guo
 Fix For: 0.12.0


When querying a partitioned Hudi table storing github archive data (schema 
shown below) with the partition field in timestamp type, a query that triggers 
partition pruning fails due to a ClassCastException.

Environment: Spark 3.2.1, Hudi 0.11.1.  With Hudi 0.10.0, the filtering works.

Schema:
{code:java}
scala> df.printSchema
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- type: string (nullable = true)
 |-- public: boolean (nullable = false)
 |-- payload: string (nullable = true)
 |-- repo: struct (nullable = false)
 |    |-- id: long (nullable = false)
 |    |-- name: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- actor: struct (nullable = false)
 |    |-- id: long (nullable = false)
 |    |-- login: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- url: string (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |-- org: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- url: string (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- id: string (nullable = true)
 |-- other: string (nullable = true) {code}
hoodie.properties:
{code:java}
hoodie.table.name=github-raw
hoodie.table.type=MERGE_ON_READ
hoodie.table.version=4
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
hoodie.archivelog.folder=archived
hoodie.table.base.file.format=PARQUET
hoodie.table.precombine.field=created_at
hoodie.table.partition.fields=created_at
hoodie.table.recordkey.fields=id
hoodie.populate.meta.fields=true
hoodie.table.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.timeline.layout.version=1
hoodie.table.checksum=3814878680 {code}
Full stack trace:
{code:java}
scala> val df = 
spark.read.format("hudi").load("").filter(col("created_at").between("2021-10",
 "2022-03"))
scala> df.count
java.lang.ClassCastException: java.lang.Long cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:49)
  at scala.math.LowPriorityOrderingImplicits$$anon$2.compare(Ordering.scala:150)
  at scala.math.Ordering.gteq(Ordering.scala:94)
  at scala.math.Ordering.gteq$(Ordering.scala:94)
  at scala.math.LowPriorityOrderingImplicits$$anon$2.gteq(Ordering.scala:149)
  at 
org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.nullSafeEval(predicates.scala:1153)
  at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:574)
  at org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:724)
  at org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:720)
  at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:64)
  at 
org.apache.hudi.SparkHoodieTableFileIndex.$anonfun$prunePartition$4(SparkHoodieTableFileIndex.scala:186)
  at 
org.apache.hudi.SparkHoodieTableFileIndex.$anonfun$prunePartition$4$adapted(SparkHoodieTableFileIndex.scala:186)
  at 
scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
  at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
  at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
  at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
  at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
  at 
org.apache.hudi.SparkHoodieTableFileIndex.prunePartition(SparkHoodieTableFileIndex.scala:186)
  at 
org.apache.hudi.SparkHoodieTableFileIndex.listFileSlices(SparkHoodieTableFileIndex.scala:147)
  at 
org.apache.hudi.MergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:104)
  at 
org.apache.hudi.MergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:41)
  at org.apache.hudi.HoodieBaseRelation.buildScan(Hoo

[GitHub] [hudi] h7kanna commented on pull request #5815: [HUDI-4213] Infer keygen clazz for Spark SQL

2022-06-22 Thread GitBox


h7kanna commented on PR #5815:
URL: https://github.com/apache/hudi/pull/5815#issuecomment-1163529628

   https://github.com/apache/hudi/issues/5548 This is the root cause of my 
problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] h7kanna commented on issue #5548: [SUPPORT] Hudi global configuration on EMR

2022-06-22 Thread GitBox


h7kanna commented on issue #5548:
URL: https://github.com/apache/hudi/issues/5548#issuecomment-1163516414

   Related https://github.com/apache/hudi/issues/5291


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4291) Test TestCleanPlanExecutor.testKeepLatestFileVersions is flaky

2022-06-22 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4291:
--
Epic Link: HUDI-4302

> Test TestCleanPlanExecutor.testKeepLatestFileVersions is flaky
> --
>
> Key: HUDI-4291
> URL: https://issues.apache.org/jira/browse/HUDI-4291
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Danny Chen
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.12.0
>
>
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9418/logs/33]
>  
>  
> https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9413/logs/36



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4296) Test TestHoodieSparkSqlWriter.testSchemaEvolutionForTableType is flaky

2022-06-22 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4296:
--
Epic Link: HUDI-4302

> Test TestHoodieSparkSqlWriter.testSchemaEvolutionForTableType is flaky
> --
>
> Key: HUDI-4296
> URL: https://issues.apache.org/jira/browse/HUDI-4296
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Danny Chen
>Priority: Major
>
> https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9416/logs/35



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4297) Test TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersWithoutConflicts is flaky

2022-06-22 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4297:
--
Epic Link: HUDI-4302

> Test 
> TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWritersWithoutConflicts
>  is flaky
> -
>
> Key: HUDI-4297
> URL: https://issues.apache.org/jira/browse/HUDI-4297
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Danny Chen
>Priority: Major
>
> https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9418/logs/36



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4302) CI Instability / flaky tests

2022-06-22 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-4302:
-

 Summary: CI Instability / flaky tests
 Key: HUDI-4302
 URL: https://issues.apache.org/jira/browse/HUDI-4302
 Project: Apache Hudi
  Issue Type: Epic
  Components: tests-ci
Reporter: sivabalan narayanan


Creating an EPIC to track the flaky tests



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4295) Test TestHoodieActiveTimeline.testCreateNewInstantTime is flaky

2022-06-22 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4295:
--
Epic Link: HUDI-4302

> Test TestHoodieActiveTimeline.testCreateNewInstantTime is flaky
> ---
>
> Key: HUDI-4295
> URL: https://issues.apache.org/jira/browse/HUDI-4295
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Danny Chen
>Priority: Major
>
> https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9416/logs/16



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5786: [HUDI-2955] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default (rebase)

2022-06-22 Thread GitBox


hudi-bot commented on PR #5786:
URL: https://github.com/apache/hudi/pull/5786#issuecomment-1163481245

   
   ## CI report:
   
   * d0eccf905d4778c46160e08fb48d4087bfd3d5d3 UNKNOWN
   * 0519143412014cf44a61571635eea4beb5688638 UNKNOWN
   * cdb27bea041845d0056ab013f8f637dbbe0cc739 UNKNOWN
   * b74fd7780201b26fd2ca6036fa8e653926222329 UNKNOWN
   * d3036a6419a59f99c52588a66b1c016a1ec3eabf UNKNOWN
   * a3ab3a5dd76d279a456c810839830ac2d324907d UNKNOWN
   * 66fc1912e14fce4ceba3b2b5f3122d531b783aeb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9439)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5786: [HUDI-2955] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default (rebase)

2022-06-22 Thread GitBox


hudi-bot commented on PR #5786:
URL: https://github.com/apache/hudi/pull/5786#issuecomment-1163477311

   
   ## CI report:
   
   * d0eccf905d4778c46160e08fb48d4087bfd3d5d3 UNKNOWN
   * 0519143412014cf44a61571635eea4beb5688638 UNKNOWN
   * cdb27bea041845d0056ab013f8f637dbbe0cc739 UNKNOWN
   * b74fd7780201b26fd2ca6036fa8e653926222329 UNKNOWN
   * d3036a6419a59f99c52588a66b1c016a1ec3eabf UNKNOWN
   * a3ab3a5dd76d279a456c810839830ac2d324907d UNKNOWN
   * 66fc1912e14fce4ceba3b2b5f3122d531b783aeb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9439)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rahil-c commented on pull request #5786: [HUDI-2955] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default (rebase)

2022-06-22 Thread GitBox


rahil-c commented on PR #5786:
URL: https://github.com/apache/hudi/pull/5786#issuecomment-1163476933

   @hudi-bot run azure 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5927: [HUDI-4292] Update the RFC-46 doc because the Record Merge API is changed from CombineEngine to HoodieMerge

2022-06-22 Thread GitBox


hudi-bot commented on PR #5927:
URL: https://github.com/apache/hudi/pull/5927#issuecomment-1163469475

   
   ## CI report:
   
   * 8fadf110223c07eb561aa8f80d6cd45bd5e8bacc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9460)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] vicuna96 opened a new issue, #5942: [SUPPORT] Partial Update on Global Index with BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE

2022-06-22 Thread GitBox


vicuna96 opened a new issue, #5942:
URL: https://github.com/apache/hudi/issues/5942

   
   **Describe the problem you faced**
   
   Case 1.
   We are currently trying to create a partial upsert pipeline with global 
index (GLOBAL_BLOOM). The issue that we face is that when setting 
HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key() -> "true", we 
notice that **the columns not updated by the partial update are dropped / 
nullified**.
   
   Case 2.
   In addition, as an alternative we are exploring using 
HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key() -> "false". 
However, in this case we notice that while the metadata column 
`_hoodie_partition_path` does not get updated, our partition field does.
   In our unit testing, this means that for the record in question, the columns 
become `_hoodie_partition_path: partitionField=2022-05-07` and `partitionField: 
2022-05-08`. We are wondering whether there are any implications to this. For 
example, if there is any pruning in place on `_hoodie_partition_path`, is a 
record with mismatched partition column info prone to inconsistencies?
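   A quick way to surface such mismatched rows (a hedged sketch, not from our 
test suite; `basePath` is assumed to point at the table's base path):
   ```scala
   import org.apache.spark.sql.functions._

   // Flag rows whose metadata partition path no longer matches the partition
   // column, i.e. the Case 2 situation described above.
   val mismatched = spark.read.format("hudi").load(basePath)
     .filter(col("_hoodie_partition_path") =!=
       concat(lit("partitionField="), col("partitionField").cast("string")))
   mismatched.show(false)
   ```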
   
   **To Reproduce**
   Case 1.
   Define 
   ```
   case class TestHudiTable(keyField: String, stringField: String, numberField: 
Int, precombineField: Timestamp, partitionField: Date)
   val targetGlobalPartition = "2022-05-08"
   
   val insertRecords = Seq(
 TestHudiTable("key1", "value3", 55, Timestamp.valueOf("2022-05-07 
08:00:00"), Date.valueOf("2022-05-07")),
 TestHudiTable("key2", "value4", 66, Timestamp.valueOf("2022-05-07 
09:00:00"), Date.valueOf("2022-05-07")),
 TestHudiTable("key3", "value4", 77, Timestamp.valueOf("2022-05-07 
10:00:00"), Date.valueOf("2022-05-07")))
   
   val insertDF = insertRecords.toDF(keyField, stringField, numberField, 
precombineField, partitionField)
 .withColumn(precombineField, col(precombineField).cast(TimestampType))
 .withColumn(partitionField, to_date(col(partitionField)))
   ```
   Then run an initial insert of these records. Finally, test the partial 
upsert with the following records, using 
org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload:
   ```
   val partialUpdates = Seq(
 ("key1", "value5", "2022-05-07T11:00:00", "2022-05-07"),
 ("key3", "value6", "2022-05-07T12:00:00", targetGlobalPartition)).toDF(
   keyField, stringField, precombineField, partitionField).withColumn(
   precombineField, 
col(precombineField).cast(TimestampType)).withColumn(
   partitionField, to_date(col(partitionField)))
   ```
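   For completeness, a hedged sketch of the upsert we run for this step. The 
option keys are the standard Hudi configs; the table name and `basePath` are 
stand-ins, and `hoodie.bloom.index.update.partition.path` is the flag toggled 
between Case 1 and Case 2:
   ```scala
   val hudiOptions = Map(
     "hoodie.table.name" -> "test_hudi_table", // assumed name
     "hoodie.datasource.write.operation" -> "upsert",
     "hoodie.datasource.write.recordkey.field" -> keyField,
     "hoodie.datasource.write.precombine.field" -> precombineField,
     "hoodie.datasource.write.partitionpath.field" -> partitionField,
     "hoodie.datasource.write.payload.class" ->
       "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload",
     "hoodie.index.type" -> "GLOBAL_BLOOM",
     "hoodie.bloom.index.update.partition.path" -> "true" // "false" for Case 2
   )

   partialUpdates.write.format("hudi").options(hudiOptions).mode("append").save(basePath)
   ```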
   
   Hence, we are testing a partial update that updates most columns except 
numberField, which is left null in the incoming records.
   ```
   **Before partial update to records corresponding to key1 and key3.**
   
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path   |_hoodie_file_name                                                          |keyField|stringField|numberField|precombineField    |partitionField|
   |20220622111807610  |20220622111807610_0_3|keyField:key1     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-220-196_20220622111813282.parquet|key1    |value3     |55         |2022-05-07 08:00:00|2022-05-07    |
   |20220622111807610  |20220622111807610_0_4|keyField:key3     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-220-196_20220622111813282.parquet|key3    |value4     |77         |2022-05-07 10:00:00|2022-05-07    |
   |20220622111813282  |20220622111813282_0_4|keyField:key2     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-220-196_20220622111813282.parquet|key2    |value4     |66         |2022-05-07 09:00:00|2022-05-07    |
   
   **After partial update to key1 and key3, with the latter also updating the 
partition column.**
   
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|keyField|strin

[GitHub] [hudi] yihua commented on a diff in pull request #5941: [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups

2022-06-22 Thread GitBox


yihua commented on code in PR #5941:
URL: https://github.com/apache/hudi/pull/5941#discussion_r904053034


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -973,6 +974,7 @@ Stream<FileSlice> fetchAllFileSlices(String partitionPath) {
*/
   public Stream<HoodieBaseFile> fetchLatestBaseFiles(final String 
partitionPath) {
 return fetchAllStoredFileGroups(partitionPath)
+.filter(fg -> !isFileGroupReplaced(fg))

Review Comment:
   Good catch!  I see that `getLatestBaseFiles(String partitionStr)` filters 
out the replaced file groups.  Should that API be used in the Presto Hive 
connector?  Also, should we audit all similar APIs regarding compaction and 
clustering?
   
   Still, to be on par with `fetchLatestBaseFiles()`, this needs to be fixed 
anyway.
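   To make the pattern concrete, a self-contained, hedged sketch (the types and 
names below are stand-ins, not the actual Hudi classes):
   ```java
   import java.util.List;
   import java.util.Set;
   import java.util.stream.Collectors;
   import java.util.stream.Stream;

   // Stand-in illustration of the one-line fix: drop file groups that were replaced
   // (e.g. by clustering or insert_overwrite) before picking each group's latest base file.
   class ReplacedFileGroupFilterSketch {
     record FileGroup(String id, String latestBaseFile) {}

     static List<String> latestBaseFiles(Stream<FileGroup> allGroups, Set<String> replacedIds) {
       return allGroups
           .filter(fg -> !replacedIds.contains(fg.id())) // analogous to !isFileGroupReplaced(fg)
           .map(FileGroup::latestBaseFile)
           .collect(Collectors.toList());
     }

     public static void main(String[] args) {
       Stream<FileGroup> groups =
           Stream.of(new FileGroup("fg1", "f1.parquet"), new FileGroup("fg2", "f2.parquet"));
       System.out.println(latestBaseFiles(groups, Set.of("fg2"))); // prints [f1.parquet]
     }
   }
   ```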
   






[GitHub] [hudi] hudi-bot commented on pull request #5941: [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups

2022-06-22 Thread GitBox


hudi-bot commented on PR #5941:
URL: https://github.com/apache/hudi/pull/5941#issuecomment-1163414156

   
   ## CI report:
   
   * 0bf33ad19c2a2e7ab9f0473854029d2adfeae974 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9459)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #5890: [HUDI-4273] Support inline schedule clustering for Flink stream

2022-06-22 Thread GitBox


hudi-bot commented on PR #5890:
URL: https://github.com/apache/hudi/pull/5890#issuecomment-1163413985

   
   ## CI report:
   
   * 08e1fa6f7820b82180d3c0352c1f92f2b4fe2c6a UNKNOWN
   * f848aee6edc047f633744d272a88f079bcf23adf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9458)
 
   
   





[GitHub] [hudi] xushiyan commented on a diff in pull request #5627: [HUDI-3350][HUDI-3351] Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-06-22 Thread GitBox


xushiyan commented on code in PR #5627:
URL: https://github.com/apache/hudi/pull/5627#discussion_r903936471


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerge.java:
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.metadata.HoodieMetadataPayload;
+
+import java.io.IOException;
+import java.util.Properties;
+
+import static org.apache.hudi.TypeUtils.unsafeCast;
+
+public class HoodieAvroRecordMerge implements HoodieMerge {
+  @Override
+  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
+HoodieRecordPayload picked = unsafeCast(((HoodieAvroRecord) 
newer).getData().preCombine(((HoodieAvroRecord) older).getData()));
+if (picked instanceof HoodieMetadataPayload) {
+  // NOTE: HoodieMetadataPayload return a new payload
+  return new HoodieAvroRecord(newer.getKey(), ((HoodieMetadataPayload) 
picked), newer.getOperation());
+}
+return picked.equals(((HoodieAvroRecord) newer).getData()) ? newer : older;
+  }
+
+  @Override
+  public Option<IndexedRecord> combineAndGetUpdateValue(HoodieRecord older, 
HoodieRecord newer, Schema schema, Properties props) throws IOException {
+    Option<IndexedRecord> previousRecordAvroPayload;
+if (older instanceof HoodieAvroIndexedRecord) {
+  previousRecordAvroPayload = Option.of(((HoodieAvroIndexedRecord) 
older).getData());

Review Comment:
   can `getData()` return null? `ofNullable()` looks safer
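   A tiny runnable illustration of why that reads safer, using 
`java.util.Optional` (which Hudi's `Option` broadly mirrors); the null payload 
is hypothetical:
   ```java
   import java.util.Optional;

   // Optional.of(...) throws on a null payload, while ofNullable(...) degrades
   // to an empty Optional instead of failing the write path.
   public class OfNullableSketch {
     public static void main(String[] args) {
       Object data = null; // stand-in for ((HoodieAvroIndexedRecord) older).getData()
       System.out.println(Optional.ofNullable(data).isPresent()); // false, no exception
       try {
         Optional.of(data);
       } catch (NullPointerException e) {
         System.out.println("Optional.of(null) throws NPE"); // why ofNullable looks safer
       }
     }
   }
   ```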



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieWriteHelper.java:
##
@@ -58,10 +58,8 @@ public HoodieData<HoodieRecord<T>> deduplicateRecords(
   return Pair.of(key, record);
 }).reduceByKey((rec1, rec2) -> {
   @SuppressWarnings("unchecked")
-  HoodieRecord reducedRec = rec2.preCombine(rec1);
-  HoodieKey reducedKey = rec1.getData().equals(reducedRec) ? rec1.getKey() 
: rec2.getKey();
-
-  return (HoodieRecord) reducedRec.newInstance(reducedKey);
+  HoodieRecord reducedRecord =  hoodieMerge.preCombine(rec1, rec2);
+  return reducedRecord.newInstance();

Review Comment:
   can you clarify the purpose of `newInstance()` pls?



##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecordMerge.java:
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.metadata.HoodieMetadataPayload;
+
+import java.io.IOException;
+import java.util.Properties;
+
+import static org.apache.hudi.TypeUtils.unsafeCast;
+
+public class HoodieAvroRecordMerge implements HoodieMerge {
+  @Override
+  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
+HoodieRecordPayload picked = unsafeCast(((HoodieAvroRecord) 
newer).getData().preCombine(((HoodieAvroRecord) older).getData()));
+if (picked instanceof HoodieMetadataPayload) {
+  // NOTE: HoodieMetadataPayload return a new payload
+  return new HoodieAvroRecord(newer.getKey(), ((HoodieMetadataPayload) 
picked), newer.getOperation());
+}
+return picked.equals(((HoodieAvroRecord) newer).getData()) ? newer : older;
+

[GitHub] [hudi] kasured commented on issue #5843: [SUPPORT] Hoodie can request and complete commits far in the future on its timeline

2022-06-22 Thread GitBox


kasured commented on issue #5843:
URL: https://github.com/apache/hudi/issues/5843#issuecomment-1163331800

   @nsivabalan today we saw a case where not only was a future commit created, 
but the commit timestamp is also an invalid date, 20220631 (June 31st does not 
exist):
   
   2022-06-16 02:31:03   2443 20220631003058.clean
   2022-06-16 02:31:01   2417 20220631003058.clean.inflight
   2022-06-16 02:31:00   2417 20220631003058.clean.requested
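   For what it's worth, instants of that shape can be caught with strict date 
parsing (a hedged sketch, not Hudi's own validation; the instant times here 
follow the 14-digit `yyyyMMddHHmmss` pattern):
   ```java
   import java.time.LocalDateTime;
   import java.time.format.DateTimeFormatter;
   import java.time.format.DateTimeParseException;
   import java.time.format.ResolverStyle;

   // Strict parsing rejects calendar-impossible instants such as 20220631003058 (June 31st).
   public class InstantTimeCheck {
     private static final DateTimeFormatter STRICT =
         DateTimeFormatter.ofPattern("uuuuMMddHHmmss").withResolverStyle(ResolverStyle.STRICT);

     public static boolean isValidInstantTime(String instantTime) {
       try {
         LocalDateTime.parse(instantTime, STRICT);
         return true;
       } catch (DateTimeParseException e) {
         return false;
       }
     }

     public static void main(String[] args) {
       System.out.println(isValidInstantTime("20220631003058")); // false (June 31st)
       System.out.println(isValidInstantTime("20220616023103")); // true
     }
   }
   ```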
   





[GitHub] [hudi] Rap70r commented on issue #5770: [SUPPORT] hoodie.parquet.max.file.size Property is Being Ignored

2022-06-22 Thread GitBox


Rap70r commented on issue #5770:
URL: https://github.com/apache/hudi/issues/5770#issuecomment-1163266839

   Hi @YuweiXiao, I'm not familiar with the clustering config you mentioned. 
Can you please provide details?





[GitHub] [hudi] hudi-bot commented on pull request #5890: [HUDI-4273] Support inline schedule clustering for Flink stream

2022-06-22 Thread GitBox


hudi-bot commented on PR #5890:
URL: https://github.com/apache/hudi/pull/5890#issuecomment-1163260514

   
   ## CI report:
   
   * 08e1fa6f7820b82180d3c0352c1f92f2b4fe2c6a UNKNOWN
   * 6f7e0eb38269361a49694e5047df673edf01e538 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9456)
 
   * f848aee6edc047f633744d272a88f079bcf23adf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9458)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #5940: [HUDI-4300] Add sync clean and archive for compaction service in Spark Env

2022-06-22 Thread GitBox


hudi-bot commented on PR #5940:
URL: https://github.com/apache/hudi/pull/5940#issuecomment-1163253543

   
   ## CI report:
   
   * 97ad7d11493c21337de15849d02d1ebe8737b65b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9457)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #5828: [HUDI-4219] Merge Into when update expression "col=s.col+2" on precombine cause exception

2022-06-22 Thread GitBox


hudi-bot commented on PR #5828:
URL: https://github.com/apache/hudi/pull/5828#issuecomment-1163253126

   
   ## CI report:
   
   * 269bd2bb902f08111903504134f1419c7320e9b0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9455)
 
   
   





[jira] [Assigned] (HUDI-4291) Test TestCleanPlanExecutor.testKeepLatestFileVersions is flaky

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-4291:
-

Assignee: Sagar Sumit

> Test TestCleanPlanExecutor.testKeepLatestFileVersions is flaky
> --
>
> Key: HUDI-4291
> URL: https://issues.apache.org/jira/browse/HUDI-4291
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Danny Chen
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.12.0
>
>
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9418/logs/33]
>  
>  
> https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9413/logs/36





[jira] [Closed] (HUDI-4011) Add a Hudi AWS bundle

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-4011.
-
Resolution: Fixed

> Add a Hudi AWS bundle
> -
>
> Key: HUDI-4011
> URL: https://issues.apache.org/jira/browse/HUDI-4011
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Udit Mehrotra
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> As was raised in [https://github.com/apache/hudi/issues/5451,] the Hudi AWS 
> jars were moved out of hudi-spark-bundle. Hence, customers need to manually 
> pass jars like the DynamoDB lock client and the DynamoDB AWS SDK to be able 
> to use the DynamoDB lock provider implementation.
> We need an AWS-specific bundle that packages these dependencies to make it 
> easier for customers. They can use this bundle along with hudi-spark-bundle 
> when they need to use the DynamoDB lock provider.





[jira] [Updated] (HUDI-4029) test out different lock providers using our integ test infra

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4029:
--
Status: In Progress  (was: Open)

> test out different lock providers using our integ test infra
> 
>
> Key: HUDI-4029
> URL: https://issues.apache.org/jira/browse/HUDI-4029
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.12.0
>
>






[jira] [Updated] (HUDI-3991) Provide bundle jar options in each e2e test pipeline

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3991:
--
Status: Patch Available  (was: In Progress)

> Provide bundle jar options in each e2e test pipeline
> 
>
> Key: HUDI-3991
> URL: https://issues.apache.org/jira/browse/HUDI-3991
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Make integ test bundle slim and run tests w/ actual bundles





[jira] [Closed] (HUDI-2202) Add Trino to Docker Demo

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-2202.
-
Resolution: Done

It can be closed. We have Trino set up in the docker demo: 
https://hudi.apache.org/docs/docker_demo/#step-4-d-run-trino-queries

> Add Trino to Docker Demo
> 
>
> Key: HUDI-2202
> URL: https://issues.apache.org/jira/browse/HUDI-2202
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Minor
>






[jira] [Closed] (HUDI-2707) Perf test snapshot queries for COW table

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-2707.
-
Resolution: Done

Based on perf tests, snapshot query performance of COW tables is on par with 
parquet tables.

> Perf test snapshot queries for COW table
> 
>
> Key: HUDI-2707
> URL: https://issues.apache.org/jira/browse/HUDI-2707
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>






[jira] [Closed] (HUDI-2158) Upstream support for MOR tables.

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-2158.
-
Resolution: Won't Fix

> Upstream support for MOR tables.
> 
>
> Key: HUDI-2158
> URL: https://issues.apache.org/jira/browse/HUDI-2158
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: Sagar Sumit
>Priority: Major
>






[jira] [Commented] (HUDI-2158) Upstream support for MOR tables.

2022-06-22 Thread Sagar Sumit (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557502#comment-17557502
 ] 

Sagar Sumit commented on HUDI-2158:
---

There is a patch for MoR table support in hive connector: 
[https://github.com/trinodb/trino/pull/9641]

However, we will take up all new dev work in the new hudi connector: 
[https://github.com/trinodb/trino/pull/10228]

Closing this ticket. 

> Upstream support for MOR tables.
> 
>
> Key: HUDI-2158
> URL: https://issues.apache.org/jira/browse/HUDI-2158
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: Sagar Sumit
>Priority: Major
>






[GitHub] [hudi] minihippo commented on a diff in pull request #5629: [WIP][HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-06-22 Thread GitBox


minihippo commented on code in PR #5629:
URL: https://github.com/apache/hudi/pull/5629#discussion_r903861647


##
hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java:
##
@@ -436,14 +444,18 @@ public static GenericRecord removeFields(GenericRecord 
record, List<String> fiel
 
   private static void copyOldValueOrSetDefault(GenericRecord oldRecord, 
GenericRecord newRecord, Schema.Field field) {
 Schema oldSchema = oldRecord.getSchema();
-Object fieldValue = oldSchema.getField(field.name()) == null ? null : 
oldRecord.get(field.name());
+Field oldSchemaField = oldSchema.getField(field.name());
+Object fieldValue = oldSchemaField == null ? null : 
oldRecord.get(field.name());

Review Comment:
   final name = null?






[jira] [Updated] (HUDI-3853) Integ Tests running against Spark3

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3853:
--
Priority: Blocker  (was: Major)

> Integ Tests running against Spark3
> --
>
> Key: HUDI-3853
> URL: https://issues.apache.org/jira/browse/HUDI-3853
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rahil Chertara
>Priority: Blocker
> Fix For: 0.12.0
>
>






[jira] [Updated] (HUDI-1891) Jetty Dependency conflict when upgrade to hive3.1.1 and hadoop3.0.0

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1891:
--
Priority: Blocker  (was: Critical)

> Jetty Dependency conflict when upgrade to hive3.1.1 and hadoop3.0.0
> ---
>
> Key: HUDI-1891
> URL: https://issues.apache.org/jira/browse/HUDI-1891
> Project: Apache Hudi
>  Issue Type: Task
>  Components: dependencies
>Reporter: shenbing
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> When packaging hudi 0.7.0 or 0.9.0-SNAPSHOT using 
> {code:java}
> mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true 
> -Drat.skip=true -Dhadoop.version=3.0.0  -Dhive.version=3.1.1{code}
> and then importing hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar into my project, I 
> got an error:
>  
> {code:java}
> java.lang.NoSuchMethodError: 
> org.apache.hudi.org.apache.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> at 
> io.javalin.core.util.JettyServerUtil.defaultSessionHandler(JettyServerUtil.kt:50)
> at io.javalin.Javalin.<init>(Javalin.java:94)
> at io.javalin.Javalin.create(Javalin.java:107)
> at 
> org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:156)
> at 
> org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:88)
> at 
> org.apache.hudi.client.embedded.EmbeddedTimelineServerHelper.createEmbeddedTimelineService(EmbeddedTimelineServerHelper.java:56)
> at 
> org.apache.hudi.client.AbstractHoodieClient.startEmbeddedServerView(AbstractHoodieClient.java:109)
> at 
> org.apache.hudi.client.AbstractHoodieClient.<init>(AbstractHoodieClient.java:77)
> at 
> org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:132)
> at 
> org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:120)
> at 
> org.apache.hudi.client.SparkRDDWriteClient.<init>(SparkRDDWriteClient.java:84)
> {code}
>  
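One way to confirm which Jetty the build actually resolves before and after 
switching to hadoop 3.0.0 / hive 3.1.1 (a suggestion, not from the ticket) is 
Maven's dependency tree, run from the module that produces the bundle:

    mvn dependency:tree -Dincludes=org.eclipse.jetty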





[jira] [Updated] (HUDI-1891) Jetty Dependency conflict when upgrade to hive3.1.1 and hadoop3.0.0

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1891:
--
Status: In Progress  (was: Open)

> Jetty Dependency conflict when upgrade to hive3.1.1 and hadoop3.0.0
> ---
>
> Key: HUDI-1891
> URL: https://issues.apache.org/jira/browse/HUDI-1891
> Project: Apache Hudi
>  Issue Type: Task
>  Components: dependencies
>Reporter: shenbing
>Priority: Critical
>  Labels: pull-request-available
>
> When packaging hudi 0.7.0 or 0.9.0-SNAPSHOT using 
> {code:java}
> mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true 
> -Drat.skip=true -Dhadoop.version=3.0.0  -Dhive.version=3.1.1{code}
> and then importing hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar into my project, I 
> got an error:
>  
> {code:java}
> java.lang.NoSuchMethodError: 
> org.apache.hudi.org.apache.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> at 
> io.javalin.core.util.JettyServerUtil.defaultSessionHandler(JettyServerUtil.kt:50)
> at io.javalin.Javalin.<init>(Javalin.java:94)
> at io.javalin.Javalin.create(Javalin.java:107)
> at 
> org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:156)
> at 
> org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:88)
> at 
> org.apache.hudi.client.embedded.EmbeddedTimelineServerHelper.createEmbeddedTimelineService(EmbeddedTimelineServerHelper.java:56)
> at 
> org.apache.hudi.client.AbstractHoodieClient.startEmbeddedServerView(AbstractHoodieClient.java:109)
> at 
> org.apache.hudi.client.AbstractHoodieClient.<init>(AbstractHoodieClient.java:77)
> at 
> org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:132)
> at 
> org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:120)
> at 
> org.apache.hudi.client.SparkRDDWriteClient.<init>(SparkRDDWriteClient.java:84)
> {code}
>  





[jira] [Updated] (HUDI-1891) Jetty Dependency conflict when upgrade to hive3.1.1 and hadoop3.0.0

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1891:
--
Fix Version/s: 0.12.0

> Jetty Dependency conflict when upgrade to hive3.1.1 and hadoop3.0.0
> ---
>
> Key: HUDI-1891
> URL: https://issues.apache.org/jira/browse/HUDI-1891
> Project: Apache Hudi
>  Issue Type: Task
>  Components: dependencies
>Reporter: shenbing
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> When packaging hudi 0.7.0 or 0.9.0-SNAPSHOT using 
> {code:java}
> mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true 
> -Drat.skip=true -Dhadoop.version=3.0.0  -Dhive.version=3.1.1{code}
> and then importing hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar into my project, I 
> got an error:
>  
> {code:java}
> java.lang.NoSuchMethodError: 
> org.apache.hudi.org.apache.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> at 
> io.javalin.core.util.JettyServerUtil.defaultSessionHandler(JettyServerUtil.kt:50)
> at io.javalin.Javalin.<init>(Javalin.java:94)
> at io.javalin.Javalin.create(Javalin.java:107)
> at 
> org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:156)
> at 
> org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:88)
> at 
> org.apache.hudi.client.embedded.EmbeddedTimelineServerHelper.createEmbeddedTimelineService(EmbeddedTimelineServerHelper.java:56)
> at 
> org.apache.hudi.client.AbstractHoodieClient.startEmbeddedServerView(AbstractHoodieClient.java:109)
> at 
> org.apache.hudi.client.AbstractHoodieClient.<init>(AbstractHoodieClient.java:77)
> at 
> org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:132)
> at 
> org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:120)
> at 
> org.apache.hudi.client.SparkRDDWriteClient.<init>(SparkRDDWriteClient.java:84)
> {code}
>  





[jira] [Updated] (HUDI-1891) Jetty Dependency conflict when upgrade to hive3.1.1 and hadoop3.0.0

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1891:
--
Status: Patch Available  (was: In Progress)

> Jetty Dependency conflict when upgrade to hive3.1.1 and hadoop3.0.0
> ---
>
> Key: HUDI-1891
> URL: https://issues.apache.org/jira/browse/HUDI-1891
> Project: Apache Hudi
>  Issue Type: Task
>  Components: dependencies
>Reporter: shenbing
>Priority: Critical
>  Labels: pull-request-available
>
> When packaging hudi 0.7.0 or 0.9.0-SNAPSHOT using 
> {code:java}
> mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true 
> -Drat.skip=true -Dhadoop.version=3.0.0  -Dhive.version=3.1.1{code}
> and then importing hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar into my project, I 
> got an error:
>  
> {code:java}
> java.lang.NoSuchMethodError: 
> org.apache.hudi.org.apache.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> at 
> io.javalin.core.util.JettyServerUtil.defaultSessionHandler(JettyServerUtil.kt:50)
> at io.javalin.Javalin.<init>(Javalin.java:94)
> at io.javalin.Javalin.create(Javalin.java:107)
> at 
> org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:156)
> at 
> org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:88)
> at 
> org.apache.hudi.client.embedded.EmbeddedTimelineServerHelper.createEmbeddedTimelineService(EmbeddedTimelineServerHelper.java:56)
> at 
> org.apache.hudi.client.AbstractHoodieClient.startEmbeddedServerView(AbstractHoodieClient.java:109)
> at 
> org.apache.hudi.client.AbstractHoodieClient.<init>(AbstractHoodieClient.java:77)
> at 
> org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:132)
> at 
> org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:120)
> at 
> org.apache.hudi.client.SparkRDDWriteClient.<init>(SparkRDDWriteClient.java:84)
> {code}
>  





[jira] [Updated] (HUDI-2955) Upgrade Hadoop to 3.3.x

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2955:
--
Status: Patch Available  (was: In Progress)

> Upgrade Hadoop to 3.3.x
> ---
>
> Key: HUDI-2955
> URL: https://issues.apache.org/jira/browse/HUDI-2955
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Rahil Chertara
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2021-12-07 at 2.32.51 PM.png
>
>
> According to Hadoop compatibility matrix, this is a pre-requisite to 
> upgrading to JDK11:
> !Screen Shot 2021-12-07 at 2.32.51 PM.png|width=938,height=230!
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions]
>  
> *Upgrading Hadoop from 2.x to 3.x*
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.x+to+3.x+Upgrade+Efforts]
> Everything (relevant to us) seems to be in good shape, except Spark 2.2/2.3





[jira] [Closed] (HUDI-3911) Async indexer blog for 0.11 release

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-3911.
-
Resolution: Done

> Async indexer blog for 0.11 release
> ---
>
> Key: HUDI-3911
> URL: https://issues.apache.org/jira/browse/HUDI-3911
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Closed] (HUDI-4150) Doc updates for 0.11.1

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-4150.
-
Resolution: Done

> Doc updates for 0.11.1
> --
>
> Key: HUDI-4150
> URL: https://issues.apache.org/jira/browse/HUDI-4150
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: Bhavani Sudha Saktheeswaran
>Priority: Major
> Fix For: 0.11.0
>
>
> Add comments for missing documentation on our website. It could be FAQ, 
> configurations, or website docs. 





[jira] [Updated] (HUDI-4150) Doc updates for 0.11.1

2022-06-22 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4150:
--
Fix Version/s: 0.11.0

> Doc updates for 0.11.1
> --
>
> Key: HUDI-4150
> URL: https://issues.apache.org/jira/browse/HUDI-4150
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: Bhavani Sudha Saktheeswaran
>Priority: Major
> Fix For: 0.11.0
>
>
> Add comments for missing documentation on our website. It could be FAQ, 
> configurations, or website docs. 





[jira] [Created] (HUDI-4301) Detect HoodieCombine type with engine type automatically instead of the default avro-based one

2022-06-22 Thread Frank Wong (Jira)
Frank Wong created HUDI-4301:


 Summary: Detect HoodieCombine type with engine type automatically 
instead of the default avro-based one
 Key: HUDI-4301
 URL: https://issues.apache.org/jira/browse/HUDI-4301
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Frank Wong







