[GitHub] [hudi] wangxianghu commented on a change in pull request #2431: [HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field

2021-01-25 Thread GitBox


wangxianghu commented on a change in pull request #2431:
URL: https://github.com/apache/hudi/pull/2431#discussion_r563537637



##
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -181,16 +183,33 @@ object DataSourceWriteOptions {
   @Deprecated
   val DEFAULT_STORAGE_TYPE_OPT_VAL = COW_STORAGE_TYPE_OPT_VAL
 
-  def translateStorageTypeToTableType(optParams: Map[String, String]) : Map[String, String] = {
+  def translateOptParams(optParams: Map[String, String]): Map[String, String] = {
+    // translate StorageType to TableType
+    var newOptParams = optParams
     if (optParams.contains(STORAGE_TYPE_OPT_KEY) && !optParams.contains(TABLE_TYPE_OPT_KEY)) {
       log.warn(STORAGE_TYPE_OPT_KEY + " is deprecated and will be removed in a later release; Please use " + TABLE_TYPE_OPT_KEY)
-      optParams ++ Map(TABLE_TYPE_OPT_KEY -> optParams(STORAGE_TYPE_OPT_KEY))
-    } else {
-      optParams
+      newOptParams = optParams ++ Map(TABLE_TYPE_OPT_KEY -> optParams(STORAGE_TYPE_OPT_KEY))
     }
+    // translate the api partitionBy of spark DataFrameWriter to PARTITIONPATH_FIELD_OPT_KEY
+    if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY) && !optParams.contains(PARTITIONPATH_FIELD_OPT_KEY)) {
+      val partitionColumns = optParams.get(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)
+        .map(SparkDataSourceUtils.decodePartitioningColumns)
+        .getOrElse(Nil)
+
+      val keyGeneratorClass = optParams.getOrElse(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY,
+        DataSourceWriteOptions.DEFAULT_KEYGENERATOR_CLASS_OPT_VAL)
+      val partitionPathField =
+        keyGeneratorClass match {
+          case "org.apache.hudi.keygen.CustomKeyGenerator" =>
+            partitionColumns.map(e => s"$e:SIMPLE").mkString(",")

Review comment:
   we cannot simply append `SIMPLE` to every `partitionBy` field. When the user uses 
   `CustomKeyGenerator` and the partition path field is of timestamp type, the string 
   after the `partitionBy` field should be `TIMESTAMP`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] kirkuz commented on issue #2323: [SUPPORT] GLOBAL_BLOOM index significantly slowing down processing time

2021-01-25 Thread GitBox


kirkuz commented on issue #2323:
URL: https://github.com/apache/hudi/issues/2323#issuecomment-766649165


   Hi @nsivabalan,
   
   I think we can close this issue for now. I've switched from GLOBAL_BLOOM to the SIMPLE 
   index with static partition keys, because GLOBAL_BLOOM was too slow in my use case. 
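   
   For reference, the index type is a single write option; a minimal sketch (the DataFrame, key/partition fields, table name and path are placeholders, not taken from this issue):
   
   ```scala
   // Minimal sketch, assuming an existing DataFrame `df`.
   df.write.format("hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.partitionpath.field", "dt")
     .option("hoodie.index.type", "SIMPLE") // previously GLOBAL_BLOOM
     .mode("append")
     .save("s3://bucket/path/my_table")
   ```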



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-25 Thread GitBox


codecov-io edited a comment on pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#issuecomment-757736411


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2430?src=pr&el=h1) Report
   > Merging 
[#2430](https://codecov.io/gh/apache/hudi/pull/2430?src=pr&el=desc) (2c4fa32) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/c4afd179c1983a382b8a5197d800b0f5dba254de?el=desc)
 (c4afd17) will **increase** coverage by `2.21%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2430/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2430?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2430      +/-   ##
   ============================================
   + Coverage     50.18%   52.39%    +2.21%
   + Complexity     3050      552     -2498
   ============================================
     Files           419       92      -327
     Lines         18931     4096    -14835
     Branches       1948      480     -1468
   ============================================
   - Hits           9500     2146     -7354
   + Misses         8656     1751     -6905
   + Partials        775      199      -576
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.21% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.43% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2430?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...udi/common/table/timeline/dto/CompactionOpDTO.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9Db21wYWN0aW9uT3BEVE8uamF2YQ==)
 | | | |
   | 
[...3/internal/HoodieBulkInsertDataInternalWriter.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmszL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3NwYXJrMy9pbnRlcm5hbC9Ib29kaWVCdWxrSW5zZXJ0RGF0YUludGVybmFsV3JpdGVyLmphdmE=)
 | | | |
   | 
[...i/common/table/timeline/HoodieDefaultTimeline.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZURlZmF1bHRUaW1lbGluZS5qYXZh)
 | | | |
   | 
[.../java/org/apache/hudi/common/util/HoodieTimer.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvSG9vZGllVGltZXIuamF2YQ==)
 | | | |
   | 
[...able/timeline/versioning/AbstractMigratorBase.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvQWJzdHJhY3RNaWdyYXRvckJhc2UuamF2YQ==)
 | | | |
   | 
[...common/table/log/HoodieMergedLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVNZXJnZWRMb2dSZWNvcmRTY2FubmVyLmphdmE=)
 | | | |
   | 
[...i/bootstrap/SparkParquetBootstrapDataProvider.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYm9vdHN0cmFwL1NwYXJrUGFycXVldEJvb3RzdHJhcERhdGFQcm92aWRlci5qYXZh)
 | | | |
   | 
[...rg/apache/hudi/hadoop/HoodieROTablePathFilter.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0hvb2RpZVJPVGFibGVQYXRoRmlsdGVyLmphdmE=)
 | | | |
   | 
[...rg/apache/hudi/common/model/HoodieFileGroupId.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUZpbGVHcm91cElkLmphdmE=)
 | | | |
   | 
[...pache/hudi/hadoop/HoodieColumnProjectionUtils.java](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0hvb2RpZUNvbHVtblByb2plY3Rpb25VdGlscy5qYXZh)
 | | | |
   | ... and [307 
more](https://codecov.io/gh/apache/hudi/pull/2430/diff?src=pr&el=tree-more) | |
   



This is an automated message from the Apache Git Service.
To respond to

[GitHub] [hudi] codecov-io edited a comment on pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-25 Thread GitBox


codecov-io edited a comment on pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#issuecomment-757736411







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] teeyog commented on a change in pull request #2431: [HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field

2021-01-25 Thread GitBox


teeyog commented on a change in pull request #2431:
URL: https://github.com/apache/hudi/pull/2431#discussion_r563598187



##
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -181,16 +183,33 @@ object DataSourceWriteOptions {
   @Deprecated
   val DEFAULT_STORAGE_TYPE_OPT_VAL = COW_STORAGE_TYPE_OPT_VAL
 
-  def translateStorageTypeToTableType(optParams: Map[String, String]) : Map[String, String] = {
+  def translateOptParams(optParams: Map[String, String]): Map[String, String] = {
+    // translate StorageType to TableType
+    var newOptParams = optParams
     if (optParams.contains(STORAGE_TYPE_OPT_KEY) && !optParams.contains(TABLE_TYPE_OPT_KEY)) {
       log.warn(STORAGE_TYPE_OPT_KEY + " is deprecated and will be removed in a later release; Please use " + TABLE_TYPE_OPT_KEY)
-      optParams ++ Map(TABLE_TYPE_OPT_KEY -> optParams(STORAGE_TYPE_OPT_KEY))
-    } else {
-      optParams
+      newOptParams = optParams ++ Map(TABLE_TYPE_OPT_KEY -> optParams(STORAGE_TYPE_OPT_KEY))
     }
+    // translate the api partitionBy of spark DataFrameWriter to PARTITIONPATH_FIELD_OPT_KEY
+    if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY) && !optParams.contains(PARTITIONPATH_FIELD_OPT_KEY)) {
+      val partitionColumns = optParams.get(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)
+        .map(SparkDataSourceUtils.decodePartitioningColumns)
+        .getOrElse(Nil)
+
+      val keyGeneratorClass = optParams.getOrElse(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY,
+        DataSourceWriteOptions.DEFAULT_KEYGENERATOR_CLASS_OPT_VAL)
+      val partitionPathField =
+        keyGeneratorClass match {
+          case "org.apache.hudi.keygen.CustomKeyGenerator" =>
+            partitionColumns.map(e => s"$e:SIMPLE").mkString(",")

Review comment:
   @wangxianghu Thank you for your review. My opinion is this: in keeping with the usual 
   Spark behavior, the partition field value corresponding to `partitionBy` is the original 
   column value, so `SIMPLE` is used by default. If we tried to automatically infer 
   `TIMESTAMP` from the field type, the rules would be hard to pin down. For example, if a 
   field is a long, should it be converted to `TIMESTAMP`? If we convert it but the value 
   is not actually a timestamp, an error will be reported. So `SIMPLE` is used by default; 
   if users want `TIMESTAMP`, they can specify it directly via 
   `hoodie.datasource.write.partitionpath.field`.
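   
   For readers of this thread, a minimal sketch of that explicit opt-in (the column name, date format, table name and path are made up for illustration; the two timestamp properties are the TimestampBasedKeyGenerator settings that CustomKeyGenerator reads):
   
   ```scala
   // Sketch only: request a TIMESTAMP partition path explicitly instead of relying on partitionBy.
   df.write.format("hudi")
     .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator")
     .option("hoodie.datasource.write.partitionpath.field", "event_ts:TIMESTAMP") // CustomKeyGenerator's field:TYPE syntax
     .option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS")
     .option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd")
     .option("hoodie.table.name", "my_table")
     .mode("append")
     .save("/tmp/hudi/my_table")
   ```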





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #2431: [HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field

2021-01-25 Thread GitBox


wangxianghu commented on a change in pull request #2431:
URL: https://github.com/apache/hudi/pull/2431#discussion_r563623937



##
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -181,16 +183,33 @@ object DataSourceWriteOptions {
   @Deprecated
   val DEFAULT_STORAGE_TYPE_OPT_VAL = COW_STORAGE_TYPE_OPT_VAL
 
-  def translateStorageTypeToTableType(optParams: Map[String, String]) : Map[String, String] = {
+  def translateOptParams(optParams: Map[String, String]): Map[String, String] = {
+    // translate StorageType to TableType
+    var newOptParams = optParams
     if (optParams.contains(STORAGE_TYPE_OPT_KEY) && !optParams.contains(TABLE_TYPE_OPT_KEY)) {
       log.warn(STORAGE_TYPE_OPT_KEY + " is deprecated and will be removed in a later release; Please use " + TABLE_TYPE_OPT_KEY)
-      optParams ++ Map(TABLE_TYPE_OPT_KEY -> optParams(STORAGE_TYPE_OPT_KEY))
-    } else {
-      optParams
+      newOptParams = optParams ++ Map(TABLE_TYPE_OPT_KEY -> optParams(STORAGE_TYPE_OPT_KEY))
     }
+    // translate the api partitionBy of spark DataFrameWriter to PARTITIONPATH_FIELD_OPT_KEY
+    if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY) && !optParams.contains(PARTITIONPATH_FIELD_OPT_KEY)) {
+      val partitionColumns = optParams.get(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)
+        .map(SparkDataSourceUtils.decodePartitioningColumns)
+        .getOrElse(Nil)
+
+      val keyGeneratorClass = optParams.getOrElse(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY,
+        DataSourceWriteOptions.DEFAULT_KEYGENERATOR_CLASS_OPT_VAL)
+      val partitionPathField =
+        keyGeneratorClass match {
+          case "org.apache.hudi.keygen.CustomKeyGenerator" =>
+            partitionColumns.map(e => s"$e:SIMPLE").mkString(",")

Review comment:
   > @wangxianghu Thank you for your review. My opinion is this: in keeping with the usual 
   > Spark behavior, the partition field value corresponding to `partitionBy` is the original 
   > column value, so `SIMPLE` is used by default. If we tried to automatically infer 
   > `TIMESTAMP` from the field type, the rules would be hard to pin down. For example, if a 
   > field is a long, should it be converted to `TIMESTAMP`? If we convert it but the value 
   > is not actually a timestamp, an error will be reported. So `SIMPLE` is used by default; 
   > if users want `TIMESTAMP`, they can specify it directly via 
   > `hoodie.datasource.write.partitionpath.field`.
   
   Yes, I get your point. We'd better support both `SIMPLE` and `TIMESTAMP` type 
   partition paths in a unified way.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] quitozang opened a new pull request #2486: Filtering abnormal data which the recordKeyField or precombineField is null in avro format

2021-01-25 Thread GitBox


quitozang opened a new pull request #2486:
URL: https://github.com/apache/hudi/pull/2486


   Filter out abnormal records whose recordKeyField or precombineField is null in Avro format
   
   
   ## What is the purpose of the pull request
   
   If the recordKey field or precombine field of the incoming data is null, 
   the DeltaStreamer program will fail with an error
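   
   As a rough illustration of the intended guard (sketched outside Hudi; "record_key" and "precombine_ts" are placeholder column names standing in for whatever recordkey/precombine fields are configured):
   
   ```scala
   import org.apache.spark.sql.DataFrame
   import org.apache.spark.sql.functions.col
   
   // Hypothetical pre-write filter: drop rows whose record key or precombine field is null
   // so that the downstream Hudi write never sees them.
   def dropAbnormalRows(df: DataFrame): DataFrame =
     df.filter(col("record_key").isNotNull && col("precombine_ts").isNotNull)
   ```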



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] cadl closed issue #2063: [SUPPORT] change column type from int to long, schema compatibility check failed

2021-01-25 Thread GitBox


cadl closed issue #2063:
URL: https://github.com/apache/hudi/issues/2063


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] teeyog commented on a change in pull request #2431: [HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field

2021-01-25 Thread GitBox


teeyog commented on a change in pull request #2431:
URL: https://github.com/apache/hudi/pull/2431#discussion_r563665044



##
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -181,16 +183,33 @@ object DataSourceWriteOptions {
   @Deprecated
   val DEFAULT_STORAGE_TYPE_OPT_VAL = COW_STORAGE_TYPE_OPT_VAL
 
-  def translateStorageTypeToTableType(optParams: Map[String, String]) : Map[String, String] = {
+  def translateOptParams(optParams: Map[String, String]): Map[String, String] = {
+    // translate StorageType to TableType
+    var newOptParams = optParams
     if (optParams.contains(STORAGE_TYPE_OPT_KEY) && !optParams.contains(TABLE_TYPE_OPT_KEY)) {
       log.warn(STORAGE_TYPE_OPT_KEY + " is deprecated and will be removed in a later release; Please use " + TABLE_TYPE_OPT_KEY)
-      optParams ++ Map(TABLE_TYPE_OPT_KEY -> optParams(STORAGE_TYPE_OPT_KEY))
-    } else {
-      optParams
+      newOptParams = optParams ++ Map(TABLE_TYPE_OPT_KEY -> optParams(STORAGE_TYPE_OPT_KEY))
     }
+    // translate the api partitionBy of spark DataFrameWriter to PARTITIONPATH_FIELD_OPT_KEY
+    if (optParams.contains(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY) && !optParams.contains(PARTITIONPATH_FIELD_OPT_KEY)) {
+      val partitionColumns = optParams.get(SparkDataSourceUtils.PARTITIONING_COLUMNS_KEY)
+        .map(SparkDataSourceUtils.decodePartitioningColumns)
+        .getOrElse(Nil)
+
+      val keyGeneratorClass = optParams.getOrElse(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY,
+        DataSourceWriteOptions.DEFAULT_KEYGENERATOR_CLASS_OPT_VAL)
+      val partitionPathField =
+        keyGeneratorClass match {
+          case "org.apache.hudi.keygen.CustomKeyGenerator" =>
+            partitionColumns.map(e => s"$e:SIMPLE").mkString(",")

Review comment:
   Yes. Now, if the parameters include ```TIMESTAMP_TYPE_FIELD_PROP``` and 
   ```TIMESTAMP_OUTPUT_DATE_FORMAT_PROP```, `TIMESTAMP` is used by default; 
   otherwise `SIMPLE`.
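   
   In other words (a sketch of that intent reusing the names from the diff above, not the exact patch), the CustomKeyGenerator branch would pick the suffix from whether the timestamp key-generator properties were supplied:
   
   ```scala
   // Sketch: choose TIMESTAMP only when both timestamp key-generator properties are present.
   val useTimestamp =
     optParams.contains("hoodie.deltastreamer.keygen.timebased.timestamp.type") &&
       optParams.contains("hoodie.deltastreamer.keygen.timebased.output.dateformat")
   
   val partitionPathField = keyGeneratorClass match {
     case "org.apache.hudi.keygen.CustomKeyGenerator" =>
       val suffix = if (useTimestamp) "TIMESTAMP" else "SIMPLE"
       partitionColumns.map(col => s"$col:$suffix").mkString(",")
     case _ =>
       partitionColumns.mkString(",")
   }
   ```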





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1205) Serialization fail when log file is larger than 2GB

2021-01-25 Thread Pelucchi Mauro (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271282#comment-17271282
 ] 

Pelucchi Mauro commented on HUDI-1205:
--

[~vbalaji] [~leehuynh] 

Hi guys, when we read data from Hudi, we randomly hit the error reported here.

We are on EMR 5.31 with Hudi 0.6.0 (from AWS).

If we use a snapshot built from the master branch, it is OK (we don't hit the issue). But we 
cannot use Hudi 0.7.0 on AWS because we keep hitting this other issue (related to spark-aws 2.4.6): 
java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V:
 java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
 at org.apache.hudi.MergeOnReadSnapshotRelation
Do you have any hints or workaround?

> Serialization fail when log file is larger than 2GB
> ---
>
> Key: HUDI-1205
> URL: https://issues.apache.org/jira/browse/HUDI-1205
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>
> When scanning the log file, if the log file (or log file group) is larger than 
> 2GB, serialization will fail because Hudi uses an Integer to store the size in bytes 
> for the log file, and the maximum size an Integer can represent is 2GB.
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
> Serialization trace:
> orderingVal (org.apache.hudi.common.model.OverwriteWithLatestAvroPayload)
> data (org.apache.hudi.common.model.HoodieRecord)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
> at 
> org.apache.hudi.common.util.SerializationUtils$KryoSerializerInstance.deserialize(SerializationUtils.java:107)
> at 
> org.apache.hudi.common.util.SerializationUtils.deserialize(SerializationUtils.java:81)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:217)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:211)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:207)
> at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:168)
> at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:55)
> at 
> org.apache.hudi.HoodieMergeOnReadRDD$$anon$1.hasNext(HoodieMergeOnReadRDD.scala:128)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.
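
As a brief aside on the root cause described above (not Hudi code): a byte count larger than Int.MaxValue (~2 GiB) cannot be held in a 32-bit size field and wraps negative.

{code:scala}
// ~3 GiB expressed in bytes overflows a 32-bit Int and wraps to a negative size.
val logFileSizeBytes: Long = 3L * 1024 * 1024 * 1024
println(Int.MaxValue)           // 2147483647, i.e. ~2 GiB
println(logFileSizeBytes.toInt) // -1073741824
{code}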

[jira] [Comment Edited] (HUDI-1205) Serialization fail when log file is larger than 2GB

2021-01-25 Thread Pelucchi Mauro (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271282#comment-17271282
 ] 

Pelucchi Mauro edited comment on HUDI-1205 at 1/25/21, 12:34 PM:
-

[~vbalaji] [~leehuynh] 

Hi guys, when we read data from Hudi, we randomly hit the error reported here.

We are on EMR 5.31 with Hudi 0.6.0 (from AWS).

If we use a snapshot built from the master branch, it is OK (we don't hit the issue). But we 
cannot use Hudi 0.7.0 on AWS because we keep hitting this other issue (related to spark-aws 2.4.6): 

java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V:
 java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
 at org.apache.hudi.MergeOnReadSnapshotRelation

Do you have any hints or workaround?


was (Author: mauro.pelucchi):
[~vbalaji] [~leehuynh] 

Hi guys, when we read data from Hudi, we randomly hit the error reported here.

We are on EMR 5.31 with Hudi 0.6.0 (from AWS).

If we use a snapshot built from the master branch, it is OK (we don't hit the issue). But we 
cannot use Hudi 0.7.0 on AWS because we keep hitting this other issue (related to spark-aws 2.4.6): 
java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V:
 java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
 at org.apache.hudi.MergeOnReadSnapshotRelation
Do you have any hints or workaround?

> Serialization fail when log file is larger than 2GB
> ---
>
> Key: HUDI-1205
> URL: https://issues.apache.org/jira/browse/HUDI-1205
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>
> When scanning the log file, if the log file (or log file group) is larger than 
> 2GB, serialization will fail because Hudi uses an Integer to store the size in bytes 
> for the log file, and the maximum size an Integer can represent is 2GB.
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
> Serialization trace:
> orderingVal (org.apache.hudi.common.model.OverwriteWithLatestAvroPayload)
> data (org.apache.hudi.common.model.HoodieRecord)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
> at 
> org.apache.hudi.common.util.SerializationUtils$KryoSerializerInstance.deserialize(SerializationUtils.java:107)
> at 
> org.apache.hudi.common.util.SerializationUtils.deserialize(SerializationUtils.java:81)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:217)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:211)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:207)
> at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:168)
> at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:55)
> at 
> org.apache.hudi.HoodieMergeOnReadRDD$$anon$1.hasNext(HoodieMergeOnReadRDD.scala:128)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter

[GitHub] [hudi] rubenssoto commented on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2021-01-25 Thread GitBox


rubenssoto commented on issue #1829:
URL: https://github.com/apache/hudi/issues/1829#issuecomment-766821182


   @vinothchandar 
   
   Thank you so much for your answer.
   When do you plan to release this version? I will try to make some 
workarounds until then.
   
   
   Is this configuration right?
   ```
   { "conf": {
   "spark.jars.packages": "org.apache.spark:spark-avro_2.12:2.4.4",
   "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
   "spark.jars": 
"s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar",
   "spark.sql.hive.convertMetastoreParquet": "false",
   "spark.hadoop.hoodie.metadata.enable": "true"}
   }
   ```
   
   I made these 2 queries:
   
   spark.read.format('hudi').load('s3://ze-data-lake/temp/order_test').count()
   
   
   %%sql 
   select count('*') from raw_courier_api.order_test
   
   On the PySpark query, Spark creates a job with 143 tasks and, after about 10 seconds of 
   listing, the count was fast; but on the Spark SQL query, Spark creates a job with 
   2000 tasks and it was very slow. Is this a Hudi or a Spark issue?
   
   Thank you so much!
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto edited a comment on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2021-01-25 Thread GitBox


rubenssoto edited a comment on issue #1829:
URL: https://github.com/apache/hudi/issues/1829#issuecomment-766821182


   @vinothchandar 
   
   Thank you so much for your answer.
   When do you plan to release this version? I will try to make some 
workarounds until then.
   
   
   Is this configuration right?
   ```
   { "conf": {
   "spark.jars.packages": "org.apache.spark:spark-avro_2.12:2.4.4",
   "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
   "spark.jars": 
"s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar",
   "spark.sql.hive.convertMetastoreParquet": "false",
   "spark.hadoop.hoodie.metadata.enable": "true"}
   }
   ```
   
   I made these 2 queries:
   
   spark.read.format('hudi').load('s3://ze-data-lake/temp/order_test').count()
   
   
   %%sql 
   select count('*') from raw_courier_api.order_test
   
   On the PySpark query, Spark creates a job with 143 tasks and, after about 10 seconds of 
   listing, the count was fast; but on the Spark SQL query, Spark creates a job with 
   2000 tasks and it was very slow. Is this a Hudi or a Spark issue?
   
   SPARK SQL
   https://user-images.githubusercontent.com/36298331/105713972-83bd7a80-5efa-11eb-91e0-b17ca1a3a394.png
   
   PYSPARK
   https://user-images.githubusercontent.com/36298331/105714171-ca12d980-5efa-11eb-8a68-97dc880b2671.png
   
   
   
   Thank you so much!
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2431: [HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field

2021-01-25 Thread GitBox


codecov-io edited a comment on pull request #2431:
URL: https://github.com/apache/hudi/pull/2431#issuecomment-757929313


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2431?src=pr&el=h1) Report
   > Merging 
[#2431](https://codecov.io/gh/apache/hudi/pull/2431?src=pr&el=desc) (6ad41e4) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/c4afd179c1983a382b8a5197d800b0f5dba254de?el=desc)
 (c4afd17) will **increase** coverage by `19.24%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2431/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2431?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff               @@
   ##             master    #2431       +/-   ##
   =============================================
   + Coverage     50.18%   69.43%   +19.24%
   + Complexity     3050      357     -2693
   =============================================
     Files           419       53      -366
     Lines         18931     1930    -17001
     Branches       1948      230     -1718
   =============================================
   - Hits           9500     1340     -8160
   + Misses         8656      456     -8200
   + Partials        775      134      -641
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.43% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2431?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...e/hudi/common/engine/HoodieLocalEngineContext.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9Ib29kaWVMb2NhbEVuZ2luZUNvbnRleHQuamF2YQ==)
 | | | |
   | 
[.../org/apache/hudi/MergeOnReadSnapshotRelation.scala](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL01lcmdlT25SZWFkU25hcHNob3RSZWxhdGlvbi5zY2FsYQ==)
 | | | |
   | 
[.../org/apache/hudi/exception/HoodieKeyException.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUtleUV4Y2VwdGlvbi5qYXZh)
 | | | |
   | 
[.../apache/hudi/common/bloom/BloomFilterTypeCode.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jsb29tL0Jsb29tRmlsdGVyVHlwZUNvZGUuamF2YQ==)
 | | | |
   | 
[...able/timeline/versioning/AbstractMigratorBase.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvQWJzdHJhY3RNaWdyYXRvckJhc2UuamF2YQ==)
 | | | |
   | 
[...rc/main/java/org/apache/hudi/cli/HoodiePrompt.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL0hvb2RpZVByb21wdC5qYXZh)
 | | | |
   | 
[.../org/apache/hudi/common/model/HoodieTableType.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVRhYmxlVHlwZS5qYXZh)
 | | | |
   | 
[.../scala/org/apache/hudi/Spark2RowDeserializer.scala](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmsyL3NyYy9tYWluL3NjYWxhL29yZy9hcGFjaGUvaHVkaS9TcGFyazJSb3dEZXNlcmlhbGl6ZXIuc2NhbGE=)
 | | | |
   | 
[...hudi/common/table/log/block/HoodieDeleteBlock.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVEZWxldGVCbG9jay5qYXZh)
 | | | |
   | 
[...cala/org/apache/hudi/HoodieBootstrapRelation.scala](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZUJvb3RzdHJhcFJlbGF0aW9uLnNjYWxh)
 | | | |
   | ... and [354 
more](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr&el=tree-more) | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, ple

[GitHub] [hudi] codecov-io commented on pull request #2486: Filtering abnormal data which the recordKeyField or precombineField is null in avro format

2021-01-25 Thread GitBox


codecov-io commented on pull request #2486:
URL: https://github.com/apache/hudi/pull/2486#issuecomment-766863772


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2486?src=pr&el=h1) Report
   > Merging 
[#2486](https://codecov.io/gh/apache/hudi/pull/2486?src=pr&el=desc) (5476bf0) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/c4afd179c1983a382b8a5197d800b0f5dba254de?el=desc)
 (c4afd17) will **decrease** coverage by `1.27%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2486/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2486?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2486      +/-   ##
   ============================================
   - Coverage     50.18%   48.90%    -1.28%
   + Complexity     3050     2155      -895
   ============================================
     Files           419      266      -153
     Lines         18931    12041     -6890
     Branches       1948     1133      -815
   ============================================
   - Hits           9500     5889     -3611
   + Misses         8656     5715     -2941
   + Partials        775      437      -338
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.21% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.47% <ø> (-0.03%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `?` | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2486?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==)
 | `78.12% <0.00%> (-1.57%)` | `26.00% <0.00%> (ø%)` | |
   | 
[.../hadoop/realtime/RealtimeUnmergedRecordReader.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL1JlYWx0aW1lVW5tZXJnZWRSZWNvcmRSZWFkZXIuamF2YQ==)
 | | | |
   | 
[...hudi/utilities/schema/FilebasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9GaWxlYmFzZWRTY2hlbWFQcm92aWRlci5qYXZh)
 | | | |
   | 
[...ties/exception/HoodieIncrementalPullException.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVJbmNyZW1lbnRhbFB1bGxFeGNlcHRpb24uamF2YQ==)
 | | | |
   | 
[...in/java/org/apache/hudi/schema/SchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zY2hlbWEvU2NoZW1hUHJvdmlkZXIuamF2YQ==)
 | | | |
   | 
[...udi/utilities/schema/DelegatingSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9EZWxlZ2F0aW5nU2NoZW1hUHJvdmlkZXIuamF2YQ==)
 | | | |
   | 
[...adoop/realtime/RealtimeBootstrapBaseFileSplit.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL1JlYWx0aW1lQm9vdHN0cmFwQmFzZUZpbGVTcGxpdC5qYXZh)
 | | | |
   | 
[...in/java/org/apache/hudi/hive/HoodieHiveClient.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSG9vZGllSGl2ZUNsaWVudC5qYXZh)
 | | | |
   | 
[...hadoop/realtime/RealtimeCompactedRecordReader.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL1JlYWx0aW1lQ29tcGFjdGVkUmVjb3JkUmVhZGVyLmphdmE=)
 | | | |
   | 
[...di/timeline/service/handlers/FileSliceHandler.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvaGFuZGxlcnMvRmlsZVNsaWNlSGFuZGxlci5qYXZh)
 | | | |
   | ... and [142 
more](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree-more) | |
   



This is an automated message from the Apa

[GitHub] [hudi] codecov-io edited a comment on pull request #2486: Filtering abnormal data which the recordKeyField or precombineField is null in avro format

2021-01-25 Thread GitBox


codecov-io edited a comment on pull request #2486:
URL: https://github.com/apache/hudi/pull/2486#issuecomment-766863772







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2486: Filtering abnormal data which the recordKeyField or precombineField is null in avro format

2021-01-25 Thread GitBox


codecov-io edited a comment on pull request #2486:
URL: https://github.com/apache/hudi/pull/2486#issuecomment-766863772


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2486?src=pr&el=h1) Report
   > Merging 
[#2486](https://codecov.io/gh/apache/hudi/pull/2486?src=pr&el=desc) (5476bf0) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/c4afd179c1983a382b8a5197d800b0f5dba254de?el=desc)
 (c4afd17) will **decrease** coverage by `2.19%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2486/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2486?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2486      +/-   ##
   ============================================
   - Coverage     50.18%   47.98%    -2.20%
   + Complexity     3050     2693      -357
   ============================================
     Files           419      366       -53
     Lines         18931    17001     -1930
     Branches       1948     1718      -230
   ============================================
   - Hits           9500     8158     -1342
   + Misses         8656     8201      -455
   + Partials        775      642      -133
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.21% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.47% <ø> (-0.03%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `0.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `65.85% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `?` | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2486?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==)
 | `78.12% <0.00%> (-1.57%)` | `26.00% <0.00%> (ø%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | | | |
   | 
[...g/apache/hudi/utilities/sources/AvroDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb0RGU1NvdXJjZS5qYXZh)
 | | | |
   | 
[...alCheckpointFromAnotherHoodieTimelineProvider.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2NoZWNrcG9pbnRpbmcvSW5pdGlhbENoZWNrcG9pbnRGcm9tQW5vdGhlckhvb2RpZVRpbWVsaW5lUHJvdmlkZXIuamF2YQ==)
 | | | |
   | 
[...callback/kafka/HoodieWriteCommitKafkaCallback.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2NhbGxiYWNrL2thZmthL0hvb2RpZVdyaXRlQ29tbWl0S2Fma2FDYWxsYmFjay5qYXZh)
 | | | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | | | |
   | 
[...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=)
 | | | |
   | 
[...udi/utilities/transform/FlatteningTransformer.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9GbGF0dGVuaW5nVHJhbnNmb3JtZXIuamF2YQ==)
 | | | |
   | 
[...ties/exception/HoodieIncrementalPullException.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVJbmNyZW1lbnRhbFB1bGxFeGNlcHRpb24uamF2YQ==)
 | | | |
   | 
[...g/apache/hudi/utilities/schema/SchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2486/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlci5qYXZh)
 | | | |
   | ... and [42 
more](https://codecov.io/gh/apache/hudi/pull/24

[GitHub] [hudi] codecov-io edited a comment on pull request #2382: [HUDI-1477] Support CopyOnWriteTable in java client

2021-01-25 Thread GitBox


codecov-io edited a comment on pull request #2382:
URL: https://github.com/apache/hudi/pull/2382#issuecomment-751367927







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2431: [HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field

2021-01-25 Thread GitBox


codecov-io edited a comment on pull request #2431:
URL: https://github.com/apache/hudi/pull/2431#issuecomment-757929313







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-1550) Incorrect query result for MOR table when merge base data with log

2021-01-25 Thread pengzhiwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pengzhiwei reassigned HUDI-1550:


Assignee: pengzhiwei

> Incorrect query result for MOR table when merge base data with log
> --
>
> Key: HUDI-1550
> URL: https://issues.apache.org/jira/browse/HUDI-1550
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
> Fix For: 0.8.0
>
>
> Table A is (id: int, value: string, ts: long), where "id" is the record key and 
> "ts" is the precombine key. Update table A with the following data:
> {code:java}
> (1, '10', 12)
> (1, '11', 10){code}
>  
> The result of "select * from A where id = 1" should be *(1, '10', 12)*. 
> However Hudi currently returns *(1, '11', 10)*, which is not the right answer.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1550) Incorrect query result for MOR table when merge base data with log

2021-01-25 Thread pengzhiwei (Jira)
pengzhiwei created HUDI-1550:


 Summary: Incorrect query result for MOR table when merge base data 
with log
 Key: HUDI-1550
 URL: https://issues.apache.org/jira/browse/HUDI-1550
 Project: Apache Hudi
  Issue Type: Bug
  Components: Spark Integration
Reporter: pengzhiwei
 Fix For: 0.8.0


Table A is (id: int, value: string, ts: long), where "id" is the record key and 
"ts" is the precombine key. Update table A with the following data:
{code:java}
(1, '10', 12)
(1,'11', 10){code}
 

The result of "select * from A where id = 1" should be *(1, '10', 12)*. 

However Hudi currently returns *(1, '11', 10)*, which is not the right answer.
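
A sketch of the reported scenario as a datasource write (the base path is a placeholder; "ts" is the precombine field, so the ts=12 record should win the merge):

{code:scala}
import spark.implicits._

val updates = Seq((1, "10", 12L), (1, "11", 10L)).toDF("id", "value", "ts")
updates.write.format("hudi")
  .option("hoodie.table.name", "A")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save("/tmp/hudi/A")

// Expected: "select * from A where id = 1" returns (1, '10', 12), not (1, '11', 10).
{code}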

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] rubenssoto edited a comment on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2021-01-25 Thread GitBox


rubenssoto edited a comment on issue #1829:
URL: https://github.com/apache/hudi/issues/1829#issuecomment-766821182


   @vinothchandar 
   
   Thank you so much for your answer.
   When do you plan to release this version? I will try to make some 
workarounds until then.
   
   
   Is this configuration right?
   ```
   { "conf": {
   "spark.jars.packages": "org.apache.spark:spark-avro_2.12:2.4.4",
   "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
   "spark.jars": 
"s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar",
   "spark.sql.hive.convertMetastoreParquet": "false",
   "spark.hadoop.hoodie.metadata.enable": "true"}
   }
   ```
   
   I made these 2 queries:
   
   spark.read.format('hudi').load('s3://ze-data-lake/temp/order_test').count()
   
   
   %%sql 
   select count('*') from raw_courier_api.order_test
   
   On the PySpark query, Spark creates a job with 143 tasks and, after about 10 seconds of 
   listing, the count was fast; but on the Spark SQL query, Spark creates a job with 
   2000 tasks and it was very slow. Is this a Hudi or a Spark issue?
   
   SPARK SQL
   https://user-images.githubusercontent.com/36298331/105713972-83bd7a80-5efa-11eb-91e0-b17ca1a3a394.png
   
   PYSPARK
   https://user-images.githubusercontent.com/36298331/105714171-ca12d980-5efa-11eb-8a68-97dc880b2671.png
   
   
   Another problem I ran into: my table has 36 million rows, but with that 
   config only about 4 million show up.
   Thank you so much!
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on issue #1829: [SUPPORT] S3 slow file listing causes Hudi read performance.

2021-01-25 Thread GitBox


vinothchandar commented on issue #1829:
URL: https://github.com/apache/hudi/issues/1829#issuecomment-766912406


   0.7.0 is being voted on right now. Hopefully today. 
   
   So the `spark.read.format('hudi')` route (the Spark datasource path) does not go 
   through Hive, so those configs may not help at all. Between PySpark and the Spark 
   datasource in Scala, there should be no difference. So not sure what's going on :/
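   
   One thing that may be worth ruling out while debugging (a sketch, and only an assumption that the option is honored on the read path): pass the metadata flag as a datasource read option rather than a `spark.hadoop.*` conf, so the metadata-based listing is definitely in effect for the datasource read.
   
   ```scala
   // Sketch: enable metadata-table based listing directly on the datasource read.
   val df = spark.read.format("hudi")
     .option("hoodie.metadata.enable", "true")
     .load("s3://ze-data-lake/temp/order_test")
   println(df.count())
   ```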



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vburenin commented on pull request #2476: [HUDI-1538] Try to init class trying different signatures instead of checking its name

2021-01-25 Thread GitBox


vburenin commented on pull request #2476:
URL: https://github.com/apache/hudi/pull/2476#issuecomment-766947415


   Can anybody merge this PR, please?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2443: [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable

2021-01-25 Thread GitBox


codecov-io edited a comment on pull request #2443:
URL: https://github.com/apache/hudi/pull/2443#issuecomment-760147630







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #2483: [HUDI-1545] Add test cases for INSERT_OVERWRITE Operation

2021-01-25 Thread GitBox


satishkotha commented on a change in pull request #2483:
URL: https://github.com/apache/hudi/pull/2483#discussion_r563962124



##
File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##
@@ -198,6 +198,31 @@ class TestCOWDataSource extends HoodieClientTestBase {
       .mode(SaveMode.Append)
       .save(basePath)
 
+    val records2 = recordsToStrings(dataGen.generateInserts("002", 5)).toList
+    val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+    inputDF2.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OVERWRITE_OPERATION_OPT_VAL)
+      .mode(SaveMode.Append)
+      .save(basePath)
+
+    val metaClient = new HoodieTableMetaClient(spark.sparkContext.hadoopConfiguration, basePath, true)
+    val commits = metaClient.getActiveTimeline.filterCompletedInstants().getInstants.toArray
+      .map(instant => (instant.asInstanceOf[HoodieInstant]).getAction)
+    assertEquals(2, commits.size)
+    assertEquals("commit", commits(0))
+    assertEquals("replacecommit", commits(1))

Review comment:
   Hi, can you also read back the records and verify that only records2 
   shows up (i.e., the data in records1 doesn't show up)?
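   
   Something along these lines would cover it (a sketch reusing the test's `spark` and `basePath`; the path glob depends on how the test data is partitioned):
   
   ```scala
   // Sketch of the suggested check: after the INSERT_OVERWRITE, only the 5 records2 rows remain.
   val snapshotDF = spark.read.format("org.apache.hudi").load(basePath + "/*/*/*/*")
   assertEquals(5, snapshotDF.count())
   ```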





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp

2021-01-25 Thread GitBox


codecov-io edited a comment on pull request #2438:
URL: https://github.com/apache/hudi/pull/2438#issuecomment-759677298







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp

2021-01-25 Thread GitBox


codecov-io edited a comment on pull request #2438:
URL: https://github.com/apache/hudi/pull/2438#issuecomment-759677298


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=h1) Report
   > Merging 
[#2438](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=desc) (87125e7) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/c4afd179c1983a382b8a5197d800b0f5dba254de?el=desc)
 (c4afd17) will **decrease** coverage by `6.14%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2438/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2438      +/-   ##
   ============================================
   - Coverage     50.18%   44.03%    -6.15%
   + Complexity     3050     2741      -309
   ============================================
     Files           419      419
     Lines         18931    18949       +18
     Branches       1948     1953        +5
   ============================================
   - Hits           9500     8345     -1155
   - Misses         8656     9949     +1293
   + Partials        775      655      -120
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.21% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.47% <ø> (-0.03%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `0.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `65.85% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `9.59% <0.00%> (-59.84%)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `0.00% <0.00%> (-70.51%)` | `0.00 <0.00> (-50.00)` | |
   | 
[...hudi/utilities/sources/helpers/KafkaOffsetGen.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvaGVscGVycy9LYWZrYU9mZnNldEdlbi5qYXZh)
 | `0.00% <0.00%> (-88.78%)` | `0.00 <0.00> (-16.00)` | |
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/h

[jira] [Updated] (HUDI-1205) Serialization fail when log file is larger than 2GB

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1205:
-
Labels: user-support-issues  (was: )

> Serialization fail when log file is larger than 2GB
> ---
>
> Key: HUDI-1205
> URL: https://issues.apache.org/jira/browse/HUDI-1205
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: user-support-issues
>
> When scanning the log file, if the log file (or log file group) is larger than 
> 2GB, serialization will fail because Hudi uses an Integer to store the log 
> file's size in bytes, and a signed 32-bit integer can only represent about 2GB.
> Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: 
> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
> Serialization trace:
> orderingVal (org.apache.hudi.common.model.OverwriteWithLatestAvroPayload)
> data (org.apache.hudi.common.model.HoodieRecord)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:160)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
> at 
> org.apache.hudi.common.util.SerializationUtils$KryoSerializerInstance.deserialize(SerializationUtils.java:107)
> at 
> org.apache.hudi.common.util.SerializationUtils.deserialize(SerializationUtils.java:81)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:217)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:211)
> at 
> org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:207)
> at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:168)
> at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:55)
> at 
> org.apache.hudi.HoodieMergeOnReadRDD$$anon$1.hasNext(HoodieMergeOnReadRDD.scala:128)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:624)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload$$Lambda$45/62103784
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:154)
> ... 31 more
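A quick illustration of the 2GB limit described above (plain Scala arithmetic, not Hudi code):

{code}
// Int.MaxValue is 2,147,483,647, i.e. just under 2 GiB when treated as a byte count.
val maxBytes: Int = Int.MaxValue
println(maxBytes / (1024.0 * 1024 * 1024))     // ~2.0 GiB
// A size past that limit wraps around if it is kept in an Int, corrupting the stored length.
println((maxBytes.toLong + 1).toInt)           // -2147483648
{code}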



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1547) CI intermittent failure: TestJsonStringToHoodieRecordMapFunction.testMapFunction

2021-01-25 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271654#comment-17271654
 ] 

Vinoth Chandar commented on HUDI-1547:
--

[~yanghua] [~wangxianghu] can one of you please triage this and take it over?

> CI intermittent failure: 
> TestJsonStringToHoodieRecordMapFunction.testMapFunction 
> -
>
> Key: HUDI-1547
> URL: https://issues.apache.org/jira/browse/HUDI-1547
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Release & Administrative
>Affects Versions: 0.8.0
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: user-support-issues
>
> [https://github.com/apache/hudi/issues/2467]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-418) Bootstrap Index - Implementation

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-418:

Fix Version/s: (was: 0.7.0)
   0.6.0

> Bootstrap Index - Implementation
> 
>
> Key: HUDI-418
> URL: https://issues.apache.org/jira/browse/HUDI-418
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> An implementation for 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+:+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi#RFC-12:EfficientMigrationofLargeParquetTablestoApacheHudi-BootstrapIndex:]
>  is present in 
> [https://github.com/bvaradar/hudi/blob/vb_bootstrap/hudi-common/src/main/java/org/apache/hudi/common/consolidated/CompositeMapFile.java]
>  
> We need to make it solid with unit-tests and cleanup. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-420) Automated end to end Integration Test

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-420:

Fix Version/s: (was: 0.7.0)
   0.6.0

> Automated end to end Integration Test
> -
>
> Key: HUDI-420
> URL: https://issues.apache.org/jira/browse/HUDI-420
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 72h
>  Remaining Estimate: 0h
>
> We need the end-to-end test in ITTestHoodieDemo to also cover bootstrapped 
> table cases.
> We can have a new table bootstrapped from the Hoodie table built in the demo 
> and ensure queries work and return the same results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-428) Web documentation for explaining how to bootstrap

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-428:

Fix Version/s: (was: 0.7.0)
   0.6.0

> Web documentation for explaining how to bootstrap 
> --
>
> Key: HUDI-428
> URL: https://issues.apache.org/jira/browse/HUDI-428
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Need to provide examples (demo) to document bootstrapping



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-421) Cleanup bootstrap code and create PR for FileStystemView changes

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-421:

Fix Version/s: (was: 0.7.0)
   0.6.0

> Cleanup bootstrap code and create PR for  FileStystemView changes
> -
>
> Key: HUDI-421
> URL: https://issues.apache.org/jira/browse/HUDI-421
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 240h
>  Remaining Estimate: 0h
>
> FileSystemView needs changes to identify and handle bootstrap file slices. 
> Code changes are present in 
> [https://github.com/bvaradar/hudi/tree/vb_bootstrap]. The changes need cleanup before 
> they are ready to become a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-424) Implement Hive Query Side Integration for querying tables containing bootstrap file slices

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-424:

Fix Version/s: (was: 0.7.0)
   0.6.0

> Implement Hive Query Side Integration for querying tables containing 
> bootstrap file slices
> --
>
> Key: HUDI-424
> URL: https://issues.apache.org/jira/browse/HUDI-424
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 336h
>  Remaining Estimate: 0h
>
> Support for Hive read-optimized and realtime queries 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-422) Cleanup bootstrap code and create write APIs for supporting bootstrap

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-422:

Fix Version/s: (was: 0.7.0)
   0.6.0

> Cleanup bootstrap code and create write APIs for supporting bootstrap 
> --
>
> Key: HUDI-422
> URL: https://issues.apache.org/jira/browse/HUDI-422
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 96h
>  Remaining Estimate: 0h
>
> Once the refactor of HoodieWriteClient is done, we can clean up and introduce 
> HoodieBootstrapClient as a separate PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-423) Implement upsert functionality for handling updates to these bootstrap file slices

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-423:

Fix Version/s: (was: 0.7.0)
   0.6.0

> Implement upsert functionality for handling updates to these bootstrap file 
> slices
> --
>
> Key: HUDI-423
> URL: https://issues.apache.org/jira/browse/HUDI-423
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core, Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 168h
>  Remaining Estimate: 0h
>
> Needs support for handling upserts to these file slices. For MOR tables, 
> compaction support is also needed. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-425) Implement support for bootstrapping in HoodieDeltaStreamer

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-425:

Fix Version/s: (was: 0.7.0)
   0.6.0

> Implement support for bootstrapping in HoodieDeltaStreamer
> --
>
> Key: HUDI-425
> URL: https://issues.apache.org/jira/browse/HUDI-425
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: help-wanted
> Fix For: 0.6.0
>
>  Time Spent: 168h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-807) Spark DS Support for incremental queries for bootstrapped tables

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-807:

Fix Version/s: (was: 0.7.0)
   0.6.0

> Spark DS Support for incremental queries for bootstrapped tables
> 
>
> Key: HUDI-807
> URL: https://issues.apache.org/jira/browse/HUDI-807
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 120h
>  Remaining Estimate: 0h
>
> Investigate and figure out the changes required in Spark integration code to 
> make incremental queries work seamlessly for bootstrapped tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-806) Implement support for bootstrapping via Spark datasource API

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-806:

Fix Version/s: (was: 0.7.0)
   0.6.0

> Implement support for bootstrapping via Spark datasource API
> 
>
> Key: HUDI-806
> URL: https://issues.apache.org/jira/browse/HUDI-806
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 336h
>  Remaining Estimate: 0h
>
> This Jira tracks the work required to perform bootstrapping through Spark 
> data source API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-242) [RFC-12] Support Efficient bootstrap of large parquet datasets to Hudi

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-242:

Fix Version/s: (was: 0.7.0)
   0.6.0

> [RFC-12] Support Efficient bootstrap of large parquet datasets to Hudi
> --
>
> Key: HUDI-242
> URL: https://issues.apache.org/jira/browse/HUDI-242
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Balaji Varadarajan
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
>  Support Efficient bootstrap of large parquet tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar closed pull request #2442: Adding new configurations in 0.7.0

2021-01-25 Thread GitBox


vinothchandar closed pull request #2442:
URL: https://github.com/apache/hudi/pull/2442


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2442: Adding new configurations in 0.7.0

2021-01-25 Thread GitBox


vinothchandar commented on pull request #2442:
URL: https://github.com/apache/hudi/pull/2442#issuecomment-767102394


   Will close this and open a new one



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2111: [HUDI-1234] Insert new records to data files without merging for "Insert" operation.

2021-01-25 Thread GitBox


vinothchandar commented on pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#issuecomment-767103157


   @nsivabalan I thought we were going to get this into 0.7.0? Checked back 
again to see why this was missing.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on pull request #2283: [HUDI-1415] Incorrect query result for hudi hive table when using spa…

2021-01-25 Thread GitBox


rubenssoto commented on pull request #2283:
URL: https://github.com/apache/hudi/pull/2283#issuecomment-767117951


   I had the same problem, but I saw fewer rows, not more.
   Reading with the Spark datasource I get more than 30 million rows, but using 
Spark SQL with Hive only 4 million.
   
   I only have this problem when these two options are enabled:
   
"spark.sql.hive.convertMetastoreParquet": "false"
"spark.hadoop.hoodie.metadata.enable": "true"
   
   @pengzhiwei2018 
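   
   For reference, a minimal sketch of a session carrying those two settings (only the two option keys come from the comment above; the app name, table and path are placeholders):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Illustrative only: the configuration under which the row-count mismatch was observed.
   val spark = SparkSession.builder()
     .appName("hudi-hive-read-repro")
     .config("spark.sql.hive.convertMetastoreParquet", "false")
     .config("spark.hadoop.hoodie.metadata.enable", "true")
     .enableHiveSupport()
     .getOrCreate()
   
   // Counts being compared in the report above (db/table name and path are placeholders).
   val hiveCount = spark.sql("SELECT COUNT(*) FROM db.hudi_table").first().getLong(0)
   val dsCount   = spark.read.format("org.apache.hudi").load("s3://bucket/path/to/table/*").count()
   ```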



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan opened a new pull request #2487: [WIP HUDI-53] Adding Record Level Index based on hoodie backed table

2021-01-25 Thread GitBox


nsivabalan opened a new pull request #2487:
URL: https://github.com/apache/hudi/pull/2487


   ## What is the purpose of the pull request
   
   Adding record level index based on hoodie backed table. 
   
   ## Brief change log
   
 - *Added RecordLevelIndex to hoodie that stores and exposes record level 
index info*
   
   Review guide:
   - Index class: RecordLevelIndex
   - Classes used in the read path for the index table (reads are supported in two 
modes: either scan fully and fetch key locations, or look up keys one by one):
   a. HoodieRecordLevelIndexScanner
   b. HoodieRecordLevelIndexLookupFunction and 
RecordLevelIndexLazyLookupIterator
   - Record schema : HoodieRecordLevelIndexRecord
   - Payload to be used in Index table: HoodieRecordLevelIndexPayload
   - Configs added: hoodie.record.level.index.num.partitions and 
hoodie.record.level.index.enable.seek
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-53) Implement Record level Index to map a record key to a <PartitionPath, FileID> pair #90

2021-01-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-53:
---
Labels: pull-request-available  (was: )

> Implement Record level Index to map a record key to a <PartitionPath, FileID> pair #90
> ---
>
> Key: HUDI-53
> URL: https://issues.apache.org/jira/browse/HUDI-53
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> [https://github.com/uber/hudi/issues/90] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] rubenssoto closed issue #2484: [SUPPORT] Hudi Write Performance

2021-01-25 Thread GitBox


rubenssoto closed issue #2484:
URL: https://github.com/apache/hudi/issues/2484


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on issue #2484: [SUPPORT] Hudi Write Performance

2021-01-25 Thread GitBox


rubenssoto commented on issue #2484:
URL: https://github.com/apache/hudi/issues/2484#issuecomment-767143513


   Hello,
   
   I enabled the option hoodie.datasource.write.row.writer.enable and the write took only 
21 minutes, about 30% faster. Great!
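   
   A minimal sketch of a bulk insert with that option turned on (only the row-writer key comes from this thread; the table name, key fields and path are placeholders):
   
   ```scala
   // Illustrative only: bulk_insert with the row writer enabled, which skips the
   // Dataset -> RDD conversion on the write path.
   df.write.format("org.apache.hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "updated_at")
     .option("hoodie.datasource.write.row.writer.enable", "true")
     .mode("append")
     .save("s3://bucket/path/to/table")
   ```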



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar opened a new pull request #2488: 0.7.0 Doc Revamp

2021-01-25 Thread GitBox


vinothchandar opened a new pull request #2488:
URL: https://github.com/apache/hudi/pull/2488


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on issue #2484: [SUPPORT] Hudi Write Performance

2021-01-25 Thread GitBox


vinothchandar commented on issue #2484:
URL: https://github.com/apache/hudi/issues/2484#issuecomment-767154231


   @rubenssoto yes. row writer is the difference. the `df.rdd` conversion in 
Spark takes that hit. I recommend sorting the file initially, since it gives 
you lots of returns for query performance



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on issue #2484: [SUPPORT] Hudi Write Performance

2021-01-25 Thread GitBox


rubenssoto commented on issue #2484:
URL: https://github.com/apache/hudi/issues/2484#issuecomment-767155405


   Do you mean an Order By before df.write.format('hudi').save()  ?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on issue #2484: [SUPPORT] Hudi Write Performance

2021-01-25 Thread GitBox


vinothchandar commented on issue #2484:
URL: https://github.com/apache/hudi/issues/2484#issuecomment-767157639


   No, I mean the sorting Hudi does internally, which you mentioned before. It is not even 
configurable for row writing, so all good. That should explain the extra time `(21-15)`.
   
   ```
   return colOrderedDataset
       .sort(functions.col(HoodieRecord.PARTITION_PATH_METADATA_FIELD),
           functions.col(HoodieRecord.RECORD_KEY_METADATA_FIELD))
       .coalesce(config.getBulkInsertShuffleParallelism());
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2488: 0.7.0 Doc Revamp

2021-01-25 Thread GitBox


vinothchandar commented on pull request #2488:
URL: https://github.com/apache/hudi/pull/2488#issuecomment-767158167


   I am going to also cut the release versions for the doc, once I finalize 
everything w.r.t the release. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch release-0.7.0 updated: [MINOR] Update release version to reflect published version ${RELEASE_VERSION}

2021-01-25 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch release-0.7.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/release-0.7.0 by this push:
 new 162dc18  [MINOR] Update release version to reflect published version  
${RELEASE_VERSION}
162dc18 is described below

commit 162dc18fc6a1e1d0db420a4735bc8c5a0ba7cf12
Author: Vinoth Chandar 
AuthorDate: Mon Jan 25 14:44:32 2021 -0800

[MINOR] Update release version to reflect published version  
${RELEASE_VERSION}
---
 docker/hoodie/hadoop/base/pom.xml   | 2 +-
 docker/hoodie/hadoop/datanode/pom.xml   | 2 +-
 docker/hoodie/hadoop/historyserver/pom.xml  | 2 +-
 docker/hoodie/hadoop/hive_base/pom.xml  | 2 +-
 docker/hoodie/hadoop/namenode/pom.xml   | 2 +-
 docker/hoodie/hadoop/pom.xml| 2 +-
 docker/hoodie/hadoop/prestobase/pom.xml | 2 +-
 docker/hoodie/hadoop/spark_base/pom.xml | 2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml | 2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml| 2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml| 2 +-
 hudi-cli/pom.xml| 2 +-
 hudi-client/hudi-client-common/pom.xml  | 4 ++--
 hudi-client/hudi-flink-client/pom.xml   | 4 ++--
 hudi-client/hudi-java-client/pom.xml| 4 ++--
 hudi-client/hudi-spark-client/pom.xml   | 4 ++--
 hudi-client/pom.xml | 2 +-
 hudi-common/pom.xml | 2 +-
 hudi-examples/pom.xml   | 2 +-
 hudi-flink/pom.xml  | 2 +-
 hudi-hadoop-mr/pom.xml  | 2 +-
 hudi-integ-test/pom.xml | 2 +-
 hudi-spark-datasource/hudi-spark-common/pom.xml | 4 ++--
 hudi-spark-datasource/hudi-spark/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark2/pom.xml   | 4 ++--
 hudi-spark-datasource/hudi-spark3/pom.xml   | 4 ++--
 hudi-spark-datasource/pom.xml   | 2 +-
 hudi-sync/hudi-dla-sync/pom.xml | 2 +-
 hudi-sync/hudi-hive-sync/pom.xml| 2 +-
 hudi-sync/hudi-sync-common/pom.xml  | 2 +-
 hudi-sync/pom.xml   | 2 +-
 hudi-timeline-service/pom.xml   | 2 +-
 hudi-utilities/pom.xml  | 2 +-
 packaging/hudi-flink-bundle/pom.xml | 2 +-
 packaging/hudi-hadoop-mr-bundle/pom.xml | 2 +-
 packaging/hudi-hive-sync-bundle/pom.xml | 2 +-
 packaging/hudi-integ-test-bundle/pom.xml| 2 +-
 packaging/hudi-presto-bundle/pom.xml| 2 +-
 packaging/hudi-spark-bundle/pom.xml | 2 +-
 packaging/hudi-timeline-server-bundle/pom.xml   | 2 +-
 packaging/hudi-utilities-bundle/pom.xml | 2 +-
 pom.xml | 2 +-
 42 files changed, 50 insertions(+), 50 deletions(-)

diff --git a/docker/hoodie/hadoop/base/pom.xml 
b/docker/hoodie/hadoop/base/pom.xml
index 27e4f4d..f13cdbc 100644
--- a/docker/hoodie/hadoop/base/pom.xml
+++ b/docker/hoodie/hadoop/base/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.7.0-rc2
+0.7.0
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/datanode/pom.xml 
b/docker/hoodie/hadoop/datanode/pom.xml
index 9ec6f37..ab14eb5 100644
--- a/docker/hoodie/hadoop/datanode/pom.xml
+++ b/docker/hoodie/hadoop/datanode/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.7.0-rc2
+0.7.0
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/historyserver/pom.xml 
b/docker/hoodie/hadoop/historyserver/pom.xml
index db1442e..6d6d2a1 100644
--- a/docker/hoodie/hadoop/historyserver/pom.xml
+++ b/docker/hoodie/hadoop/historyserver/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.7.0-rc2
+0.7.0
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/hive_base/pom.xml 
b/docker/hoodie/hadoop/hive_base/pom.xml
index 3765068..f9260bf 100644
--- a/docker/hoodie/hadoop/hive_base/pom.xml
+++ b/docker/hoodie/hadoop/hive_base/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.7.0-rc2
+0.7.0
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/namenode/pom.xml 
b/docker/hoodie/hadoop/namenode/pom.xml
index da10900..6fc3450 100644
--- a/docker/hoodie/hadoop/namenode/pom.xml
+++ b/docker/hoodie/hadoop/namenode/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.7.0-rc2
+0.7.0
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/pom.xml b/docker/hoodie/hadoop/pom.xml
index bb5b7c6..3c0aeba 100644
--- a/docker/hoodie/hadoop/pom.xml
+++ b/docker/hoodie/hadoop/pom.xml
@@ -19,7 +19,7 @@
   
 hudi
 org.apache.hudi
-0.7.0-rc2
+0.7.0
 ../../../pom.xml
   
   4.0.0
diff --git a/docker/hoodie/hadoop/prestobase/pom.xml 
b/docker/hoodie/hadoop/pres

[hudi] annotated tag release-0.7.0 updated (162dc18 -> 6ade5f1)

2021-01-25 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to annotated tag release-0.7.0
in repository https://gitbox.apache.org/repos/asf/hudi.git.


*** WARNING: tag release-0.7.0 was modified! ***

from 162dc18  (commit)
  to 6ade5f1  (tag)
 tagging 162dc18fc6a1e1d0db420a4735bc8c5a0ba7cf12 (commit)
 replaces release-0.7.0-rc2
  by Vinoth Chandar
  on Mon Jan 25 14:48:58 2021 -0800

- Log -
0.7.0
-BEGIN PGP SIGNATURE-

iQEzBAABCAAdFiEEfyo765IhgbBqyxqkX30J5YHSvLYFAmAPStoACgkQX30J5YHS
vLasoQf7BC+dy5i2lf0/ZJxiAEJvOLFc09sjqboq5ACAsf4oXJudJP/OHni0cwVj
3AmSuZoKd8R1ihS0mBwO9NxblghEapUNdWM1jd6fv+E69csStKXlHaNqAaRvWbhy
4w3s0JwAx7RWm5YFwdeGpb1GILt7HSsYDTSI/bg6xacKYIpvLzXoSaG6TB/dQ7d0
pVfcCcthkYwjuGgrkKSP8UMKl8QlKJF/D8NZrCLP7GBpTXILMjeYxZlx8qDPIzch
zDqX1KGd0VWpTQ03YjAYjDmk0PsFgvfPe7JSerdgHWUORt19n3fsO3W8ZJMgAck8
U5bfM0fIpC3Bjbmx+QEArIY5e17fUQ==
=jB8Z
-END PGP SIGNATURE-
---


No new revisions were added by this update.

Summary of changes:



[GitHub] [hudi] rubenssoto commented on issue #2484: [SUPPORT] Hudi Write Performance

2021-01-25 Thread GitBox


rubenssoto commented on issue #2484:
URL: https://github.com/apache/hudi/issues/2484#issuecomment-767173123


   Great, thank you for the explanation, it makes sense.
   
   If I understand this code right, Hudi will order by partition key and record 
key, so if I have an unpartitioned table the data in my files should be ordered 
by record key (primary key), is that right?
   
   I checked one table of mine; the screenshot below shows the min and max of the 
primary key for each file:
   https://user-images.githubusercontent.com/36298331/105777653-83000500-5f49-11eb-8243-a28257e26996.png
   
   
   Using that code as a reference, I imagined the files like:
   
   file01   1  1000
   file02  1001 2000
   file03  2001 3000
   ... and so on
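   
   For anyone wanting to reproduce that per-file check, a small sketch (the table path is a placeholder; `_hoodie_record_key` is Hudi's record-key meta column):
   
   ```scala
   import org.apache.spark.sql.functions.{col, input_file_name, max, min}
   
   // Illustrative only: min/max of the record key per data file, showing the effect
   // of the bulk-insert sort discussed above.
   val df = spark.read.format("org.apache.hudi").load("s3://bucket/path/to/table/*")
   df.withColumn("file", input_file_name())
     .groupBy("file")
     .agg(min(col("_hoodie_record_key")).as("min_key"), max(col("_hoodie_record_key")).as("max_key"))
     .orderBy("min_key")
     .show(truncate = false)
   ```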
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




svn commit: r45594 - in /release/hudi/hudi-0.7.0: ./ hudi-0.7.0.src.tgz hudi-0.7.0.src.tgz.asc hudi-0.7.0.src.tgz.sha512

2021-01-25 Thread vinoth
Author: vinoth
Date: Tue Jan 26 00:13:41 2021
New Revision: 45594

Log:
Hudi 0.7.0

Added:
release/hudi/hudi-0.7.0/
release/hudi/hudi-0.7.0/hudi-0.7.0.src.tgz   (with props)
release/hudi/hudi-0.7.0/hudi-0.7.0.src.tgz.asc
release/hudi/hudi-0.7.0/hudi-0.7.0.src.tgz.sha512

Added: release/hudi/hudi-0.7.0/hudi-0.7.0.src.tgz
==
Binary file - no diff available.

Propchange: release/hudi/hudi-0.7.0/hudi-0.7.0.src.tgz
--
svn:mime-type = application/octet-stream

Added: release/hudi/hudi-0.7.0/hudi-0.7.0.src.tgz.asc
==
--- release/hudi/hudi-0.7.0/hudi-0.7.0.src.tgz.asc (added)
+++ release/hudi/hudi-0.7.0/hudi-0.7.0.src.tgz.asc Tue Jan 26 00:13:41 2021
@@ -0,0 +1,11 @@
+-BEGIN PGP SIGNATURE-
+
+iQEzBAABCAAdFiEEfyo765IhgbBqyxqkX30J5YHSvLYFAmAPXDUACgkQX30J5YHS
+vLbBKgf/WYGDN8El4ySq6O2kekBCiUk9HgC1C90i2UhHUKq/hBUmcYxxIUdFmEif
+lffTXuoPOCCAlM0g5abAZDf3GmL04HCWEOcpW5ni0tZcagvz1FeMUN8EfgCsbVMZ
+OU2CuYLE9R+nuvX/qsnk/BqZr5rzgZc/stl31ryLg0MYBfodpFf9xkgKDV13L3Fp
+mh1+XgJTT9d4OtKL50xstfa/Ddo4EIoRA9FFr+ZCiWoWVkYrvc9YN7QYdp/J8l7N
+EinF28kNMje13pdcHG2APdJBk6qpFqUjIuOtJk2FkliWCs6CggPw6j0Gk37ukbtG
+6JsxFhFomvriui8JYwpB7a9yu6iedQ==
+=1K/M
+-END PGP SIGNATURE-

Added: release/hudi/hudi-0.7.0/hudi-0.7.0.src.tgz.sha512
==
--- release/hudi/hudi-0.7.0/hudi-0.7.0.src.tgz.sha512 (added)
+++ release/hudi/hudi-0.7.0/hudi-0.7.0.src.tgz.sha512 Tue Jan 26 00:13:41 2021
@@ -0,0 +1 @@
+4940bd82ec27be5688f483dd03087586a860612db25619c937e158853e585a59e73fd0fe2ba6cec72f83e6ceb8964fb8f39e666f44044900dc4e008533de78f5
  hudi-0.7.0.src.tgz




[jira] [Reopened] (HUDI-1502) Restore on MOR table leaves metadata table out-of-sync from data table

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-1502:
--

> Restore on MOR table leaves metadata table out-of-sync from data table
> --
>
> Key: HUDI-1502
> URL: https://issues.apache.org/jira/browse/HUDI-1502
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: image-2021-01-03-22-48-54-646.png
>
>
> Below is the stack trace from running `TestHoodieBackedMetadata#testSync` on 
> MOR tables. This seems like a more fundamental issue with deleting instant 
> files during restore. 
> So what happens is that the restore rolls back a delta commit that has 
> not been synced yet (20210103224054 in the example), and that delta commit has 
> introduced a new log file which has not been added to the metadata table. 
> But the restore effectively deletes the 20210103224054.deltacommit. 
> {code}
> Commit 20210103224042 added HoodieKey { recordKey=2016/03/15 
> partitionPath=files} HoodieMetadataPayload {key=2016/03/15, type=2, 
> creations=[6b8f2187-5505-40ae-845e-a71a2163d064-0_4-2-6_20210103224041.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/16 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/16, type=2, 
> creations=[028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet,
>  25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet], 
> deletions=[], }
>   HoodieKey { recordKey=2015/03/17 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/17, type=2, 
> creations=[2ab899de-4745-43c5-9fa4-d09721d3aa91-0_1-2-3_20210103224041.parquet,
>  4733dbda-7824-4411-a708-4b2d978f887b-0_4-9-22_20210103224042.parquet, 
> 532a6f9b-ca89-4b96-84b7-0e3b13068b4b-0_3-9-21_20210103224042.parquet, 
> 6842e596-46b3-4546-9faa-8a7f8c674a17-0_0-2-2_20210103224041.parquet, 
> 7f0635d7-126e-40b6-9677-7fd8a123d5b9-0_3-2-5_20210103224041.parquet, 
> d1906fdc-66ca-48a4-86b6-687c865d939d-0_2-9-20_20210103224042.parquet, 
> fd446460-a662-434a-a6ab-1cd498af94ca-0_2-2-4_20210103224041.parquet], 
> deletions=[], }
>   HoodieKey { recordKey=__all_partitions__ partitionPath=files} 
> HoodieMetadataPayload {key=__all_partitions__, type=1, creations=[2015/03/16, 
> 2015/03/17, 2016/03/15], deletions=[], } 
>  Syncing [20210103224045__deltacommit__COMPLETED] to metadata table.
> Commit 20210103224045 added HoodieKey { recordKey=2016/03/15 
> partitionPath=files} HoodieMetadataPayload {key=2016/03/15, type=2, 
> creations=[6b8f2187-5505-40ae-845e-a71a2163d064-0_0-31-52_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/16 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/16, type=2, 
> creations=[25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/17 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/17, type=2, 
> creations=[2ab899de-4745-43c5-9fa4-d09721d3aa91-0_2-31-54_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=__all_partitions__ partitionPath=files} 
> HoodieMetadataPayload {key=__all_partitions__, type=1, creations=[2015/03/16, 
> 2015/03/17, 2016/03/15], deletions=[], } >>> (after compaction) State at 
> 20210103224051 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224042.log.1_0-100-148 
>028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after delete) State at 20210103224052 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224042.log.1_0-100-148 
>028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after clean) State at 20210103224053 files 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after update) State at 20210103224054 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224051.log.1_1-160-262 
>.25c9a174-4c07-43a1-a1a2-40454a3f0310-0_20210103224045.log.1_2-160-263 
>
> 028cc15e-85ef-4b6f-b

[jira] [Resolved] (HUDI-1502) Restore on MOR table leaves metadata table out-of-sync from data table

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1502.
--
Resolution: Fixed

> Restore on MOR table leaves metadata table out-of-sync from data table
> --
>
> Key: HUDI-1502
> URL: https://issues.apache.org/jira/browse/HUDI-1502
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: image-2021-01-03-22-48-54-646.png
>
>
> Below is the stack trace from running `TestHoodieBackedMetadata#testSync` on 
> MOR tables. This seems like a more fundamental issue with deleting instant 
> files during restore. 
> So what happens is that the restore rolls back a delta commit that has 
> not been synced yet (20210103224054 in the example), and that delta commit has 
> introduced a new log file which has not been added to the metadata table. 
> But the restore effectively deletes the 20210103224054.deltacommit. 
> {code}
> Commit 20210103224042 added HoodieKey { recordKey=2016/03/15 
> partitionPath=files} HoodieMetadataPayload {key=2016/03/15, type=2, 
> creations=[6b8f2187-5505-40ae-845e-a71a2163d064-0_4-2-6_20210103224041.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/16 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/16, type=2, 
> creations=[028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet,
>  25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet], 
> deletions=[], }
>   HoodieKey { recordKey=2015/03/17 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/17, type=2, 
> creations=[2ab899de-4745-43c5-9fa4-d09721d3aa91-0_1-2-3_20210103224041.parquet,
>  4733dbda-7824-4411-a708-4b2d978f887b-0_4-9-22_20210103224042.parquet, 
> 532a6f9b-ca89-4b96-84b7-0e3b13068b4b-0_3-9-21_20210103224042.parquet, 
> 6842e596-46b3-4546-9faa-8a7f8c674a17-0_0-2-2_20210103224041.parquet, 
> 7f0635d7-126e-40b6-9677-7fd8a123d5b9-0_3-2-5_20210103224041.parquet, 
> d1906fdc-66ca-48a4-86b6-687c865d939d-0_2-9-20_20210103224042.parquet, 
> fd446460-a662-434a-a6ab-1cd498af94ca-0_2-2-4_20210103224041.parquet], 
> deletions=[], }
>   HoodieKey { recordKey=__all_partitions__ partitionPath=files} 
> HoodieMetadataPayload {key=__all_partitions__, type=1, creations=[2015/03/16, 
> 2015/03/17, 2016/03/15], deletions=[], } 
>  Syncing [20210103224045__deltacommit__COMPLETED] to metadata table.
> Commit 20210103224045 added HoodieKey { recordKey=2016/03/15 
> partitionPath=files} HoodieMetadataPayload {key=2016/03/15, type=2, 
> creations=[6b8f2187-5505-40ae-845e-a71a2163d064-0_0-31-52_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/16 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/16, type=2, 
> creations=[25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/17 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/17, type=2, 
> creations=[2ab899de-4745-43c5-9fa4-d09721d3aa91-0_2-31-54_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=__all_partitions__ partitionPath=files} 
> HoodieMetadataPayload {key=__all_partitions__, type=1, creations=[2015/03/16, 
> 2015/03/17, 2016/03/15], deletions=[], } >>> (after compaction) State at 
> 20210103224051 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224042.log.1_0-100-148 
>028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after delete) State at 20210103224052 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224042.log.1_0-100-148 
>028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after clean) State at 20210103224053 files 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after update) State at 20210103224054 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224051.log.1_1-160-262 
>.25c9a174-4c07-43a1-a1a2-40454a3f0310-0_20210103224045.log.1_2-160-263 
>

[jira] [Resolved] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1435.
--
Resolution: Fixed

> Marker File Reconciliation failing for Non-Partitioned datasets when 
> duplicate marker files present
> ---
>
> Key: HUDI-1435
> URL: https://issues.apache.org/jira/browse/HUDI-1435
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> GH : https://github.com/apache/hudi/issues/2294



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-1435:
--

> Marker File Reconciliation failing for Non-Partitioned datasets when 
> duplicate marker files present
> ---
>
> Key: HUDI-1435
> URL: https://issues.apache.org/jira/browse/HUDI-1435
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> GH : https://github.com/apache/hudi/issues/2294



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1351) Improvements required to hudi-test-suite for scalable and repeated testing

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1351.
--
Resolution: Fixed

> Improvements required to hudi-test-suite for scalable and repeated testing
> --
>
> Key: HUDI-1351
> URL: https://issues.apache.org/jira/browse/HUDI-1351
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> There are some shortcomings of the hudi-test-suite which would be good to fix:
> 1. When doing repeated testing with the same DAG, the input and output 
> directories need to be manually cleaned. This is cumbersome for repeated 
> testing.
> 2. When running a long test, the input data generated by older DAG nodes is 
> not deleted and leads to high file count on the HDFS cluster. The older files 
> can be deleted once the data has been ingested.
> 3. When generating input data, if the number of insert/update partitions is 
> less than spark's default parallelism, a number of empty avro files are 
> created. This also leads to scalability issues on the HDFS cluster. Creating 
> large number of smaller AVRO files is slower and less scalable than single 
> AVRO file.
> 4. When generating data to be inserted, we cannot control which partition the 
> data will be generated for or add a new partition. Hence we need a 
> start_offset parameter to control the partition offset.
> 5. BUG: Does not generate correct number of insert partitions as partition 
> number is chosen as a random long. 
> 6. BUG: Integer division used within Math.ceil in a couple of places is not 
> correct and leads to 0 value.  Math.ceil(5/10) == 0 and not 1 (as intended) 
> as 5 and 10 are integers.
>  
> 1. When generating input data, 
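
A quick illustration of the integer-division pitfall called out in item 6 above (plain Scala, not the test-suite code itself):

{code}
// 5 / 10 is integer division and evaluates to 0 before Math.ceil ever sees it.
println(math.ceil(5 / 10))      // 0.0  (wrong: the division already truncated)
// Promoting one operand to a floating-point value gives the intended result.
println(math.ceil(5.0 / 10))    // 1.0
{code}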



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1302) Add support for timestamp field in HiveSync

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1302.
--
Resolution: Fixed

> Add support for timestamp field in HiveSync
> ---
>
> Key: HUDI-1302
> URL: https://issues.apache.org/jira/browse/HUDI-1302
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Hudi HiveSyncTool converts int64 fields to the 'bigint' hive type.  
> If the field has OriginalType 'TIMESTAMP_MICROS', the field needs to be 
> converted into the 'timestamp' hive type.
> This has to be done in a backward-compatible way, so already synced tables will 
> continue to get the hive type 'bigint'; the 'timestamp' conversion can be enabled 
> manually.
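
For illustration, the kind of field being described and, assuming the opt-in is exposed as a hive-sync write option named `hoodie.datasource.hive_sync.support_timestamp` (an assumption, not confirmed by this issue text), how it might be enabled:

{code}
// Illustrative only. An Avro long with the timestamp-micros logical type is what ends up
// as Parquet int64 / TIMESTAMP_MICROS and, by default, as Hive 'bigint' after sync.
val avroField = """{"name": "event_time", "type": {"type": "long", "logicalType": "timestamp-micros"}}"""

// Assumed option key; kept off by default so already-synced tables keep 'bigint'.
df.write.format("org.apache.hudi")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.support_timestamp", "true")
  .mode("append")
  .save("s3://bucket/path/to/table")
{code}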



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1302) Add support for timestamp field in HiveSync

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-1302:
--

> Add support for timestamp field in HiveSync
> ---
>
> Key: HUDI-1302
> URL: https://issues.apache.org/jira/browse/HUDI-1302
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Hudi HiveSyncTool converts int64 fields to the 'bigint' hive type.  
> If the field has OriginalType 'TIMESTAMP_MICROS', the field needs to be 
> converted into the 'timestamp' hive type.
> This has to be done in a backward-compatible way, so already synced tables will 
> continue to get the hive type 'bigint'; the 'timestamp' conversion can be enabled 
> manually.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1301) use spark INCREMENTAL mode query hudi dataset support schema version

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-1301:
--

> use spark INCREMENTAL mode query hudi  dataset support schema version
> -
>
> Key: HUDI-1301
> URL: https://issues.apache.org/jira/browse/HUDI-1301
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> I. Issue
> 1. On the write side, write two commits; the second commit adds a column, for example:
> commit1 schema and data
> id , name 
> 1, lisi
>  
> commit2  schema and data
> id, name , age
> 2, zhangsan, 18
>  
> 2. On the read side,
> reading the latest commit returns
> id, name , age
> 1, lisi, null
> 2, zhangsan, 18
>  
> reading the first commit by setting END_INSTANTTIME_OPT_KEY to the first commit 
> returns 
> id, name , age
> 1, lisi, null
>  
> II. Solution
> We can see that reading the first commit also returns the "age" column. I think if 
> END_INSTANTTIME_OPT_KEY is set to the first commit, both the schema and the data should 
> match that commit.
> To be clearer, it should return 
> id, name 
> 1, lisi
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1298) Add better error messages when IOException occurs during log file reading

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-1298:
--

> Add better error messages when IOException occurs during log file reading
> -
>
> Key: HUDI-1298
> URL: https://issues.apache.org/jira/browse/HUDI-1298
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> We often notice that, while reading log files, we get the following 
> exception: 
> `org.apache.hudi.exception.HoodieIOException:IOException when reading log 
> file`
>  
> One example of such exception seen is also in this github issue -> 
> https://github.com/apache/hudi/issues/2104



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1301) use spark INCREMENTAL mode query hudi dataset support schema version

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1301.
--
Resolution: Fixed

> use spark INCREMENTAL mode query hudi  dataset support schema version
> -
>
> Key: HUDI-1301
> URL: https://issues.apache.org/jira/browse/HUDI-1301
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> I. Issue
> 1. On the write side, write two commits; the second commit adds a column, for example:
> commit1 schema and data
> id , name 
> 1, lisi
>  
> commit2  schema and data
> id, name , age
> 2, zhangsan, 18
>  
> 2. On the read side,
> reading the latest commit returns
> id, name , age
> 1, lisi, null
> 2, zhangsan, 18
>  
> reading the first commit by setting END_INSTANTTIME_OPT_KEY to the first commit 
> returns 
> id, name , age
> 1, lisi, null
>  
> II. Solution
> We can see that reading the first commit also returns the "age" column. I think if 
> END_INSTANTTIME_OPT_KEY is set to the first commit, both the schema and the data should 
> match that commit.
> To be clearer, it should return 
> id, name 
> 1, lisi
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1351) Improvements required to hudi-test-suite for scalable and repeated testing

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-1351:
--

> Improvements required to hudi-test-suite for scalable and repeated testing
> --
>
> Key: HUDI-1351
> URL: https://issues.apache.org/jira/browse/HUDI-1351
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> There are some shortcomings of the hudi-test-suite which would be good to fix:
> 1. When doing repeated testing with the same DAG, the input and output 
> directories need to be manually cleaned. This is cumbersome for repeated 
> testing.
> 2. When running a long test, the input data generated by older DAG nodes is 
> not deleted and leads to high file count on the HDFS cluster. The older files 
> can be deleted once the data has been ingested.
> 3. When generating input data, if the number of insert/update partitions is 
> less than spark's default parallelism, a number of empty avro files are 
> created. This also leads to scalability issues on the HDFS cluster. Creating 
> a large number of smaller AVRO files is slower and less scalable than a single 
> AVRO file.
> 4. When generating data to be inserted, we cannot control which partition the 
> data will be generated for or add a new partition. Hence we need a 
> start_offset parameter to control the partition offset.
> 5. BUG: Does not generate the correct number of insert partitions, as the 
> partition number is chosen as a random long. 
> 6. BUG: Integer division used within Math.ceil in a couple of places is not 
> correct and leads to a 0 value: Math.ceil(5/10) == 0 and not 1 (as intended), 
> since 5 and 10 are integers.
>  
> 1. When generating input data, 
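A minimal Scala sketch of the integer-division pitfall called out in item 6 above (variable names are illustrative): the division truncates before Math.ceil ever runs, so the dividend must be widened to a double first.

```scala
val inputPartitions = 5
val parallelism = 10

// Integer division happens first, so ceil sees 0 and returns 0.0.
val wrong = math.ceil(inputPartitions / parallelism)           // 0.0
// Widening to a double before dividing gives the intended result.
val right = math.ceil(inputPartitions.toDouble / parallelism)  // 1.0
```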



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1298) Add better error messages when IOException occurs during log file reading

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1298.
--
Resolution: Fixed

> Add better error messages when IOException occurs during log file reading
> -
>
> Key: HUDI-1298
> URL: https://issues.apache.org/jira/browse/HUDI-1298
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> We often notice that, while reading log files, we get the following 
> exception: 
> `org.apache.hudi.exception.HoodieIOException:IOException when reading log 
> file`
>  
> One example of such exception seen is also in this github issue -> 
> https://github.com/apache/hudi/issues/2104



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1153) Spark DataSource and Streaming Write must fail when operation type is misconfigured

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-1153:
--

> Spark DataSource and Streaming Write must fail when operation type is 
> misconfigured
> ---
>
> Key: HUDI-1153
> URL: https://issues.apache.org/jira/browse/HUDI-1153
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Sreeram Ramji
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Context: [https://github.com/apache/hudi/issues/1902#issuecomment-669698259]
>  
> If you look at DataSourceUtils.java, 
> [https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L257]
>  
> we are using string comparison to determine the operation type, which is a bad 
> idea since a typo could result in "upsert" being used silently. 
>  
> Just like 
> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L187]
>  being used for DeltaStreamer, we need similar enums defined in 
> DataSourceOptions.scala for OPERATION_OPT_KEY, but care must be taken to 
> ensure we do not cause a backwards compatibility issue by changing the property 
> value. In other words, we need to retain the lower case values 
> ("bulk_insert", "insert" and "upsert") but make it an enum. 
>  
>  
>  
>  
>  
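A rough sketch of the idea (illustrative names only, not the actual Hudi implementation): keep the existing lower-case property values but validate them against a fixed set, so a typo fails fast instead of silently falling back to "upsert".

```scala
// Illustrative sketch: an enum-like holder for the allowed operation values.
object WriteOperation {
  val BulkInsert = "bulk_insert"
  val Insert = "insert"
  val Upsert = "upsert"
  private val All = Set(BulkInsert, Insert, Upsert)

  // Fail fast on a misconfigured operation instead of defaulting silently.
  def validate(operation: String): String = {
    require(All.contains(operation), s"Unsupported write operation: $operation")
    operation
  }
}
```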



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1228) create a utility to query extra metadata

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1228.
--
Resolution: Fixed

> create a utility to query extra metadata
> 
>
> Key: HUDI-1228
> URL: https://issues.apache.org/jira/browse/HUDI-1228
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> We have many use cases where users want to store extra metadata as part of a 
> commit and read this extra metadata for other use cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-791) Replace null by Option in Delta Streamer

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-791:
-

> Replace null by Option in Delta Streamer
> 
>
> Key: HUDI-791
> URL: https://issues.apache.org/jira/browse/HUDI-791
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer, newbie
>Reporter: Yanjia Gary Li
>Assignee: liwei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> There is a lot of null usage in Delta Streamer. It would be great if we could 
> replace those nulls with Option. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1226) ComplexKeyGenerator doesnt work for non partitioned tables

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-1226:
--

> ComplexKeyGenerator doesnt work for non partitioned tables
> --
>
> Key: HUDI-1226
> URL: https://issues.apache.org/jira/browse/HUDI-1226
> Project: Apache Hudi
>  Issue Type: Wish
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> 1) If we pass an empty string (-hoodie-conf 
> hoodie.datasource.write.partitionpath.field=), the generator returns 'default' 
> as the partition path
> 2) If we pass the delimiter alone (-hoodie-conf 
> hoodie.datasource.write.partitionpath.field=,), it throws 
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>   at 
> java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:824)
>   at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:253)
>   at 
> org.apache.hudi.keygen.KeyGenUtils.getRecordPartitionPath(KeyGenUtils.java:80)
>   at 
> org.apache.hudi.keygen.ComplexKeyGenerator.getPartitionPath(ComplexKeyGenerator.java:52)
>   at 
> org.apache.hudi.keygen.BuiltinKeyGenerator.getKey(BuiltinKeyGenerator.java:75)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-842) RFC-15 : Implementation of File Listing elimination

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-842.
-
Resolution: Fixed

> RFC-15 : Implementation of File Listing elimination
> ---
>
> Key: HUDI-842
> URL: https://issues.apache.org/jira/browse/HUDI-842
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> This is an umbrella task which tracks the implementation of [RFC 15 - File 
> Listing and Query Planning 
> Improvements|https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1153) Spark DataSource and Streaming Write must fail when operation type is misconfigured

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1153.
--
Resolution: Fixed

> Spark DataSource and Streaming Write must fail when operation type is 
> misconfigured
> ---
>
> Key: HUDI-1153
> URL: https://issues.apache.org/jira/browse/HUDI-1153
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Sreeram Ramji
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Context: [https://github.com/apache/hudi/issues/1902#issuecomment-669698259]
>  
> If you look at DataSourceUtils.java, 
> [https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L257]
>  
> we are using string comparison to determine the operation type, which is a bad 
> idea since a typo could result in "upsert" being used silently. 
>  
> Just like 
> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L187]
>  being used for DeltaStreamer, we need similar enums defined in 
> DataSourceOptions.scala for OPERATION_OPT_KEY, but care must be taken to 
> ensure we do not cause a backwards compatibility issue by changing the property 
> value. In other words, we need to retain the lower case values 
> ("bulk_insert", "insert" and "upsert") but make it an enum. 
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1191) create incremental meta client abstraction to query modified partitions

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1191.
--
Resolution: Fixed

> create incremental meta client abstraction to query modified partitions
> ---
>
> Key: HUDI-1191
> URL: https://issues.apache.org/jira/browse/HUDI-1191
> Project: Apache Hudi
>  Issue Type: Wish
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Create incremental client abstraction to query modified partitions for a 
> timeline.
> This can be reused in HiveSync and InputFormats. We also need this as an API 
> for other usecases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1228) create a utility to query extra metadata

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-1228:
--

> create a utility to query extra metadata
> 
>
> Key: HUDI-1228
> URL: https://issues.apache.org/jira/browse/HUDI-1228
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> We have many use cases where users want to store extra metadata as part of a 
> commit and read this extra metadata for other use cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-842) RFC-15 : Implementation of File Listing elimination

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-842:
-

> RFC-15 : Implementation of File Listing elimination
> ---
>
> Key: HUDI-842
> URL: https://issues.apache.org/jira/browse/HUDI-842
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> This is an umbrella task which tracks the implementation of [RFC 15 - File 
> Listing and Query Planning 
> Improvements|https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1226) ComplexKeyGenerator doesnt work for non partitioned tables

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1226.
--
Resolution: Fixed

> ComplexKeyGenerator doesnt work for non partitioned tables
> --
>
> Key: HUDI-1226
> URL: https://issues.apache.org/jira/browse/HUDI-1226
> Project: Apache Hudi
>  Issue Type: Wish
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> 1) If we pass an empty string (-hoodie-conf 
> hoodie.datasource.write.partitionpath.field=), the generator returns 'default' 
> as the partition path
> 2) If we pass the delimiter alone (-hoodie-conf 
> hoodie.datasource.write.partitionpath.field=,), it throws 
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>   at 
> java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:824)
>   at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:253)
>   at 
> org.apache.hudi.keygen.KeyGenUtils.getRecordPartitionPath(KeyGenUtils.java:80)
>   at 
> org.apache.hudi.keygen.ComplexKeyGenerator.getPartitionPath(ComplexKeyGenerator.java:52)
>   at 
> org.apache.hudi.keygen.BuiltinKeyGenerator.getKey(BuiltinKeyGenerator.java:75)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-791) Replace null by Option in Delta Streamer

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-791.
-
Resolution: Fixed

> Replace null by Option in Delta Streamer
> 
>
> Key: HUDI-791
> URL: https://issues.apache.org/jira/browse/HUDI-791
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer, newbie
>Reporter: Yanjia Gary Li
>Assignee: liwei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> There is a lot of null usage in Delta Streamer. It would be great if we could 
> replace those nulls with Option. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1191) create incremental meta client abstraction to query modified partitions

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-1191:
--

> create incremental meta client abstraction to query modified partitions
> ---
>
> Key: HUDI-1191
> URL: https://issues.apache.org/jira/browse/HUDI-1191
> Project: Apache Hudi
>  Issue Type: Wish
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Create incremental client abstraction to query modified partitions for a 
> timeline.
> This can be reused in HiveSync and InputFormats. We also need this as an API 
> for other usecases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-575) Support Async Compaction for spark streaming writes to hudi table

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-575:
-

> Support Async Compaction for spark streaming writes to hudi table
> -
>
> Key: HUDI-575
> URL: https://issues.apache.org/jira/browse/HUDI-575
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently, only inline compaction is supported for Structured Streaming 
> writes. 
>  
> We need to 
>  * Enable configuring async compaction for streaming writes 
>  * Implement a parallel compaction process like we did for delta streamer
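As a rough illustration only, a Structured Streaming write with async compaction turned on might be configured as below; `df` is assumed to be an existing streaming DataFrame, the paths, table name, and field names are placeholders, and the async-compaction option key is an assumption that should be checked against DataSourceWriteOptions:

```scala
// Illustrative streaming write to a MOR table; with async compaction enabled,
// delta commits proceed while compaction runs as a parallel process.
val query = df.writeStream
  .format("hudi")
  .option("hoodie.table.name", "my_table")                       // placeholder
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "uuid")     // placeholder
  .option("hoodie.datasource.write.precombine.field", "ts")      // placeholder
  .option("hoodie.datasource.compaction.async.enable", "true")   // assumed key
  .option("checkpointLocation", "/tmp/hudi_checkpoint")
  .outputMode("append")
  .start("/tmp/hudi_streaming_table")
```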



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-284) Need Tests for Hudi handling of schema evolution

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-284.
-
Resolution: Fixed

> Need  Tests for Hudi handling of schema evolution
> -
>
> Key: HUDI-284
> URL: https://issues.apache.org/jira/browse/HUDI-284
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Common Core, newbie, Testing
>Reporter: Balaji Varadarajan
>Assignee: liwei
>Priority: Major
>  Labels: help-wanted, pull-request-available, starter
> Fix For: 0.7.0
>
>
> Context in : 
> https://github.com/apache/incubator-hudi/pull/927#pullrequestreview-293449514



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-284) Need Tests for Hudi handling of schema evolution

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-284:
-

> Need  Tests for Hudi handling of schema evolution
> -
>
> Key: HUDI-284
> URL: https://issues.apache.org/jira/browse/HUDI-284
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Common Core, newbie, Testing
>Reporter: Balaji Varadarajan
>Assignee: liwei
>Priority: Major
>  Labels: help-wanted, pull-request-available, starter
> Fix For: 0.7.0
>
>
> Context in : 
> https://github.com/apache/incubator-hudi/pull/927#pullrequestreview-293449514



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-575) Support Async Compaction for spark streaming writes to hudi table

2021-01-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-575.
-
Resolution: Fixed

> Support Async Compaction for spark streaming writes to hudi table
> -
>
> Key: HUDI-575
> URL: https://issues.apache.org/jira/browse/HUDI-575
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently, only inline compaction is supported for Structured Streaming 
> writes. 
>  
> We need to 
>  * Enable configuring async compaction for streaming writes 
>  * Implement a parallel compaction process like we did for delta streamer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Querys

2021-01-25 Thread GitBox


nsivabalan commented on issue #2013:
URL: https://github.com/apache/hudi/issues/2013#issuecomment-767204986


   @garyli1019 : can you give any updates you have in this regard? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #1982: [SUPPORT] Not able to write to ADLS Gen2 in Azure Databricks, with error has invalid authority.

2021-01-25 Thread GitBox


nsivabalan commented on issue #1982:
URL: https://github.com/apache/hudi/issues/1982#issuecomment-767205667


   @Ac-Rush : would you mind updating the ticket? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena

2021-01-25 Thread GitBox


nsivabalan commented on issue #1981:
URL: https://github.com/apache/hudi/issues/1981#issuecomment-767206596


   @vinothchandar @umehrot2 : can either of you respond here w.r.t. metadata 
support (RFC-15) in Athena? When can we possibly expect it? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #1971: Schema evoluation causes issue when using kafka source in hudi deltastreamer

2021-01-25 Thread GitBox


nsivabalan commented on issue #1971:
URL: https://github.com/apache/hudi/issues/1971#issuecomment-767208636


   @jingweiz2017 : can you please check the above response and let us know if you 
need anything more from the Hudi community. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #1962: [SUPPORT] Unable to filter hudi table in hive on partition column

2021-01-25 Thread GitBox


nsivabalan commented on issue #1962:
URL: https://github.com/apache/hudi/issues/1962#issuecomment-767209175


   @bvaradar : guess you missed following up on this thread. Can you check it 
out and respond when you can? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2487: [WIP HUDI-53] Adding Record Level Index based on hoodie backed table

2021-01-25 Thread GitBox


nsivabalan commented on a change in pull request #2487:
URL: https://github.com/apache/hudi/pull/2487#discussion_r564142151



##
File path: 
hudi-common/src/main/java/org/apache/hudi/index/HoodieRecordLevelIndexPayload.java
##
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index;
+
+import org.apache.hudi.avro.model.HoodieRecordLevelIndexRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import java.io.IOException;
+
+/**
+ * Payload used in index table for Hoodie Record level index.
+ */
+public class HoodieRecordLevelIndexPayload implements 
HoodieRecordPayload<HoodieRecordLevelIndexPayload> {
+
+  private String key;
+  private String partitionPath;
+  private String instantTime;
+  private String fileId;
+
+  public HoodieRecordLevelIndexPayload(Option<GenericRecord> record) {
+if (record.isPresent()) {
+  // This can be simplified using SpecificData.deepcopy once this bug is 
fixed
+  // https://issues.apache.org/jira/browse/AVRO-1811
+  key = record.get().get("key").toString();
+  partitionPath = record.get().get("partitionPath").toString();
+  instantTime = record.get().get("instantTime").toString();
+  fileId = record.get().get("fileId").toString();
+}
+  }
+
+  private HoodieRecordLevelIndexPayload(String key, String partitionPath, 
String instantTime, String fileId) {
+this.key = key;
+this.partitionPath = partitionPath;
+this.instantTime = instantTime;
+this.fileId = fileId;
+  }
+
+  @Override
+  public HoodieRecordLevelIndexPayload 
preCombine(HoodieRecordLevelIndexPayload another) {
+if (this.instantTime.compareTo(another.instantTime) >= 0) {

Review comment:
   Note: this needs some fixing. Can we just convert the string to long 
and compare? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

2021-01-25 Thread GitBox


nsivabalan commented on issue #1958:
URL: https://github.com/apache/hudi/issues/1958#issuecomment-767210126


   https://github.com/apache/hudi/pull/1978 has fixed it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan closed issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

2021-01-25 Thread GitBox


nsivabalan closed issue #1958:
URL: https://github.com/apache/hudi/issues/1958


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-1547) CI intermittent failure: TestJsonStringToHoodieRecordMapFunction.testMapFunction

2021-01-25 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu reassigned HUDI-1547:
-

Assignee: wangxianghu

> CI intermittent failure: 
> TestJsonStringToHoodieRecordMapFunction.testMapFunction 
> -
>
> Key: HUDI-1547
> URL: https://issues.apache.org/jira/browse/HUDI-1547
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Release & Administrative
>Affects Versions: 0.8.0
>Reporter: sivabalan narayanan
>Assignee: wangxianghu
>Priority: Major
>  Labels: user-support-issues
>
> [https://github.com/apache/hudi/issues/2467]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1547) CI intermittent failure: TestJsonStringToHoodieRecordMapFunction.testMapFunction

2021-01-25 Thread wangxianghu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271766#comment-17271766
 ] 

wangxianghu commented on HUDI-1547:
---

[~vinoth] I can take it

> CI intermittent failure: 
> TestJsonStringToHoodieRecordMapFunction.testMapFunction 
> -
>
> Key: HUDI-1547
> URL: https://issues.apache.org/jira/browse/HUDI-1547
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Release & Administrative
>Affects Versions: 0.8.0
>Reporter: sivabalan narayanan
>Assignee: wangxianghu
>Priority: Major
>  Labels: user-support-issues
>
> [https://github.com/apache/hudi/issues/2467]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


svn commit: r45595 - in /release/hudi: 0.7.0/ hudi-0.7.0/

2021-01-25 Thread vinoth
Author: vinoth
Date: Tue Jan 26 01:37:48 2021
New Revision: 45595

Log:
Renaming for Hudi 0.7.0

Added:
release/hudi/0.7.0/
  - copied from r45594, release/hudi/hudi-0.7.0/
Removed:
release/hudi/hudi-0.7.0/



[GitHub] [hudi] codecov-io commented on pull request #2487: [WIP HUDI-53] Adding Record Level Index based on hoodie backed table

2021-01-25 Thread GitBox


codecov-io commented on pull request #2487:
URL: https://github.com/apache/hudi/pull/2487#issuecomment-767228748


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2487?src=pr&el=h1) Report
   > Merging 
[#2487](https://codecov.io/gh/apache/hudi/pull/2487?src=pr&el=desc) (8b07157) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/e302c6bc12c7eb764781898fdee8ee302ef4ec10?el=desc)
 (e302c6b) will **increase** coverage by `19.24%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2487/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2487?src=pr&el=tree)
   
   ```diff
   @@  Coverage Diff  @@
   ## master#2487   +/-   ##
   =
   + Coverage 50.18%   69.43%   +19.24% 
   + Complexity 3050  357 -2693 
   =
 Files   419   53  -366 
 Lines 18931 1930-17001 
 Branches   1948  230 -1718 
   =
   - Hits   9500 1340 -8160 
   + Misses 8656  456 -8200 
   + Partials775  134  -641 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.43% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2487?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...e/hudi/common/engine/HoodieLocalEngineContext.java](https://codecov.io/gh/apache/hudi/pull/2487/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9Ib29kaWVMb2NhbEVuZ2luZUNvbnRleHQuamF2YQ==)
 | | | |
   | 
[.../org/apache/hudi/MergeOnReadSnapshotRelation.scala](https://codecov.io/gh/apache/hudi/pull/2487/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL01lcmdlT25SZWFkU25hcHNob3RSZWxhdGlvbi5zY2FsYQ==)
 | | | |
   | 
[.../org/apache/hudi/exception/HoodieKeyException.java](https://codecov.io/gh/apache/hudi/pull/2487/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUtleUV4Y2VwdGlvbi5qYXZh)
 | | | |
   | 
[.../apache/hudi/common/bloom/BloomFilterTypeCode.java](https://codecov.io/gh/apache/hudi/pull/2487/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jsb29tL0Jsb29tRmlsdGVyVHlwZUNvZGUuamF2YQ==)
 | | | |
   | 
[...able/timeline/versioning/AbstractMigratorBase.java](https://codecov.io/gh/apache/hudi/pull/2487/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvQWJzdHJhY3RNaWdyYXRvckJhc2UuamF2YQ==)
 | | | |
   | 
[...rc/main/java/org/apache/hudi/cli/HoodiePrompt.java](https://codecov.io/gh/apache/hudi/pull/2487/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL0hvb2RpZVByb21wdC5qYXZh)
 | | | |
   | 
[.../org/apache/hudi/common/model/HoodieTableType.java](https://codecov.io/gh/apache/hudi/pull/2487/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVRhYmxlVHlwZS5qYXZh)
 | | | |
   | 
[.../scala/org/apache/hudi/Spark2RowDeserializer.scala](https://codecov.io/gh/apache/hudi/pull/2487/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmsyL3NyYy9tYWluL3NjYWxhL29yZy9hcGFjaGUvaHVkaS9TcGFyazJSb3dEZXNlcmlhbGl6ZXIuc2NhbGE=)
 | | | |
   | 
[...hudi/common/table/log/block/HoodieDeleteBlock.java](https://codecov.io/gh/apache/hudi/pull/2487/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVEZWxldGVCbG9jay5qYXZh)
 | | | |
   | 
[...cala/org/apache/hudi/HoodieBootstrapRelation.scala](https://codecov.io/gh/apache/hudi/pull/2487/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZUJvb3RzdHJhcFJlbGF0aW9uLnNjYWxh)
 | | | |
   | ... and [356 
more](https://codecov.io/gh/apache/hudi/pull/2487/diff?src=pr&el=tree-more) | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about

[GitHub] [hudi] codecov-io edited a comment on pull request #2487: [WIP HUDI-53] Adding Record Level Index based on hoodie backed table

2021-01-25 Thread GitBox


codecov-io edited a comment on pull request #2487:
URL: https://github.com/apache/hudi/pull/2487#issuecomment-767228748







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] jingweiz2017 commented on issue #1971: Schema evoluation causes issue when using kafka source in hudi deltastreamer

2021-01-25 Thread GitBox


jingweiz2017 commented on issue #1971:
URL: https://github.com/apache/hudi/issues/1971#issuecomment-767242422


   @nsivabalan @bvaradar , thanks for the reply. The commit mentioned by 
bvaradar should work for my case. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



