[GitHub] [hudi] hudi-bot commented on pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
hudi-bot commented on PR #3391: URL: https://github.com/apache/hudi/pull/3391#issuecomment-1146551501

## CI report:

* 0e87e2b237e7272ee2e321e91280754d18c63f87 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9072)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-runs the last Azure build
[GitHub] [hudi] srinugsr2020 commented on issue #3894: [SUPPORT] Property hoodie.datasource.write.recordkey.field not found during version ONE to TWO migration
srinugsr2020 commented on issue #3894: URL: https://github.com/apache/hudi/issues/3894#issuecomment-1146548461

Hi, I recently upgraded Apache Hudi to 0.10.0 and realized that Redshift Spectrum is not able to read the data. I found a couple of other links related to the same issue, but there is no solution yet; the only option right now is to downgrade to an earlier version. I tried the downgrade command but received the same error (java.lang.IllegalArgumentException: Property hoodie.datasource.write.recordkey.field not found) even after setting hoodie.metadata.enable to false in hoodie.properties. I then tried downgrading to version ONE and it ran fine, but when I ran the job it still failed with "unknown version code 3". I am using AWS EMR 6.6.0 with Hudi CLI 0.10. Please suggest.
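For readers hitting the same problem: the downgrade being attempted is normally driven through the Hudi CLI. The session below is an illustrative sketch only; the `connect` and `downgrade table` commands reflect the 0.10.x-era CLI and the path is a placeholder, so verify the exact command set against your release before running it:

```
hudi-> connect --path s3://<bucket>/<table-base-path>
hudi:<table-name>-> downgrade table --toVersion ONE
```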
[GitHub] [hudi] hudi-bot commented on pull request #5737: [HUDI-4178][Stacked on 5733] Addressing performance regressions in Spark DataSourceV2 Integration
hudi-bot commented on PR #5737: URL: https://github.com/apache/hudi/pull/5737#issuecomment-1146545735

## CI report:

* dad16d1d712576b3b92389ab6ab045dc16bdafbf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9071)
[GitHub] [hudi] hudi-bot commented on pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
hudi-bot commented on PR #3391: URL: https://github.com/apache/hudi/pull/3391#issuecomment-1146533938

## CI report:

* 6a2ed8538256bc9ee9ef5470cc5c573739a75b4b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9069)
* 0e87e2b237e7272ee2e321e91280754d18c63f87 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9072)
[GitHub] [hudi] hudi-bot commented on pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
hudi-bot commented on PR #3391: URL: https://github.com/apache/hudi/pull/3391#issuecomment-1146533034

## CI report:

* 6a2ed8538256bc9ee9ef5470cc5c573739a75b4b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9069)
* 0e87e2b237e7272ee2e321e91280754d18c63f87 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
hudi-bot commented on PR #5733: URL: https://github.com/apache/hudi/pull/5733#issuecomment-1146532508

## CI report:

* bb436a73c4bd66a5a90467475710851b598d2ae9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9070)
[GitHub] [hudi] hudi-bot commented on pull request #5737: [HUDI-4178][Stacked on 5733] Addressing performance regressions in Spark DataSourceV2 Integration
hudi-bot commented on PR #5737: URL: https://github.com/apache/hudi/pull/5737#issuecomment-1146520863

## CI report:

* f84be3540d82f6bbb06e3f690671465298f63c9c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9067)
* dad16d1d712576b3b92389ab6ab045dc16bdafbf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9071)
[GitHub] [hudi] hudi-bot commented on pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
hudi-bot commented on PR #5733: URL: https://github.com/apache/hudi/pull/5733#issuecomment-1146520849

## CI report:

* e6173f00c290e9e0bb55e4c7d8092f5eb26871ae Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9066)
* bb436a73c4bd66a5a90467475710851b598d2ae9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9070)
[GitHub] [hudi] hudi-bot commented on pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
hudi-bot commented on PR #3391: URL: https://github.com/apache/hudi/pull/3391#issuecomment-1146520556

## CI report:

* be76a443fe07639f0eb0cd5727ff64dc3fe29c22 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8863)
* 6a2ed8538256bc9ee9ef5470cc5c573739a75b4b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9069)
[GitHub] [hudi] hudi-bot commented on pull request #5737: [HUDI-4178][Stacked on 5733] Addressing performance regressions in Spark DataSourceV2 Integration
hudi-bot commented on PR #5737: URL: https://github.com/apache/hudi/pull/5737#issuecomment-1146520090

## CI report:

* f84be3540d82f6bbb06e3f690671465298f63c9c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9067)
* dad16d1d712576b3b92389ab6ab045dc16bdafbf UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
hudi-bot commented on PR #5733: URL: https://github.com/apache/hudi/pull/5733#issuecomment-1146520083

## CI report:

* e6173f00c290e9e0bb55e4c7d8092f5eb26871ae Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9066)
* bb436a73c4bd66a5a90467475710851b598d2ae9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
hudi-bot commented on PR #3391: URL: https://github.com/apache/hudi/pull/3391#issuecomment-1146519726

## CI report:

* be76a443fe07639f0eb0cd5727ff64dc3fe29c22 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8863)
* 6a2ed8538256bc9ee9ef5470cc5c573739a75b4b UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5747: [HUDI-4171] Fixing Non partitioned with virtual keys in read path
hudi-bot commented on PR #5747: URL: https://github.com/apache/hudi/pull/5747#issuecomment-1146492665

## CI report:

* 5e2e2ecd996e075f8a8ad026e918426f4d4cacce Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9068)
[GitHub] [hudi] hudi-bot commented on pull request #5737: [HUDI-4178][Stacked on 5733] Addressing performance regressions in Spark DataSourceV2 Integration
hudi-bot commented on PR #5737: URL: https://github.com/apache/hudi/pull/5737#issuecomment-1146480680

## CI report:

* f84be3540d82f6bbb06e3f690671465298f63c9c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9067)
[GitHub] [hudi] hudi-bot commented on pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
hudi-bot commented on PR #5733: URL: https://github.com/apache/hudi/pull/5733#issuecomment-1146480673

## CI report:

* e6173f00c290e9e0bb55e4c7d8092f5eb26871ae Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9066)
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5737: [HUDI-4178][Stacked on 5733] Addressing performance regressions in Spark DataSourceV2 Integration
alexeykudinkin commented on code in PR #5737: URL: https://github.com/apache/hudi/pull/5737#discussion_r889425572

In `hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala`:

```diff
@@ -105,12 +106,16 @@ class HoodieCatalog extends DelegatingCatalogExtension
       case _ => catalogTable0
     }
-    HoodieInternalV2Table(
+
+    val v2Table = HoodieInternalV2Table(
       spark = spark,
       path = catalogTable.location.toString,
       catalogTable = Some(catalogTable),
       tableIdentifier = Some(ident.toString))
-  case o => o
+    // TODO elaborate
+    v2Table.v1TableWrapper
```

Review Comment: Why? The catalog still exposes the methods to write into Hudi tables.
[GitHub] [hudi] leesf commented on a diff in pull request #5737: [HUDI-4178][Stacked on 5733] Addressing performance regressions in Spark DataSourceV2 Integration
leesf commented on code in PR #5737: URL: https://github.com/apache/hudi/pull/5737#discussion_r889424940

In `hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala`:

```diff
@@ -105,12 +106,16 @@ class HoodieCatalog extends DelegatingCatalogExtension
       case _ => catalogTable0
     }
-    HoodieInternalV2Table(
+
+    val v2Table = HoodieInternalV2Table(
      spark = spark,
      path = catalogTable.location.toString,
      catalogTable = Some(catalogTable),
      tableIdentifier = Some(ident.toString))
-  case o => o
+    // TODO elaborate
+    v2Table.v1TableWrapper
```

Review Comment: It means users are no longer able to write/read data when specifying HoodieCatalog if using V1Table, since it has no capabilities.
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
alexeykudinkin commented on code in PR #5733: URL: https://github.com/apache/hudi/pull/5733#discussion_r889423624

In `hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java`:

```diff
@@ -128,7 +133,11 @@ public void write(InternalRow record) throws IOException {
     if (!keyGeneratorOpt.isPresent()) { // NoPartitionerKeyGen
       partitionPath = "";
     } else if (simpleKeyGen) { // SimpleKeyGen
-      partitionPath = (record.get(simplePartitionFieldIndex, simplePartitionFieldDataType)).toString();
+      Object parititionPathValue = record.get(simplePartitionFieldIndex, simplePartitionFieldDataType);
```

Review Comment: Yes, will rebase before landing.
[GitHub] [hudi] alexeykudinkin commented on pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
alexeykudinkin commented on PR #5733: URL: https://github.com/apache/hudi/pull/5733#issuecomment-1146477125

Build succeeded: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=9066&view=results
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
alexeykudinkin commented on code in PR #5733: URL: https://github.com/apache/hudi/pull/5733#discussion_r889403945

In `hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java`:

```diff
@@ -626,4 +499,128 @@ public Option<String> getTableHistorySchemaStrFromCommitMetadata() {
     String result = manager.getHistorySchemaStr();
     return result.isEmpty() ? Option.empty() : Option.of(result);
   }
+
+  /**
+   * NOTE: This method could only be used in tests
+   *
+   * @VisibleForTesting
+   */
+  public boolean hasOperationField() {
```

Review Comment: Method did not change.

In `hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java`:

```diff
@@ -267,43 +265,47 @@ public Option getInstantDetails(HoodieInstant instant) {
   }

   /**
-   * Get the last instant with valid schema, and convert this to HoodieCommitMetadata
+   * Returns most recent instant having valid schema in its {@link HoodieCommitMetadata}
    */
   public Option<Pair<HoodieInstant, HoodieCommitMetadata>> getLastCommitMetadataWithValidSchema() {
-    List<HoodieInstant> completed = getCommitsTimeline().filterCompletedInstants().getInstants()
-        .sorted(Comparator.comparing(HoodieInstant::getTimestamp).reversed()).collect(Collectors.toList());
-    for (HoodieInstant instant : completed) {
-      try {
-        HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(
-            getInstantDetails(instant).get(), HoodieCommitMetadata.class);
-        if (!StringUtils.isNullOrEmpty(commitMetadata.getMetadata(HoodieCommitMetadata.SCHEMA_KEY))) {
-          return Option.of(Pair.of(instant, commitMetadata));
-        }
-      } catch (IOException e) {
-        LOG.warn("Failed to convert instant to HoodieCommitMetadata: " + instant.toString());
-      }
-    }
-    return Option.empty();
+    return Option.fromJavaOptional(
+        getCommitMetadataStream()
+            .filter(instantCommitMetadataPair ->
+                !StringUtils.isNullOrEmpty(instantCommitMetadataPair.getValue().getMetadata(HoodieCommitMetadata.SCHEMA_KEY)))
+            .findFirst()
+    );
   }

   /**
    * Get the last instant with valid data, and convert this to HoodieCommitMetadata
    */
   public Option<Pair<HoodieInstant, HoodieCommitMetadata>> getLastCommitMetadataWithValidData() {
-    List<HoodieInstant> completed = getCommitsTimeline().filterCompletedInstants().getInstants()
-        .sorted(Comparator.comparing(HoodieInstant::getTimestamp).reversed()).collect(Collectors.toList());
-    for (HoodieInstant instant : completed) {
-      try {
-        HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(
-            getInstantDetails(instant).get(), HoodieCommitMetadata.class);
-        if (!commitMetadata.getFileIdAndRelativePaths().isEmpty()) {
-          return Option.of(Pair.of(instant, commitMetadata));
-        }
-      } catch (IOException e) {
-        LOG.warn("Failed to convert instant to HoodieCommitMetadata: " + instant.toString());
-      }
-    }
-    return Option.empty();
+    return Option.fromJavaOptional(
+        getCommitMetadataStream()
+            .filter(instantCommitMetadataPair ->
+                !instantCommitMetadataPair.getValue().getFileIdAndRelativePaths().isEmpty())
+            .findFirst()
+    );
+  }
+
+  /**
+   * Returns stream of {@link HoodieCommitMetadata} in order reverse to chronological (ie most
+   * recent metadata being the first element)
+   */
+  private Stream<Pair<HoodieInstant, HoodieCommitMetadata>> getCommitMetadataStream() {
+    // NOTE: Streams are lazy
```

Review Comment: Yes, streams are lazy; the whole chain is only evaluated as far as needed to produce a single object.

In `hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java`:

```diff
@@ -626,4 +499,128 @@ public Option<String> getTableHistorySchemaStrFromCommitMetadata() {
     String result = manager.getHistorySchemaStr();
     return result.isEmpty() ? Option.empty() : Option.of(result);
   }
+
+  /**
+   * NOTE: This method could only be used in tests
+   *
+   * @VisibleForTesting
+   */
+  public boolean hasOperationField() {
+    try {
+      Schema tableAvroSchema = getTableAvroSchemaFromDataFile();
+      return tableAvroSchema.getField(HoodieRecord.OPERATION_METADATA_FIELD) != null;
+    } catch (Exception e) {
+      LOG.info(String.format("Failed to read operation field from avro schema (%s)", e.getMessage()));
+      return false;
+    }
+  }
+
+  private Option<Pair<HoodieInstant, HoodieCommitMetadata>> getLatestCommitMetadataWithValidSchema() {
+    if (latestCommitWithValidSchema == null) {
+      Option<Pair<HoodieInstant, HoodieCommitMetadata>> instantAndCommitMetadata =
+          metaClient.getActiveTimeline().getLastCommitMetadataWithValidSchema();
+      if (instantAndCommitMetadata.isPresent()) {
+        HoodieInstant instant = instantAndCommitMetadata.get().getLeft();
+        HoodieCommitMetadata metadata = instantAndCommitMetadata.get().getRight();
+        synchronized (this) {
+          if (latestCommitWithValidSchema == null) {
+            latestCommitWithValidSchema = instant;
```
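For readers unfamiliar with the laziness point being made above: intermediate `Stream` operations in Java are not evaluated until a terminal operation pulls elements, so a `findFirst()` only performs the expensive mapping for as many elements as it actually consumes. A minimal, self-contained illustration (not Hudi code):

```java
import java.util.List;
import java.util.Optional;

public class LazyStreamDemo {
  public static void main(String[] args) {
    List<String> instants = List.of("003", "002", "001"); // newest first

    Optional<String> first = instants.stream()
        .map(ts -> {
          // In Hudi's case this step would be the expensive HoodieCommitMetadata
          // parse; it only runs for elements the terminal operation pulls.
          System.out.println("parsing " + ts);
          return ts;
        })
        .filter(ts -> !ts.isEmpty())
        .findFirst();

    // Prints "parsing 003" exactly once, then "result: 003"
    System.out.println("result: " + first.orElse("none"));
  }
}
```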
[GitHub] [hudi] hudi-bot commented on pull request #5747: [HUDI-4171] Fixing Non partitioned with virtual keys in read path
hudi-bot commented on PR #5747: URL: https://github.com/apache/hudi/pull/5747#issuecomment-1146462631

## CI report:

* 5e2e2ecd996e075f8a8ad026e918426f4d4cacce Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9068)
[GitHub] [hudi] hudi-bot commented on pull request #5747: [HUDI-4171] Fixing Non partitioned with virtual keys in read path
hudi-bot commented on PR #5747: URL: https://github.com/apache/hudi/pull/5747#issuecomment-1146460018

## CI report:

* 5e2e2ecd996e075f8a8ad026e918426f4d4cacce UNKNOWN
[GitHub] [hudi] nsivabalan commented on a diff in pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
nsivabalan commented on code in PR #5733: URL: https://github.com/apache/hudi/pull/5733#discussion_r889414242

In `hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java`:

```diff
@@ -128,7 +133,11 @@ public void write(InternalRow record) throws IOException {
     if (!keyGeneratorOpt.isPresent()) { // NoPartitionerKeyGen
       partitionPath = "";
     } else if (simpleKeyGen) { // SimpleKeyGen
-      partitionPath = (record.get(simplePartitionFieldIndex, simplePartitionFieldDataType)).toString();
+      Object parititionPathValue = record.get(simplePartitionFieldIndex, simplePartitionFieldDataType);
```

Review Comment: Is it possible to remove the fixes that I have in https://github.com/apache/hudi/pull/5664? You may run into conflicts.
[jira] [Updated] (HUDI-4171) NonPartitioned Key gen w/ virtual keys fails to be read w/ presto
[ https://issues.apache.org/jira/browse/HUDI-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4171:
- Labels: pull-request-available (was: )

> Key: HUDI-4171
> URL: https://issues.apache.org/jira/browse/HUDI-4171
> Project: Apache Hudi
> Issue Type: Bug
> Components: reader-core
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0

Description: Looks like the non-partitioned key generator does not work well when virtual keys are enabled.

{code:java}
Query 20220531_171243_00023_eudi3 failed: Fetching table schema failed with exception
io.prestosql.spi.PrestoException: Fetching table schema failed with exception
    at io.prestosql.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:223)
    at io.prestosql.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
    at io.prestosql.$gen.Presto_33220220531_134705_2.run(Unknown Source)
    at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException: Fetching table schema failed with exception
    at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.getHoodieVirtualKeyInfo(HoodieCopyOnWriteTableInputFormat.java:289)
    at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:245)
    at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:140)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
    at org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
    at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:407)
    at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:287)
    at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:107)
    at io.prestosql.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:216)
    ... 6 more
Caused by: java.lang.NullPointerException
    at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.getHoodieVirtualKeyInfo(HoodieCopyOnWriteTableInputFormat.java:287)
    ... 14 more
{code}

Original table: bulk insert row writer, non-partitioned key generator, meta fields disabled.
[GitHub] [hudi] nsivabalan opened a new pull request, #5747: [HUDI-4171] Fixing Non partitioned with virtual keys in read path
nsivabalan opened a new pull request, #5747: URL: https://github.com/apache/hudi/pull/5747

## What is the purpose of the pull request

When the non-partitioned key generator is used with virtual keys, the read path could break since the partition path may not exist. Fixing that in this patch.

## Brief change log

- Fixed generating virtual key info on the read path.

## Verify this pull request

Added a test to TestHoodieParquetInputFormat to validate the fix.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
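The stack trace in HUDI-4171 points at dereferencing an absent partition field while building the virtual key info. The sketch below shows the shape of guard such a fix implies; it is illustrative only, with hypothetical names, and is not the actual patch:

```java
import java.util.Optional;

public class VirtualKeyInfoSketch {
  // Illustrative only: with a non-partitioned key generator there is no
  // partition field, so the reader must tolerate an absent partition column
  // instead of dereferencing null.
  static Optional<String> partitionColumn(String[] partitionFields) {
    if (partitionFields == null || partitionFields.length == 0) {
      return Optional.empty(); // non-partitioned table: no partition path column
    }
    return Optional.of(partitionFields[0]);
  }

  public static void main(String[] args) {
    System.out.println(partitionColumn(null));                 // Optional.empty
    System.out.println(partitionColumn(new String[]{"dt"}));   // Optional[dt]
  }
}
```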
[GitHub] [hudi] hudi-bot commented on pull request #5737: [HUDI-4178][Stacked on 5733] Fixing `HoodieSpark3Analysis` missing to pass schema from Spark Catalog
hudi-bot commented on PR #5737: URL: https://github.com/apache/hudi/pull/5737#issuecomment-1146435260

## CI report:

* 0290f672dd33aefe7ad33edc95e12979ebf035bd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9045)
* f84be3540d82f6bbb06e3f690671465298f63c9c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9067)
[GitHub] [hudi] hudi-bot commented on pull request #5737: [HUDI-4178][Stacked on 5733] Fixing `HoodieSpark3Analysis` missing to pass schema from Spark Catalog
hudi-bot commented on PR #5737: URL: https://github.com/apache/hudi/pull/5737#issuecomment-1146433477

## CI report:

* 0290f672dd33aefe7ad33edc95e12979ebf035bd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9045)
* f84be3540d82f6bbb06e3f690671465298f63c9c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
hudi-bot commented on PR #5733: URL: https://github.com/apache/hudi/pull/5733#issuecomment-1146433441

## CI report:

* 1e269381cb33f0f92be0749eeea66c3368fc225e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9041)
* e6173f00c290e9e0bb55e4c7d8092f5eb26871ae Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9066)
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
alexeykudinkin commented on code in PR #5733: URL: https://github.com/apache/hudi/pull/5733#discussion_r889401027

In `hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java`:

```diff
@@ -176,86 +124,25 @@ public Schema getTableAvroSchema() throws Exception {
    * @throws Exception
    */
   public Schema getTableAvroSchema(boolean includeMetadataFields) throws Exception {
-    Schema schema;
-    Option<Schema> schemaFromCommitMetadata = getTableSchemaFromCommitMetadata(includeMetadataFields);
-    if (schemaFromCommitMetadata.isPresent()) {
-      schema = schemaFromCommitMetadata.get();
-    } else {
-      Option<Schema> schemaFromTableConfig = metaClient.getTableConfig().getTableCreateSchema();
-      if (schemaFromTableConfig.isPresent()) {
-        if (includeMetadataFields) {
-          schema = HoodieAvroUtils.addMetadataFields(schemaFromTableConfig.get(), hasOperationField);
-        } else {
-          schema = schemaFromTableConfig.get();
-        }
-      } else {
-        if (includeMetadataFields) {
-          schema = getTableAvroSchemaFromDataFile();
-        } else {
-          schema = HoodieAvroUtils.removeMetadataFields(getTableAvroSchemaFromDataFile());
-        }
-      }
-    }
-
-    Option<String[]> partitionFieldsOpt = metaClient.getTableConfig().getPartitionFields();
-    if (metaClient.getTableConfig().shouldDropPartitionColumns()) {
-      schema = recreateSchemaWhenDropPartitionColumns(partitionFieldsOpt, schema);
-    }
-    return schema;
+    return getTableAvroSchemaInternal(includeMetadataFields, Option.empty());
   }

-  public static Schema recreateSchemaWhenDropPartitionColumns(Option<String[]> partitionFieldsOpt, Schema originSchema) {
-    // when hoodie.datasource.write.drop.partition.columns is true, partition columns can't be persisted in data files.
-    // And there are no partition schema if the schema is parsed from data files.
-    // Here we create partition Fields for this case, and use StringType as the data type.
-    Schema schema = originSchema;
-    if (partitionFieldsOpt.isPresent() && partitionFieldsOpt.get().length != 0) {
-      List<String> partitionFields = Arrays.asList(partitionFieldsOpt.get());
-
-      final Schema schema0 = originSchema;
-      boolean hasPartitionColNotInSchema = partitionFields.stream().anyMatch(
-          pt -> !HoodieAvroUtils.containsFieldInSchema(schema0, pt)
-      );
-      boolean hasPartitionColInSchema = partitionFields.stream().anyMatch(
-          pt -> HoodieAvroUtils.containsFieldInSchema(schema0, pt)
-      );
-      if (hasPartitionColNotInSchema && hasPartitionColInSchema) {
-        throw new HoodieIncompatibleSchemaException(
-            "Not support: Partial partition fields are still in the schema "
-                + "when enable hoodie.datasource.write.drop.partition.columns");
-      }
-
-      if (hasPartitionColNotInSchema) {
-        // when hasPartitionColNotInSchema is true and hasPartitionColInSchema is false, all partition columns
-        // are not in originSchema. So we create and add them.
-        List<Schema.Field> newFields = new ArrayList<>();
-        for (String partitionField : partitionFields) {
-          newFields.add(new Schema.Field(
-              partitionField, createNullableSchema(Schema.Type.STRING), "", JsonProperties.NULL_VALUE));
-        }
-        schema = appendFieldsToSchema(schema, newFields);
-      }
-    }
-    return schema;
+  /**
+   * Fetches tables schema in Avro format as of the given instant
+   *
+   * @param instant as of which table's schema will be fetched
+   */
+  public Schema getTableAvroSchema(HoodieInstant instant, boolean includeMetadataFields) throws Exception {
+    return getTableAvroSchemaInternal(includeMetadataFields, Option.of(instant));
   }

   /**
    * Gets full schema (user + metadata) for a hoodie table in Parquet format.
    *
    * @return Parquet schema for the table
-   * @throws Exception
    */
   public MessageType getTableParquetSchema() throws Exception {
-    Option<Schema> schemaFromCommitMetadata = getTableSchemaFromCommitMetadata(true);
-    if (schemaFromCommitMetadata.isPresent()) {
-      return convertAvroSchemaToParquet(schemaFromCommitMetadata.get());
-    }
-    Option<Schema> schemaFromTableConfig = metaClient.getTableConfig().getTableCreateSchema();
-    if (schemaFromTableConfig.isPresent()) {
-      Schema schema = HoodieAvroUtils.addMetadataFields(schemaFromTableConfig.get(), hasOperationField);
-      return convertAvroSchemaToParquet(schema);
-    }
-    return getTableParquetSchemaFromDataFile();
+    return convertAvroSchemaToParquet(getTableAvroSchema(true));
```

Review Comment: It was not handled correctly before; this config has to be handled in all code-paths.
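For context on the `recreateSchemaWhenDropPartitionColumns` logic removed above: when `hoodie.datasource.write.drop.partition.columns` is enabled, partition columns are not persisted in data files, so the resolver re-appends them to the schema as nullable strings. A standalone sketch of that re-append step using the plain Avro API rather than Hudi's helpers (record and field names here are made up):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.JsonProperties;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class AppendPartitionColsDemo {
  public static void main(String[] args) {
    Schema base = SchemaBuilder.record("rec").fields()
        .requiredString("id").endRecord();

    // Nullable string, mirroring createNullableSchema(Schema.Type.STRING) in the diff above
    Schema nullableString = Schema.createUnion(
        Schema.create(Schema.Type.NULL), Schema.create(Schema.Type.STRING));

    // Avro fields cannot be reused across records, so copy the existing ones
    List<Schema.Field> fields = new ArrayList<>();
    for (Schema.Field f : base.getFields()) {
      fields.add(new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
    }
    // Re-add the dropped partition column with a null default
    fields.add(new Schema.Field("dt", nullableString, "", JsonProperties.NULL_VALUE));

    Schema rebuilt = Schema.createRecord(base.getName(), base.getDoc(),
        base.getNamespace(), false, fields);
    System.out.println(rebuilt.toString(true));
  }
}
```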
[GitHub] [hudi] hudi-bot commented on pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
hudi-bot commented on PR #5733: URL: https://github.com/apache/hudi/pull/5733#issuecomment-1146431494

## CI report:

* 1e269381cb33f0f92be0749eeea66c3368fc225e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9041)
* e6173f00c290e9e0bb55e4c7d8092f5eb26871ae UNKNOWN
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5733: [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing
alexeykudinkin commented on code in PR #5733: URL: https://github.com/apache/hudi/pull/5733#discussion_r889400793

In `hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java`:

```diff
@@ -58,100 +60,46 @@
 import org.apache.parquet.hadoop.metadata.ParquetMetadata;
 import org.apache.parquet.schema.MessageType;

+import javax.annotation.concurrent.ThreadSafe;
 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Iterator;
 import java.util.List;
+import java.util.concurrent.ConcurrentHashMap;

 import static org.apache.hudi.avro.AvroSchemaUtils.appendFieldsToSchema;
+import static org.apache.hudi.avro.AvroSchemaUtils.containsFieldInSchema;
 import static org.apache.hudi.avro.AvroSchemaUtils.createNullableSchema;

 /**
  * Helper class to read schema from data files and log files and to convert it between different formats.
- *
- * TODO(HUDI-3626) cleanup
  */
+@ThreadSafe
 public class TableSchemaResolver {

   private static final Logger LOG = LogManager.getLogger(TableSchemaResolver.class);

-  private final HoodieTableMetaClient metaClient;
-  private final boolean hasOperationField;
-
-  public TableSchemaResolver(HoodieTableMetaClient metaClient) {
-    this.metaClient = metaClient;
-    this.hasOperationField = hasOperationField();
-  }
+  private final HoodieTableMetaClient metaClient;

   /**
-   * Gets the schema for a hoodie table. Depending on the type of table, read from any file written in the latest
-   * commit. We will assume that the schema has not changed within a single atomic write.
+   * NOTE: {@link HoodieCommitMetadata} could be of non-trivial size for large tables (in 100s of Mbs)
+   * and therefore we'd want to limit amount of throw-away work being performed while fetching
+   * commits' metadata
    *
-   * @return Parquet schema for this table
+   * Please check out corresponding methods to fetch commonly used instances of {@link HoodieCommitMetadata}:
+   * {@link #getLatestCommitMetadataWithValidSchema()},
+   * {@link #getLatestCommitMetadataWithValidSchema()},
+   * {@link #getCachedCommitMetadata(HoodieInstant)}
    */
-  private MessageType getTableParquetSchemaFromDataFile() {
-    HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();
-    Option<Pair<HoodieInstant, HoodieCommitMetadata>> instantAndCommitMetadata =
-        activeTimeline.getLastCommitMetadataWithValidData();
-    try {
-      switch (metaClient.getTableType()) {
-        case COPY_ON_WRITE:
-          // For COW table, the file has data written must be in parquet or orc format currently.
-          if (instantAndCommitMetadata.isPresent()) {
-            HoodieCommitMetadata commitMetadata = instantAndCommitMetadata.get().getRight();
-            Iterator<String> filePaths = commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().iterator();
-            return fetchSchemaFromFiles(filePaths);
-          } else {
-            throw new IllegalArgumentException("Could not find any data file written for commit, "
-                + "so could not get schema for table " + metaClient.getBasePath());
-          }
-        case MERGE_ON_READ:
-          // For MOR table, the file has data written may be a parquet file, .log file, orc file or hfile.
-          // Determine the file format based on the file name, and then extract schema from it.
-          if (instantAndCommitMetadata.isPresent()) {
-            HoodieCommitMetadata commitMetadata = instantAndCommitMetadata.get().getRight();
-            Iterator<String> filePaths = commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().iterator();
-            return fetchSchemaFromFiles(filePaths);
-          } else {
-            throw new IllegalArgumentException("Could not find any data file written for commit, "
-                + "so could not get schema for table " + metaClient.getBasePath());
-          }
-        default:
-          LOG.error("Unknown table type " + metaClient.getTableType());
-          throw new InvalidTableException(metaClient.getBasePath());
-      }
-    } catch (IOException e) {
-      throw new HoodieException("Failed to read data schema", e);
-    }
-  }
+  private final Lazy<ConcurrentHashMap<HoodieInstant, HoodieCommitMetadata>> commitMetadataCache;
```

Review Comment: Discussed offline: `TableSchemaResolver` is a short-lived object not meant to be refreshed; to get the latest schema you will have to create another instance with the refreshed `HoodieMetaClient` instance.
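To make that "short-lived object" contract concrete: a caller needing a fresher schema is expected to rebuild the resolver against a reloaded meta client rather than reuse an old instance. A hedged usage sketch; it assumes the standard `HoodieTableMetaClient.reload` and `TableSchemaResolver` APIs, whose exact signatures may differ across versions:

```java
import org.apache.avro.Schema;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.TableSchemaResolver;

public class SchemaRefreshSketch {
  // Illustrative only: per the comment above, TableSchemaResolver caches commit
  // metadata for its lifetime, so a fresh schema requires a fresh resolver.
  static Schema latestSchema(HoodieTableMetaClient metaClient) throws Exception {
    HoodieTableMetaClient refreshed = HoodieTableMetaClient.reload(metaClient);
    return new TableSchemaResolver(refreshed).getTableAvroSchema();
  }
}
```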
[GitHub] [hudi] leesf commented on issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used
leesf commented on issue #5488: URL: https://github.com/apache/hudi/issues/5488#issuecomment-1146421860

Closing the issue, @parisni please reopen if you have new problems.
[GitHub] [hudi] leesf closed issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used
leesf closed issue #5488: [SUPPORT] Read hive Table fail when HoodieCatalog used URL: https://github.com/apache/hudi/issues/5488
[GitHub] [hudi] leesf commented on issue #5537: hudi supports custom catalog name, spark_catalog is not mandatory
leesf commented on issue #5537: URL: https://github.com/apache/hudi/issues/5537#issuecomment-1146419871

@melin I think you can point `spark_catalog` at `HoodieCatalog` and use a separate custom catalog for the Iceberg catalog as a workaround for now, since Hudi currently does not support custom catalog names.
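For context, the suggested workaround amounts to wiring the two catalogs in the Spark session config. The snippet below is a hedged example: the class names are the usual ones for Hudi 0.11.x and Iceberg's Spark runtime, but verify them against the versions you deploy:

```properties
# Hudi takes over the built-in session catalog...
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
# ...while Iceberg tables live under a separately named catalog
spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_catalog.type=hive
```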
[hudi] branch asf-site updated: Resized the blog images , tags onHover:blue, readingTime only visible in blogs page (#5745)
This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:

    new eb17d00b22 Resized the blog images, tags onHover:blue, readingTime only visible in blogs page (#5745)

eb17d00b22 is described below:

    commit eb17d00b2239dfb42f3b7643cac40c0dd36a7bd9
    Author: yadav-jai <97013124+yadav-...@users.noreply.github.com>
    AuthorDate: Sat Jun 4 04:04:07 2022 +0530

    Resized the blog images, tags onHover:blue, readingTime only visible in blogs page (#5745)

---
 website/src/css/custom.css              | 9 +++++---
 website/src/theme/BlogPostItem/index.js | 6 +++---
 2 files changed, 9 insertions(+), 6 deletions(-)

```diff
diff --git a/website/src/css/custom.css b/website/src/css/custom.css
index cee132bb1b..57eeca4bb2 100644
--- a/website/src/css/custom.css
+++ b/website/src/css/custom.css
@@ -205,15 +205,18 @@ footer .container {
 }

 .blogThumbnail img {
-  height: auto;
-  width: 100%;
+  height: 100%;
+  width: auto;
+}
+.tagRegular_node_modules-\@docusaurus-theme-classic-lib-next-theme-Tag-styles-module {
+  color: black
 }
-
 .blog-list-page article {
   display: inline-flex;
   width: 45%;
+  margin: 1.2em;
   vertical-align: text-top;

diff --git a/website/src/theme/BlogPostItem/index.js b/website/src/theme/BlogPostItem/index.js
index 1478e1861f..8c9525e2f2 100644
--- a/website/src/theme/BlogPostItem/index.js
+++ b/website/src/theme/BlogPostItem/index.js
```

[The JSX hunks of `index.js` were stripped of their markup in the archive; the recoverable change restyles the "Tags:" row (hunks at lines 68 and 155) and scopes the readingTime display to the blog post page via `isBlogPostPage`.]
[GitHub] [hudi] bhasudha merged pull request #5745: [MINOR][UI]Resized the blog images , tags onHover:blue, readingTime only visible in blogs page
bhasudha merged PR #5745: URL: https://github.com/apache/hudi/pull/5745
[GitHub] [hudi] xushiyan commented on issue #5729: [SUPPORT] Environment issues when running Demo for v0.11
xushiyan commented on issue #5729: URL: https://github.com/apache/hudi/issues/5729#issuecomment-1146407055

@GnsCy right; this is more likely caused by out-of-date instructions or configs that may have changed in the newer release.
[jira] [Updated] (HUDI-4178) Performance regressions in Spark DataSourceV2 Integration
[ https://issues.apache.org/jira/browse/HUDI-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4178:
- Story Points: 4 (was: 1)
- Summary: Performance regressions in Spark DataSourceV2 Integration (was: HoodieSpark3Analysis does not pass schema from Spark Catalog)
[jira] [Updated] (HUDI-4178) HoodieSpark3Analysis does not pass schema from Spark Catalog
[ https://issues.apache.org/jira/browse/HUDI-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4178:
- Description updated.

New description: There are multiple issues with our current DataSource V2 integrations. Because we advertise Hudi tables as V2, Spark expects them to implement certain APIs which are not implemented at the moment; instead we're using a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to V1 APIs. This poses the following problems:

1. It doesn't fully implement Spark's protocol: for example, this rule doesn't cache the produced `LogicalPlan`, making Spark re-create Hudi relations from scratch (including doing a full table file-listing) for every query reading the table. However, adding the caching in that sequence is not an option, since V2 APIs manage the cache differently, and therefore for us to be able to leverage that cache we will have to manage all of its lifecycle (adding, flushing).
2. Additionally, the HoodieSpark3Analysis rule does not pass the table's schema from the Spark Catalog to Hudi's relations, making them fetch the schema from storage (either from commit metadata or a data file) every time.

Previous description: Currently, the HoodieSpark3Analysis rule does not pass the table's schema from the Spark Catalog to Hudi's relations, making them fetch the schema from storage (either from commit metadata or a data file) every time.

> Key: HUDI-4178
> URL: https://issues.apache.org/jira/browse/HUDI-4178
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.11.0
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.1
[GitHub] [hudi] leesf commented on pull request #5743: [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables
leesf commented on PR #5743: URL: https://github.com/apache/hudi/pull/5743#issuecomment-1146401813 > @leesf can you please add a description to the PR and also the Jira? @alexeykudinkin sure and done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4183) Fix using HoodieCatalog to create non-hudi tables
[ https://issues.apache.org/jira/browse/HUDI-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leesf updated HUDI-4183: Description: Currently, when users configure `HoodieCatalog` in 0.11.0, they cannot create non-Hudi tables, since HoodieCatalog#createTable does not handle non-Hudi tables; the logic is missing from the #createTable method, and we should fix it. > Fix using HoodieCatalog to create non-hudi tables > - > > Key: HUDI-4183 > URL: https://issues.apache.org/jira/browse/HUDI-4183 > Project: Apache Hudi > Issue Type: Improvement >Reporter: leesf >Assignee: leesf >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > Currently, when users configure `HoodieCatalog` in 0.11.0, they cannot create > non-Hudi tables, since HoodieCatalog#createTable does not handle non-Hudi > tables; the logic is missing from the #createTable method, and we > should fix it. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[GitHub] [hudi] yihua commented on a diff in pull request #5664: [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys
yihua commented on code in PR #5664: URL: https://github.com/apache/hudi/pull/5664#discussion_r889380010 ## hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java: ## @@ -87,6 +89,7 @@ public BulkInsertDataInternalWriterHelper(HoodieTable hoodieTable, HoodieWriteCo this.populateMetaFields = populateMetaFields; this.arePartitionRecordsSorted = arePartitionRecordsSorted; this.fileIdPrefix = UUID.randomUUID().toString(); +this.isHiveStylePartitioning = writeConfig.isHiveStylePartitioningEnabled(); Review Comment: nit: `writeConfig` is saved inside this helper so we don't need to have another member variable `isHiveStylePartitioning`? ## hudi-spark-datasource/hudi-spark2/src/test/java/org/apache/hudi/internal/TestHoodieBulkInsertDataInternalWriter.java: ## @@ -109,6 +109,48 @@ public void testDataInternalWriter(boolean sorted, boolean populateMetaFields) t } } + @Test + public void testDataInternalWriterHiveStylePartitioning() throws Exception { +boolean sorted = true; +boolean populateMetaFields = false; +// init config and table +HoodieWriteConfig cfg = getWriteConfig(populateMetaFields, "true"); +HoodieTable table = HoodieSparkTable.create(cfg, context, metaClient); +for (int i = 0; i < 1; i++) { + String instantTime = "00" + i; + // init writer + HoodieBulkInsertDataInternalWriter writer = new HoodieBulkInsertDataInternalWriter(table, cfg, instantTime, RANDOM.nextInt(10), RANDOM.nextLong(), RANDOM.nextLong(), + STRUCT_TYPE, populateMetaFields, sorted); + + int size = 10 + RANDOM.nextInt(1000); + // write N rows to partition1, N rows to partition2 and N rows to partition3 ... Each batch should create a new RowCreateHandle and a new file + int batches = 3; + Dataset<Row> totalInputRows = null; + + for (int j = 0; j < batches; j++) { +String partitionPath = HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[j % 3]; +Dataset<Row> inputRows = getRandomRows(sqlContext, size, partitionPath, false); +writeRows(inputRows, writer); +if (totalInputRows == null) { + totalInputRows = inputRows; +} else { + totalInputRows = totalInputRows.union(inputRows); +} + } + + BaseWriterCommitMessage commitMetadata = (BaseWriterCommitMessage) writer.commit(); + Option<List<String>> fileAbsPaths = Option.of(new ArrayList<>()); + Option<List<String>> fileNames = Option.of(new ArrayList<>()); + + // verify write statuses + assertWriteStatuses(commitMetadata.getWriteStatuses(), batches, size, sorted, fileAbsPaths, fileNames); + + // verify rows + Dataset<Row> result = sqlContext.read().parquet(fileAbsPaths.get().toArray(new String[0])); + assertOutput(totalInputRows, result, instantTime, fileNames, populateMetaFields); Review Comment: Do we want to validate the hive-style partition path value somewhere? ## hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java: ## @@ -128,7 +133,11 @@ public void write(InternalRow record) throws IOException { if (!keyGeneratorOpt.isPresent()) { // NoPartitionerKeyGen partitionPath = ""; } else if (simpleKeyGen) { // SimpleKeyGen - partitionPath = (record.get(simplePartitionFieldIndex, simplePartitionFieldDataType)).toString(); + Object parititionPathValue = record.get(simplePartitionFieldIndex, simplePartitionFieldDataType); + partitionPath = parititionPathValue != null ? 
parititionPathValue.toString() : PartitionPathEncodeUtils.DEFAULT_PARTITION_PATH; + if (isHiveStylePartitioning) { +partitionPath = (keyGeneratorOpt.get()).getPartitionPathFields().get(0) + "=" + partitionPath; Review Comment: For `SimpleKeyGenerator`, there could be only one partition path field. Is that correct? ## hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java: ## @@ -128,7 +133,11 @@ public void write(InternalRow record) throws IOException { if (!keyGeneratorOpt.isPresent()) { // NoPartitionerKeyGen partitionPath = ""; } else if (simpleKeyGen) { // SimpleKeyGen - partitionPath = (record.get(simplePartitionFieldIndex, simplePartitionFieldDataType)).toString(); + Object parititionPathValue = record.get(simplePartitionFieldIndex, simplePartitionFieldDataType); + partitionPath = parititionPathValue != null ? parititionPathValue.toString() : PartitionPathEncodeUtils.DEFAULT_PARTITION_PATH; + if (isHiveStylePartitioning) { +partitionPath = (keyGeneratorOpt.get()).getPartitionPathFields().get(0) + "=" + partitionPath; Review Comment: @nsivabalan could you simply leverage `SimpleKeyGenerator::getParti
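To summarize the logic under review in these hunks: the partition path falls back to Hudi's default-partition placeholder for null values, and hive-style partitioning then prefixes the (single) partition field name. A minimal Scala sketch of that rule, with the placeholder constant inlined (in Hudi it comes from `PartitionPathEncodeUtils.DEFAULT_PARTITION_PATH`; the literal value below is an assumption):

```scala
// Sketch of hive-style partition-path construction for the single-field
// (SimpleKeyGenerator) case; the constant value is assumed here.
object PartitionPathSketch {
  val DefaultPartitionPath = "__HIVE_DEFAULT_PARTITION__" // assumed value of the Hudi constant

  def buildPartitionPath(partitionField: String, value: Any, hiveStyle: Boolean): String = {
    // Null partition values fall back to the default-partition placeholder.
    val path = Option(value).map(_.toString).getOrElse(DefaultPartitionPath)
    // Hive-style partitioning prefixes the field name, e.g. "dt=2022-06-03".
    if (hiveStyle) s"$partitionField=$path" else path
  }
}
```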
[GitHub] [hudi] ctlgdanielli commented on issue #4622: [SUPPORT] Can't query Redshift rows even after downgrade from 0.10
ctlgdanielli commented on issue #4622: URL: https://github.com/apache/hudi/issues/4622#issuecomment-1146360783 Hello, any updates? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5664: [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys
hudi-bot commented on PR #5664: URL: https://github.com/apache/hudi/pull/5664#issuecomment-1146345218 ## CI report: * 18654512f52bf46f458d0275a844fd4e625e32e4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9063) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5664: [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys
hudi-bot commented on PR #5664: URL: https://github.com/apache/hudi/pull/5664#issuecomment-1146301312 ## CI report: * db498bb903cdd264f29c3db616dab75bccffddaf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9062) * 18654512f52bf46f458d0275a844fd4e625e32e4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9063) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default
hudi-bot commented on PR #5402: URL: https://github.com/apache/hudi/pull/5402#issuecomment-1146269210 ## CI report: * 8c6f6e19940ce7ac04dfcfce52da3ccdaf3a8b0f UNKNOWN * c4799803cff8adffef56e889a5cd4d52599fcf73 UNKNOWN * c5616888bb267cb505a12b88cad3e99f9dd18d9b UNKNOWN * 3694b869048eff12b408a86e295ba88d3d3168fb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9061) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default
hudi-bot commented on PR #5402: URL: https://github.com/apache/hudi/pull/5402#issuecomment-1146263278 ## CI report: * 8c6f6e19940ce7ac04dfcfce52da3ccdaf3a8b0f UNKNOWN * c4799803cff8adffef56e889a5cd4d52599fcf73 UNKNOWN * c5616888bb267cb505a12b88cad3e99f9dd18d9b UNKNOWN * 3007879a9a938a65b1f7f9174c23f22f1bd82145 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9060) * 3694b869048eff12b408a86e295ba88d3d3168fb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9061) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5664: [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys
hudi-bot commented on PR #5664: URL: https://github.com/apache/hudi/pull/5664#issuecomment-1146260471 ## CI report: * 63c2aa08ecec2dbbe98823f2b88b52874346a085 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8997) * db498bb903cdd264f29c3db616dab75bccffddaf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9062) * 18654512f52bf46f458d0275a844fd4e625e32e4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default
hudi-bot commented on PR #5402: URL: https://github.com/apache/hudi/pull/5402#issuecomment-1146260103 ## CI report: * 8c6f6e19940ce7ac04dfcfce52da3ccdaf3a8b0f UNKNOWN * c4799803cff8adffef56e889a5cd4d52599fcf73 UNKNOWN * c5616888bb267cb505a12b88cad3e99f9dd18d9b UNKNOWN * 3007879a9a938a65b1f7f9174c23f22f1bd82145 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9060) * 3694b869048eff12b408a86e295ba88d3d3168fb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5664: [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys
hudi-bot commented on PR #5664: URL: https://github.com/apache/hudi/pull/5664#issuecomment-1146256754 ## CI report: * 63c2aa08ecec2dbbe98823f2b88b52874346a085 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8997) * db498bb903cdd264f29c3db616dab75bccffddaf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9062) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default
hudi-bot commented on PR #5402: URL: https://github.com/apache/hudi/pull/5402#issuecomment-1146256410 ## CI report: * 8c6f6e19940ce7ac04dfcfce52da3ccdaf3a8b0f UNKNOWN * c4799803cff8adffef56e889a5cd4d52599fcf73 UNKNOWN * c5616888bb267cb505a12b88cad3e99f9dd18d9b UNKNOWN * Unknown: [CANCELED](TBD) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] rahil-c commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default
rahil-c commented on PR #5402: URL: https://github.com/apache/hudi/pull/5402#issuecomment-1146255412 `@hudi-bot run azure` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] rahil-c commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default
rahil-c commented on PR #5402: URL: https://github.com/apache/hudi/pull/5402#issuecomment-1146255235 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5664: [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys
hudi-bot commented on PR #5664: URL: https://github.com/apache/hudi/pull/5664#issuecomment-1146224569 ## CI report: * 63c2aa08ecec2dbbe98823f2b88b52874346a085 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8997) * db498bb903cdd264f29c3db616dab75bccffddaf UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-4186) Support Hudi with Spark 3.3
Udit Mehrotra created HUDI-4186: --- Summary: Support Hudi with Spark 3.3 Key: HUDI-4186 URL: https://issues.apache.org/jira/browse/HUDI-4186 Project: Apache Hudi Issue Type: Epic Components: spark Reporter: Udit Mehrotra Spark 3.3 voting is currently in progress and should likely go through soon: https://github.com/apache/spark/tree/v3.3.0-rc4. We should support it for our next major release, 0.12. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-541) Replace variables/comments named "data files" to "base file"
[ https://issues.apache.org/jira/browse/HUDI-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-541: --- Fix Version/s: 0.12.0 (was: 0.11.1) > Replace variables/comments named "data files" to "base file" > > > Key: HUDI-541 > URL: https://issues.apache.org/jira/browse/HUDI-541 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality, dev-experience >Reporter: Vinoth Chandar >Assignee: Pratyaksh Sharma >Priority: Major > Labels: new-to-hudi, pull-request-available > Fix For: 0.12.0 > > > Per the cWiki design and architecture page, we should converge on the same terminology. > We have _HoodieBaseFile_; we should ensure all variables of this type are > named _baseFile_ or _bf_, as opposed to _dataFile_ or _df_. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[GitHub] [hudi] yihua commented on pull request #1650: [HUDI-541]: replaced dataFile/df with baseFile/bf throughout code base
yihua commented on PR #1650: URL: https://github.com/apache/hudi/pull/1650#issuecomment-1146205989 @pratyakshsharma could you rebase the PR on the latest master given there are conflicts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin opened a new pull request, #5746: [WIP] Optimize performance of Column Stats filtering seq
alexeykudinkin opened a new pull request, #5746: URL: https://github.com/apache/hudi/pull/5746 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request Currently, from our benchmarking, Column Stats has a somewhat static overhead of ~1-2s / table (~7 files). This PR is taking a stab at eliminating this overhead: - Avoiding capturing heavy objects in closures, which requires extensive cleaning by Spark - ... ## Brief change log - TBD ## Verify this pull request This pull request is already covered by existing tests, such as *(please describe tests)*. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
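On the first bullet above: a common source of such per-query overhead in Spark is accidentally capturing a heavy enclosing object in a transformation closure. An illustrative Scala sketch (not the PR's actual change, and all names here are hypothetical) of the usual remedy, copying the needed field into a local val:

```scala
import org.apache.spark.rdd.RDD

// Referencing a member of the enclosing class inside `filter` would capture `this`,
// forcing Spark's ClosureCleaner to analyze and serialize the whole object per query.
class ColumnStatsPruner(allStats: Map[String, (Long, Long)] /* file -> (min, max) */) {
  def candidateFiles(files: RDD[String], value: Long): RDD[String] = {
    // Copy only what the task needs into a local val; the closure now captures
    // just `stats`, keeping serialization and closure cleaning cheap.
    val stats = allStats
    files.filter { file =>
      stats.get(file).forall { case (min, max) => min <= value && value <= max }
    }
  }
}
```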
[GitHub] [hudi] vinothchandar closed pull request #1946: [HUDI-1176]Upgrade tp log4j2
vinothchandar closed pull request #1946: [HUDI-1176]Upgrade tp log4j2 URL: https://github.com/apache/hudi/pull/1946 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #1946: [HUDI-1176]Upgrade tp log4j2
vinothchandar commented on PR #1946: URL: https://github.com/apache/hudi/pull/1946#issuecomment-1146145764 Closing in favor of #5366 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #1637: [WIP] Adding benchmark for some of the write operations in Hudi using jmh
vinothchandar commented on PR #1637: URL: https://github.com/apache/hudi/pull/1637#issuecomment-1146144647 Closing this over the other perf efforts -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar closed pull request #1637: [WIP] Adding benchmark for some of the write operations in Hudi using jmh
vinothchandar closed pull request #1637: [WIP] Adding benchmark for some of the write operations in Hudi using jmh URL: https://github.com/apache/hudi/pull/1637 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar closed pull request #1514: [WIP] [HUDI-774] Addressing incorrect Spark to Avro schema generation
vinothchandar closed pull request #1514: [WIP] [HUDI-774] Addressing incorrect Spark to Avro schema generation URL: https://github.com/apache/hudi/pull/1514 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #1514: [WIP] [HUDI-774] Addressing incorrect Spark to Avro schema generation
vinothchandar commented on PR #1514: URL: https://github.com/apache/hudi/pull/1514#issuecomment-1146144129 I think the default values are not addressed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on pull request #5743: [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables
alexeykudinkin commented on PR #5743: URL: https://github.com/apache/hudi/pull/5743#issuecomment-1146142725 @leesf can you please add a description to the PR and also the Jira? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4185) Evaluate alternatives to using "hoodie.properties" as state store for Metadata Table
[ https://issues.apache.org/jira/browse/HUDI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4185: -- Fix Version/s: 0.12.0 > Evaluate alternatives to using "hoodie.properties" as state store for > Metadata Table > > > Key: HUDI-4185 > URL: https://issues.apache.org/jira/browse/HUDI-4185 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 0.12.0 > > > Currently the Metadata Table uses the "hoodie.properties" file as a state store, > adding properties that reflect the state of the metadata table being indexed. > This is creating some issues (for example, HUDI-4138) with respect to the > "hoodie.properties" lifecycle, as most of the existing code assumes > that the file is (mostly) immutable. > We should re-evaluate our usage of "hoodie.properties" as a state store, given > that it has ripple effects on the existing components. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-4185) Evaluate alternatives to using "hoodie.properties" as state store for Metadata Table
[ https://issues.apache.org/jira/browse/HUDI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4185: -- Priority: Blocker (was: Major) > Evaluate alternatives to using "hoodie.properties" as state store for > Metadata Table > > > Key: HUDI-4185 > URL: https://issues.apache.org/jira/browse/HUDI-4185 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Sagar Sumit >Priority: Blocker > > Currently the Metadata Table uses the "hoodie.properties" file as a state store, > adding properties that reflect the state of the metadata table being indexed. > This is creating some issues (for example, HUDI-4138) with respect to the > "hoodie.properties" lifecycle, as most of the existing code assumes > that the file is (mostly) immutable. > We should re-evaluate our usage of "hoodie.properties" as a state store, given > that it has ripple effects on the existing components. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HUDI-4185) Evaluate alternatives to using "hoodie.properties" as state store for Metadata Table
Alexey Kudinkin created HUDI-4185: - Summary: Evaluate alternatives to using "hoodie.properties" as state store for Metadata Table Key: HUDI-4185 URL: https://issues.apache.org/jira/browse/HUDI-4185 Project: Apache Hudi Issue Type: Bug Reporter: Alexey Kudinkin Assignee: Sagar Sumit Currently the Metadata Table uses the "hoodie.properties" file as a state store, adding properties that reflect the state of the metadata table being indexed. This is creating some issues (for example, HUDI-4138) with respect to the "hoodie.properties" lifecycle, as most of the existing code assumes that the file is (mostly) immutable. We should re-evaluate our usage of "hoodie.properties" as a state store, given that it has ripple effects on the existing components. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HUDI-4184) Creating external table in Spark SQL modifies "hoodie.properties"
Alexey Kudinkin created HUDI-4184: - Summary: Creating external table in Spark SQL modifies "hoodie.properties" Key: HUDI-4184 URL: https://issues.apache.org/jira/browse/HUDI-4184 Project: Apache Hudi Issue Type: Bug Reporter: Alexey Kudinkin Assignee: Sagar Sumit My setup was as follows: # There's a table existing in one AWS account. # I'm trying to access that table from Spark SQL from _another_ AWS account that only has read permissions on the bucket with the table. # Now, when issuing the "CREATE TABLE" Spark SQL command, it fails because Hudi tries to modify the "hoodie.properties" file for whatever reason, even though I'm not modifying the table and am just trying to create the table in the catalog. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[GitHub] [hudi] yadav-jai opened a new pull request, #5745: [MINOR][UI]Resized the blog images , tags onHover:blue, readingTime only visible in blogs page
yadav-jai opened a new pull request, #5745: URL: https://github.com/apache/hudi/pull/5745 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request The images earlier had a distorted width, so their sizing was fixed so that they look clear: https://user-images.githubusercontent.com/97013124/171902828-7306133c-6429-48b4-a78a-7f542fa00cdd.png Removed the reading time from the blog list page so that it is only visible on the blog's own page. Made the tags black by default and blue on hover. ## Brief change log - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request This pull request is a trivial rework / code cleanup without any test coverage. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf commented on a diff in pull request #5737: [HUDI-4178][Stacked on 5733] Fixing `HoodieSpark3Analysis` missing to pass schema from Spark Catalog
leesf commented on code in PR #5737: URL: https://github.com/apache/hudi/pull/5737#discussion_r889010463 ## hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark3Analysis.scala: ## @@ -45,16 +45,22 @@ case class HoodieSpark3Analysis(sparkSession: SparkSession) extends Rule[Logical with SparkAdapterSupport with ProvidesHoodieConfig { override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsDown { -case dsv2 @ DataSourceV2Relation(d: HoodieInternalV2Table, _, _, _, _) => - val output = dsv2.output - val catalogTable = if (d.catalogTable.isDefined) { -Some(d.v1Table) - } else { -None - } - val relation = new DefaultSource().createRelation(new SQLContext(sparkSession), -buildHoodieConfig(d.hoodieCatalogTable)) - LogicalRelation(relation, output, catalogTable, isStreaming = false) +// NOTE: This step is required since Hudi relations don't currently implement DS V2 Read API +case dsv2 @ DataSourceV2Relation(tbl: HoodieInternalV2Table, _, _, _, _) => + val qualifiedTableName = QualifiedTableName(tbl.v1Table.database, tbl.v1Table.identifier.table) + val catalog = sparkSession.sessionState.catalog + + catalog.getCachedPlan(qualifiedTableName, () => { Review Comment: > @vinothchandar does reverting back to v1 mean dropping `HoodieCatalog`? If so, I do not think it is a good idea, since users on 0.11.0 would be using the `HoodieCatalog` config, and the behavior would change if we drop `HoodieCatalog`. And please see my comment https://github.com/apache/hudi/pull/5737/files#r889018883 above -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default
hudi-bot commented on PR #5402: URL: https://github.com/apache/hudi/pull/5402#issuecomment-1146087261 ## CI report: * 8c6f6e19940ce7ac04dfcfce52da3ccdaf3a8b0f UNKNOWN * c4799803cff8adffef56e889a5cd4d52599fcf73 UNKNOWN * c5616888bb267cb505a12b88cad3e99f9dd18d9b UNKNOWN * 3007879a9a938a65b1f7f9174c23f22f1bd82145 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9060) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] fzong76 commented on issue #5735: No hudi dataset was saved to s3
fzong76 commented on issue #5735: URL: https://github.com/apache/hudi/issues/5735#issuecomment-1146083416 Yes. I just started with one table since it's the first time I tried HoodieMultiTableDeltaStreamer. There is no exception in the logs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on issue #5728: [SUPPORT] Flink support Timeline-server-based marker
danny0405 commented on issue #5728: URL: https://github.com/apache/hudi/issues/5728#issuecomment-1146071893 > Email received! Thank you! 薛超 It should be supported: https://issues.apache.org/jira/browse/HUDI-2767 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a diff in pull request #5737: [HUDI-4178][Stacked on 5733] Fixing `HoodieSpark3Analysis` missing to pass schema from Spark Catalog
vinothchandar commented on code in PR #5737: URL: https://github.com/apache/hudi/pull/5737#discussion_r889028951 ## hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark3Analysis.scala: ## @@ -45,16 +45,22 @@ case class HoodieSpark3Analysis(sparkSession: SparkSession) extends Rule[Logical with SparkAdapterSupport with ProvidesHoodieConfig { override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsDown { -case dsv2 @ DataSourceV2Relation(d: HoodieInternalV2Table, _, _, _, _) => - val output = dsv2.output - val catalogTable = if (d.catalogTable.isDefined) { -Some(d.v1Table) - } else { -None - } - val relation = new DefaultSource().createRelation(new SQLContext(sparkSession), -buildHoodieConfig(d.hoodieCatalogTable)) - LogicalRelation(relation, output, catalogTable, isStreaming = false) +// NOTE: This step is required since Hudi relations don't currently implement DS V2 Read API +case dsv2 @ DataSourceV2Relation(tbl: HoodieInternalV2Table, _, _, _, _) => + val qualifiedTableName = QualifiedTableName(tbl.v1Table.database, tbl.v1Table.identifier.table) + val catalog = sparkSession.sessionState.catalog + + catalog.getCachedPlan(qualifiedTableName, () => { Review Comment: I am just asking for ideas to fix this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default
hudi-bot commented on PR #5402: URL: https://github.com/apache/hudi/pull/5402#issuecomment-1146049163 ## CI report: * 8c6f6e19940ce7ac04dfcfce52da3ccdaf3a8b0f UNKNOWN * c4799803cff8adffef56e889a5cd4d52599fcf73 UNKNOWN * c5616888bb267cb505a12b88cad3e99f9dd18d9b UNKNOWN * c02afe06f4b0d02291112351f62b1f4046faccc1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9055) * 3007879a9a938a65b1f7f9174c23f22f1bd82145 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf commented on a diff in pull request #5737: [HUDI-4178][Stacked on 5733] Fixing `HoodieSpark3Analysis` missing to pass schema from Spark Catalog
leesf commented on code in PR #5737: URL: https://github.com/apache/hudi/pull/5737#discussion_r889010463 ## hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark3Analysis.scala: ## @@ -45,16 +45,22 @@ case class HoodieSpark3Analysis(sparkSession: SparkSession) extends Rule[Logical with SparkAdapterSupport with ProvidesHoodieConfig { override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsDown { -case dsv2 @ DataSourceV2Relation(d: HoodieInternalV2Table, _, _, _, _) => - val output = dsv2.output - val catalogTable = if (d.catalogTable.isDefined) { -Some(d.v1Table) - } else { -None - } - val relation = new DefaultSource().createRelation(new SQLContext(sparkSession), -buildHoodieConfig(d.hoodieCatalogTable)) - LogicalRelation(relation, output, catalogTable, isStreaming = false) +// NOTE: This step is required since Hudi relations don't currently implement DS V2 Read API +case dsv2 @ DataSourceV2Relation(tbl: HoodieInternalV2Table, _, _, _, _) => + val qualifiedTableName = QualifiedTableName(tbl.v1Table.database, tbl.v1Table.identifier.table) + val catalog = sparkSession.sessionState.catalog + + catalog.getCachedPlan(qualifiedTableName, () => { Review Comment: > @vinothchandar does reverting back to v1 mean dropping `HoodieCatalog`? If so, I do not think it is a good idea, since users on 0.11.0 would be using the `HoodieCatalog` config, and the behavior would change if we drop `HoodieCatalog`, which is not a good idea. And please see my comment https://github.com/apache/hudi/pull/5737/files#r889018883 above -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf commented on a diff in pull request #5737: [HUDI-4178][Stacked on 5733] Fixing `HoodieSpark3Analysis` missing to pass schema from Spark Catalog
leesf commented on code in PR #5737: URL: https://github.com/apache/hudi/pull/5737#discussion_r889018883 ## hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark3Analysis.scala: ## @@ -45,16 +45,22 @@ case class HoodieSpark3Analysis(sparkSession: SparkSession) extends Rule[Logical with SparkAdapterSupport with ProvidesHoodieConfig { override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsDown { -case dsv2 @ DataSourceV2Relation(d: HoodieInternalV2Table, _, _, _, _) => - val output = dsv2.output - val catalogTable = if (d.catalogTable.isDefined) { -Some(d.v1Table) - } else { -None - } - val relation = new DefaultSource().createRelation(new SQLContext(sparkSession), -buildHoodieConfig(d.hoodieCatalogTable)) - LogicalRelation(relation, output, catalogTable, isStreaming = false) +// NOTE: This step is required since Hudi relations don't currently implement DS V2 Read API +case dsv2 @ DataSourceV2Relation(tbl: HoodieInternalV2Table, _, _, _, _) => Review Comment: The changes look good, and I think the changes here solve the problem of passing the schema to the V1 implementation (i.e., DefaultSource); are any other changes needed? @alexeykudinkin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (HUDI-4178) HoodieSpark3Analysis does not pass schema from Spark Catalog
[ https://issues.apache.org/jira/browse/HUDI-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545899#comment-17545899 ] leesf edited comment on HUDI-4178 at 6/3/22 2:46 PM: - [~alexey.kudinkin] does `making them fetch the schema from storage (either from commit's metadata or data file) every time.` mean the schema is fetched only once per write operation, or many times per write operation? And how much does fetching from storage affect performance? was (Author: xleesf): [~alexey.kudinkin] does `making them fetch the schema from storage (either from commit's metadata or data file) every time.` mean the schema is fetched only once per write operation, or many times per write operation? > HoodieSpark3Analysis does not pass schema from Spark Catalog > > > Key: HUDI-4178 > URL: https://issues.apache.org/jira/browse/HUDI-4178 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.11.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.1 > > > Currently, the HoodieSpark3Analysis rule does not pass the table's schema from the > Spark Catalog to Hudi's relations, making them fetch the schema from storage > (either from a commit's metadata or a data file) every time. > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (HUDI-4178) HoodieSpark3Analysis does not pass schema from Spark Catalog
[ https://issues.apache.org/jira/browse/HUDI-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545899#comment-17545899 ] leesf edited comment on HUDI-4178 at 6/3/22 2:46 PM: - [~alexey.kudinkin] hi, does `making them fetch the schema from storage (either from commit's metadata or data file) every time.` mean the schema is fetched only once per write operation, or many times per write operation? And how much does fetching from storage affect performance? was (Author: xleesf): [~alexey.kudinkin] does `making them fetch the schema from storage (either from commit's metadata or data file) every time.` mean the schema is fetched only once per write operation, or many times per write operation? And how much does fetching from storage affect performance? > HoodieSpark3Analysis does not pass schema from Spark Catalog > > > Key: HUDI-4178 > URL: https://issues.apache.org/jira/browse/HUDI-4178 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.11.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.1 > > > Currently, the HoodieSpark3Analysis rule does not pass the table's schema from the > Spark Catalog to Hudi's relations, making them fetch the schema from storage > (either from a commit's metadata or a data file) every time. > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (HUDI-4178) HoodieSpark3Analysis does not pass schema from Spark Catalog
[ https://issues.apache.org/jira/browse/HUDI-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545899#comment-17545899 ] leesf commented on HUDI-4178: - [~alexey.kudinkin] does `making them fetch the schema from storage (either from commit's metadata or data file) every time.` mean the schema is fetched only once per write operation, or many times per write operation? > HoodieSpark3Analysis does not pass schema from Spark Catalog > > > Key: HUDI-4178 > URL: https://issues.apache.org/jira/browse/HUDI-4178 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.11.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.1 > > > Currently, the HoodieSpark3Analysis rule does not pass the table's schema from the > Spark Catalog to Hudi's relations, making them fetch the schema from storage > (either from a commit's metadata or a data file) every time. > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[GitHub] [hudi] leesf commented on a diff in pull request #5737: [HUDI-4178][Stacked on 5733] Fixing `HoodieSpark3Analysis` missing to pass schema from Spark Catalog
leesf commented on code in PR #5737: URL: https://github.com/apache/hudi/pull/5737#discussion_r889010463 ## hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark3Analysis.scala: ## @@ -45,16 +45,22 @@ case class HoodieSpark3Analysis(sparkSession: SparkSession) extends Rule[Logical with SparkAdapterSupport with ProvidesHoodieConfig { override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsDown { -case dsv2 @ DataSourceV2Relation(d: HoodieInternalV2Table, _, _, _, _) => - val output = dsv2.output - val catalogTable = if (d.catalogTable.isDefined) { -Some(d.v1Table) - } else { -None - } - val relation = new DefaultSource().createRelation(new SQLContext(sparkSession), -buildHoodieConfig(d.hoodieCatalogTable)) - LogicalRelation(relation, output, catalogTable, isStreaming = false) +// NOTE: This step is required since Hudi relations don't currently implement DS V2 Read API +case dsv2 @ DataSourceV2Relation(tbl: HoodieInternalV2Table, _, _, _, _) => + val qualifiedTableName = QualifiedTableName(tbl.v1Table.database, tbl.v1Table.identifier.table) + val catalog = sparkSession.sessionState.catalog + + catalog.getCachedPlan(qualifiedTableName, () => { Review Comment: > @vinothchandar does reverting back to v1 mean dropping `HoodieCatalog`? If so, I do not think it is a good idea, since users on 0.11.0 would be using the `HoodieCatalog` config, and the behavior would change if we drop `HoodieCatalog`, which is not a good idea. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bkosuru commented on issue #5741: [SUPPORT] Hudi table copy failed for some partitions in 0.11.0
bkosuru commented on issue #5741: URL: https://github.com/apache/hudi/issues/5741#issuecomment-1145972813 No issues if I set option("hoodie.metadata.enable", false) for the writer in 0.11.0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
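For readers hitting the same issue, a hypothetical Scala sketch of the workaround described above (disabling the metadata table on the writer); the table name, key fields, and path below are illustrative, not taken from the reporter's job:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Write to a Hudi table with the metadata table disabled on the writer side.
def writeWithoutMetadataTable(df: DataFrame, basePath: String): Unit = {
  df.write.format("hudi")
    .option("hoodie.table.name", "my_table")                  // illustrative
    .option("hoodie.datasource.write.recordkey.field", "id")  // illustrative
    .option("hoodie.datasource.write.precombine.field", "ts") // illustrative
    .option("hoodie.metadata.enable", "false")                // the workaround reported above
    .mode(SaveMode.Append)
    .save(basePath)
}
```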
[GitHub] [hudi] YannByron commented on pull request #5436: [RFC-51] [HUDI-3478] Change Data Capture RFC
YannByron commented on PR #5436: URL: https://github.com/apache/hudi/pull/5436#issuecomment-1145952554 @vinothchandar > Actually what I proposed, everything uses CDC blocks. Just that when we are deriving on-the-fly we don't write before and after into the CDC blocks In this case, do you mean that only `op` and `_hoodie_record_key` will be kept in the cdc block? Then we iterate over this cdc block, get the after-image value and the inserted value from the new file (base file or log file), and get the before-image value and the deleted value from the previous file slice. If so, IMO, the cdc blocks in this case can be omitted, because we can iterate over the log file or the base file (applying the filter `_hoodie_commit_time` = the current commit time) and continue with the next operations. > everything uses CDC blocks. In my design the cdc block has the whole cdc information, and the cdc block will be written out only when `HoodieMergeHandle` is called, not always. Other scenarios can re-use the existing files. I am afraid there is still a gap here, so I want to stress this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
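To make the terminology in this exchange concrete, a hypothetical Scala shape for a CDC record; the field names are illustrative, not the RFC's final schema:

```scala
// A "thin" cdc block would persist only the operation and record key, deriving
// before/after images on the fly from the base/log files of the file slice, while a
// "full" block (written when HoodieMergeHandle runs, per the proposal) also
// materializes the images.
case class CdcRecord(
    op: String,                              // "insert" / "update" / "delete"
    hoodieRecordKey: String,                 // maps to _hoodie_record_key
    before: Option[Map[String, Any]] = None, // before-image; absent in the thin variant
    after: Option[Map[String, Any]] = None   // after-image; absent in the thin variant
)
```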
[GitHub] [hudi] YannByron commented on a diff in pull request #5737: [HUDI-4178][Stacked on 5733] Fixing `HoodieSpark3Analysis` missing to pass schema from Spark Catalog
YannByron commented on code in PR #5737: URL: https://github.com/apache/hudi/pull/5737#discussion_r888915090 ## hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark3Analysis.scala: ## @@ -45,16 +45,22 @@ case class HoodieSpark3Analysis(sparkSession: SparkSession) extends Rule[Logical with SparkAdapterSupport with ProvidesHoodieConfig { override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsDown { -case dsv2 @ DataSourceV2Relation(d: HoodieInternalV2Table, _, _, _, _) => - val output = dsv2.output - val catalogTable = if (d.catalogTable.isDefined) { -Some(d.v1Table) - } else { -None - } - val relation = new DefaultSource().createRelation(new SQLContext(sparkSession), -buildHoodieConfig(d.hoodieCatalogTable)) - LogicalRelation(relation, output, catalogTable, isStreaming = false) +// NOTE: This step is required since Hudi relations don't currently implement DS V2 Read API +case dsv2 @ DataSourceV2Relation(tbl: HoodieInternalV2Table, _, _, _, _) => + val qualifiedTableName = QualifiedTableName(tbl.v1Table.database, tbl.v1Table.identifier.table) + val catalog = sparkSession.sessionState.catalog + + catalog.getCachedPlan(qualifiedTableName, () => { Review Comment: No. v1 and v2 just differ in the internal workings of Spark, plus a few extra configurations. The interface to the user is not actually affected. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] ws-dohashi commented on issue #2089: Reading MOR Tables - Not Working
ws-dohashi commented on issue #2089: URL: https://github.com/apache/hudi/issues/2089#issuecomment-1145923835 @harishchanderramesh Hi! Wondering if AWS was eventually able to resolve this for you, and if so, how you did it? We are currently running into a similar `org.apache.http.NoHttpResponseException: The target server failed to respond` issue, and hoping that if you were able to find a resolution it could help us. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] GnsCy commented on issue #5729: [SUPPORT] Environment issues when running Demo for v0.11
GnsCy commented on issue #5729: URL: https://github.com/apache/hudi/issues/5729#issuecomment-1145872951 @xushiyan wouldn't running the demo on docker eliminate any environment setup discrepancies? I am running the setup on a clean Ubuntu OS. P.S. Btw, I managed to run the same setup successfully for v0.10.1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pratyakshsharma commented on issue #5735: No hudi dataset was saved to s3
pratyakshsharma commented on issue #5735: URL: https://github.com/apache/hudi/issues/5735#issuecomment-1145794222 @fzong76 Do you see any exception in the logs? > but failed with --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer when trying to load multiple tables. I see you are only trying to load a single table `fei_hudi_test.table1` as mentioned in the config file but you mentioned "trying to load multiple tables". Am I missing something here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-4183) Fix using HoodieCatalog to create non-hudi tables
[ https://issues.apache.org/jira/browse/HUDI-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leesf closed HUDI-4183. --- Resolution: Fixed > Fix using HoodieCatalog to create non-hudi tables > - > > Key: HUDI-4183 > URL: https://issues.apache.org/jira/browse/HUDI-4183 > Project: Apache Hudi > Issue Type: Improvement >Reporter: leesf >Assignee: leesf >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-4183) Fix using HoodieCatalog to create non-hudi tables
[ https://issues.apache.org/jira/browse/HUDI-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leesf updated HUDI-4183: Fix Version/s: 0.12.0 > Fix using HoodieCatalog to create non-hudi tables > - > > Key: HUDI-4183 > URL: https://issues.apache.org/jira/browse/HUDI-4183 > Project: Apache Hudi > Issue Type: Improvement >Reporter: leesf >Assignee: leesf >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (HUDI-4183) Fix using HoodieCatalog to create non-hudi tables
[ https://issues.apache.org/jira/browse/HUDI-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leesf resolved HUDI-4183. - > Fix using HoodieCatalog to create non-hudi tables > - > > Key: HUDI-4183 > URL: https://issues.apache.org/jira/browse/HUDI-4183 > Project: Apache Hudi > Issue Type: Improvement >Reporter: leesf >Assignee: leesf >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[GitHub] [hudi] yuzhaojing commented on issue #5728: [SUPPORT] Flink support Timeline-server-based marker
yuzhaojing commented on issue #5728: URL: https://github.com/apache/hudi/issues/5728#issuecomment-1145776631 This is a great proposal, I will support this feature before the 0.12 release. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables (#5743)
This is an automated email from the ASF dual-hosted git repository. leesf pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 3759a38b99 [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables (#5743) 3759a38b99 is described below commit 3759a38b99cf9bb7540cd1881879cc0547a25e70 Author: leesf <490081...@qq.com> AuthorDate: Fri Jun 3 17:16:48 2022 +0800 [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables (#5743) --- .../apache/spark/sql/hudi/TestCreateTable.scala| 31 ++ .../spark/sql/hudi/catalog/HoodieCatalog.scala | 10 --- 2 files changed, 38 insertions(+), 3 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala index cad30eca24..7091de4a8e 100644 --- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala +++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala @@ -781,4 +781,35 @@ class TestCreateTable extends HoodieSparkSqlTestBase { val tablePath = s"${dbPath}/${tableName}" assertResult(false)(existsPath(tablePath)) } + + test("Test Create Non-Hudi Table(Parquet Table)") { +val databaseName = "test_database" +spark.sql(s"create database if not exists $databaseName") +spark.sql(s"use $databaseName") + +val tableName = generateTableName +// Create a managed table +spark.sql( + s""" + | create table $tableName ( + | id int, + | name string, + | price double, + | ts long + | ) using parquet + """.stripMargin) +val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName)) +assertResult(tableName)(table.identifier.table) +assertResult("parquet")(table.provider.get) +assertResult(CatalogTableType.MANAGED)(table.tableType) +assertResult( + Seq( +StructField("id", IntegerType), +StructField("name", StringType), +StructField("price", DoubleType), +StructField("ts", LongType)) +)(table.schema.fields) + +spark.sql("use default") + } } diff --git a/hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala b/hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala index 67012c7723..e1c2f228fa 100644 --- a/hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala +++ b/hudi-spark-datasource/hudi-spark3/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala @@ -118,9 +118,13 @@ class HoodieCatalog extends DelegatingCatalogExtension schema: StructType, partitions: Array[Transform], properties: util.Map[String, String]): Table = { -val locUriAndTableType = deduceTableLocationURIAndTableType(ident, properties) -createHoodieTable(ident, schema, locUriAndTableType, partitions, properties, - Map.empty, Option.empty, TableCreationMode.CREATE) +if (sparkAdapter.isHoodieTable(properties)) { + val locUriAndTableType = deduceTableLocationURIAndTableType(ident, properties) + createHoodieTable(ident, schema, locUriAndTableType, partitions, properties, +Map.empty, Option.empty, TableCreationMode.CREATE) +} else { + super.createTable(ident, schema, partitions, properties) +} } override def tableExists(ident: Identifier): Boolean = super.tableExists(ident)
[GitHub] [hudi] leesf merged pull request #5743: [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables
leesf merged PR #5743: URL: https://github.com/apache/hudi/pull/5743 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] BuddyJack commented on issue #5728: [SUPPORT] Flink support Timeline-server-based marker
BuddyJack commented on issue #5728: URL: https://github.com/apache/hudi/issues/5728#issuecomment-1145747140 Email received! Thank you! 薛超 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on issue #5728: [SUPPORT] Flink support Timeline-server-based marker
xushiyan commented on issue #5728: URL: https://github.com/apache/hudi/issues/5728#issuecomment-1145746881 @yuzhaojing can you please take this and advise? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on issue #5729: [SUPPORT] Environment issues when running Demo for v0.11
xushiyan commented on issue #5729: URL: https://github.com/apache/hudi/issues/5729#issuecomment-1145745363 I suspect this is a discrepancy in your environment setup. We have integration tests running end to end with the docker demo for every commit. And we certainly tested the deltastreamer: org.apache.hudi.integ.ITTestHoodieDemo#ingestFirstBatchAndHiveSync -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org