[GitHub] [hudi] hudi-bot commented on pull request #6225: [HUDI-4487] support to create ro/rt table by spark sql
hudi-bot commented on PR #6225: URL: https://github.com/apache/hudi/pull/6225#issuecomment-1200087165 ## CI report: * de8c1ae0ed8433f13e2f2e3087bc31499a9b3c05 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10429) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] soma1712 commented on issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception
soma1712 commented on issue #6249: URL: https://github.com/apache/hudi/issues/6249#issuecomment-1200084568 hudi_read.txt is actually a .py file. As the system did not support uploading a .py file, I had to rename it to .txt. [hudi_read.txt](https://github.com/apache/hudi/files/9224774/hudi_read.txt) [results.txt](https://github.com/apache/hudi/files/9224775/results.txt)
[GitHub] [hudi] hudi-bot commented on pull request #6225: [HUDI-4487] support to create ro/rt table by spark sql
hudi-bot commented on PR #6225: URL: https://github.com/apache/hudi/pull/6225#issuecomment-1200078793 ## CI report: * de8c1ae0ed8433f13e2f2e3087bc31499a9b3c05 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10429) * Unknown: [CANCELED](TBD)
[GitHub] [hudi] hudi-bot commented on pull request #6225: [HUDI-4487] support to create ro/rt table by spark sql
hudi-bot commented on PR #6225: URL: https://github.com/apache/hudi/pull/6225#issuecomment-1200078230 ## CI report: * de8c1ae0ed8433f13e2f2e3087bc31499a9b3c05 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10429) * Unknown: [CANCELED](TBD)
[GitHub] [hudi] hudi-bot commented on pull request #6225: [HUDI-4487] support to create ro/rt table by spark sql
hudi-bot commented on PR #6225: URL: https://github.com/apache/hudi/pull/6225#issuecomment-1200077692 ## CI report: * de8c1ae0ed8433f13e2f2e3087bc31499a9b3c05 UNKNOWN
[hudi] branch master updated: [MINOR] Fix convertPathWithScheme tests (#6251)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new c9725899c3 [MINOR] Fix convertPathWithScheme tests (#6251)
c9725899c3 is described below

commit c9725899c3f9516412dcc683875d81ac226d9b45
Author: Y Ethan Guo
AuthorDate: Fri Jul 29 19:26:30 2022 -0700

    [MINOR] Fix convertPathWithScheme tests (#6251)
---
 .../test/java/org/apache/hudi/common/fs/TestStorageSchemes.java | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java b/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java
index 9b173254ac..354ad6d0cc 100644
--- a/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java
+++ b/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java
@@ -69,6 +69,11 @@ public class TestStorageSchemes {
     assertEquals(s3TablePath3, HoodieWrapperFileSystem.convertPathWithScheme(s3TablePath3, "s3"));
     Path hdfsTablePath = new Path("hdfs://sandbox.foo.com:8020/test.1234/table1");
-    System.out.println(HoodieWrapperFileSystem.convertPathWithScheme(hdfsTablePath, "hdfs"));
+    assertEquals(hdfsTablePath, HoodieWrapperFileSystem.convertPathWithScheme(hdfsTablePath, "hdfs"));
+
+    Path localTablePath = new Path("file:/var/table1");
+    Path localTablePathNoPrefix = new Path("/var/table1");
+    assertEquals(localTablePath, HoodieWrapperFileSystem.convertPathWithScheme(localTablePath, "file"));
+    assertEquals(localTablePath, HoodieWrapperFileSystem.convertPathWithScheme(localTablePathNoPrefix, "file"));
   }
 }
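The behavior these test assertions pin down — a path that already carries the target scheme passes through unchanged, and a schemeless path gets the scheme prepended — can be sketched with plain strings. This is a hypothetical standalone helper for illustration only; the real `HoodieWrapperFileSystem.convertPathWithScheme` operates on Hadoop `Path` objects.

```java
public class SchemeConverter {
    // Replace (or prepend) the URI scheme of a path string.
    // A schemeless path like "/var/table1" becomes "file:/var/table1";
    // a path that already has a scheme gets it swapped for the new one.
    static String convertPathWithScheme(String path, String newScheme) {
        int idx = path.indexOf(":/");
        String schemeless = idx >= 0 ? path.substring(idx + 1) : path;
        return newScheme + ":" + schemeless;
    }

    public static void main(String[] args) {
        // The three cases the test above asserts:
        System.out.println(convertPathWithScheme("/var/table1", "file"));      // file:/var/table1
        System.out.println(convertPathWithScheme("file:/var/table1", "file")); // file:/var/table1
        System.out.println(convertPathWithScheme(
            "hdfs://sandbox.foo.com:8020/test.1234/table1", "hdfs"));          // unchanged
    }
}
```

Note that the test being fixed previously only printed the converted path instead of asserting on it, which is why it could never fail.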
[GitHub] [hudi] xushiyan merged pull request #6251: [MINOR] Fix convertPathWithScheme tests
xushiyan merged PR #6251: URL: https://github.com/apache/hudi/pull/6251
[GitHub] [hudi] xushiyan merged pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
xushiyan merged PR #6250: URL: https://github.com/apache/hudi/pull/6250
[hudi] branch master updated: [HUDI-4507] Improve file name extraction logic in metadata utils (#6250)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 0f703a7e15 [HUDI-4507] Improve file name extraction logic in metadata utils (#6250)
0f703a7e15 is described below

commit 0f703a7e15833493037f7f7a07882cd73044ee65
Author: Y Ethan Guo
AuthorDate: Fri Jul 29 19:25:57 2022 -0700

    [HUDI-4507] Improve file name extraction logic in metadata utils (#6250)
---
 .../java/org/apache/hudi/common/fs/FSUtils.java     | 18 ++
 .../hudi/metadata/HoodieTableMetadataUtil.java      | 21 -
 .../java/org/apache/hudi/common/fs/TestFSUtils.java | 12 ++--
 3 files changed, 32 insertions(+), 19 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
index cfc143e3d0..d940f3bb45 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
@@ -615,6 +615,24 @@ public class FSUtils {
     return StringUtils.isNullOrEmpty(partitionPath) ? basePath : new Path(basePath, partitionPath);
   }

+  /**
+   * Extracts the file name from the relative path based on the table base path. For example:
+   * "/2022/07/29/file1.parquet", "/2022/07/29" -> "file1.parquet"
+   * "2022/07/29/file2.parquet", "2022/07/29" -> "file2.parquet"
+   * "/file3.parquet", "" -> "file3.parquet"
+   * "file4.parquet", "" -> "file4.parquet"
+   *
+   * @param filePathWithPartition the relative file path based on the table base path.
+   * @param partition the relative partition path. For partitioned table, `partition` contains the relative partition path;
+   *                  for non-partitioned table, `partition` is empty
+   * @return Extracted file name in String.
+   */
+  public static String getFileName(String filePathWithPartition, String partition) {
+    int offset = StringUtils.isNullOrEmpty(partition)
+        ? (filePathWithPartition.startsWith("/") ? 1 : 0) : partition.length() + 1;
+    return filePathWithPartition.substring(offset);
+  }
+
   /**
    * Get DFS full partition path (e.g. hdfs://ip-address:8020:/)
    */

diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
index d41f09990e..2c5b8db0ed 100644
--- a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
+++ b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
@@ -325,16 +325,13 @@ public class HoodieTableMetadataUtil {
           return map;
         }

-        int offset = partition.equals(NON_PARTITIONED_NAME)
-            ? (pathWithPartition.startsWith("/") ? 1 : 0)
-            : partition.length() + 1;
-        String filename = pathWithPartition.substring(offset);
+        String fileName = FSUtils.getFileName(pathWithPartition, partitionStatName);

         // Since write-stats are coming in no particular order, if the same
         // file have previously been appended to w/in the txn, we simply pick max
         // of the sizes as reported after every write, since file-sizes are
         // monotonically increasing (ie file-size never goes down, unless deleted)
-        map.merge(filename, stat.getFileSizeInBytes(), Math::max);
+        map.merge(fileName, stat.getFileSizeInBytes(), Math::max);
         return map;
       },
@@ -410,12 +407,7 @@
         return Collections.emptyListIterator();
       }

-      // For partitioned table, "partition" contains the relative partition path;
-      // for non-partitioned table, "partition" is empty
-      int offset = StringUtils.isNullOrEmpty(partition)
-          ? (pathWithPartition.startsWith("/") ? 1 : 0) : partition.length() + 1;
-
-      final String fileName = pathWithPartition.substring(offset);
+      String fileName = FSUtils.getFileName(pathWithPartition, partition);
       if (!FSUtils.isBaseFile(new Path(fileName))) {
         return Collections.emptyListIterator();
       }
@@ -1162,13 +1154,8 @@
       HoodieTableMetaClient datasetMetaClient, List columnsToIndex, boolean isDeleted) {
-    String partitionName = getPartitionIdentifier(partitionPath);
-    // NOTE: We
[GitHub] [hudi] xiarixiaoyao commented on issue #6243: [SUPPORT] sparksql mergeinto sqlstatment 'update set' not effect
xiarixiaoyao commented on issue #6243: URL: https://github.com/apache/hudi/issues/6243#issuecomment-1200067268 @fujianhua168 the reason is that you have configured preCombineField = 'ts'. The old record has a bigger ts (1000) than the new record's ts (900), so Hudi will not merge the new record.
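The preCombine semantics described above — when two records share a key, the record with the larger ordering value wins — can be sketched as follows. The `Row` class and `preCombine` method here are illustrative stand-ins, not Hudi's actual `HoodieRecordPayload` API, and exact tie-breaking varies by payload class:

```java
public class PreCombineDemo {
    // A tiny stand-in for a record payload: key, ordering field (ts), value.
    static final class Row {
        final String key; final long ts; final String value;
        Row(String key, long ts, String value) { this.key = key; this.ts = ts; this.value = value; }
    }

    // preCombine: keep the record with the larger ordering value.
    static Row preCombine(Row existing, Row incoming) {
        return incoming.ts > existing.ts ? incoming : existing;
    }

    public static void main(String[] args) {
        Row existing = new Row("id1", 1000L, "old");
        Row incoming = new Row("id1", 900L, "new");
        // The stored record's ts (1000) beats the incoming ts (900), so the
        // MERGE INTO ... UPDATE SET appears to have no effect, as reported.
        System.out.println(preCombine(existing, incoming).value); // prints "old"
    }
}
```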
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1200066863 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10464)
[jira] [Created] (HUDI-4508) Fix bug: a fileSlice with no baseFile throws an exception during a read-optimized query on a MOR table
sherhomhuang created HUDI-4508: -- Summary: Fix bug: a fileSlice with no baseFile throws an exception during a read-optimized query on a MOR table Key: HUDI-4508 URL: https://issues.apache.org/jira/browse/HUDI-4508 Project: Apache Hudi Issue Type: Bug Components: hive, trino-presto Reporter: sherhomhuang Assignee: sherhomhuang When reading a partition containing a fileSlice without a baseFile, a read-optimized query on a MOR table throws an exception. It should not throw an exception; it should instead return no results for that fileSlice. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] leesf commented on a diff in pull request #6245: [HUDI-4506] make BucketIndexPartitioner distribute data more balance
leesf commented on code in PR #6245: URL: https://github.com/apache/hudi/pull/6245#discussion_r933703835 ## .idea/vcs.xml: ## @@ -1,20 +1,4 @@
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1200057185 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10464)
[GitHub] [hudi] hudi-bot commented on pull request #6251: [MINOR] Fix convertPathWithScheme tests
hudi-bot commented on PR #6251: URL: https://github.com/apache/hudi/pull/6251#issuecomment-1200056245 ## CI report: * c52b9f9ac77b8493cdb0ae012a790b92cd0e4dcc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10465)
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1200056238 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x and more
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-1200056209 ## CI report: * c771a314e72284d22cd682a48eb0013aaf09b3cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10466)
[GitHub] [hudi] xushiyan commented on a diff in pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
xushiyan commented on code in PR #6250: URL: https://github.com/apache/hudi/pull/6250#discussion_r933703835 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -325,16 +325,13 @@ public static List convertMetadataToFilesPartitionRecords(HoodieCo return map; } -int offset = partition.equals(NON_PARTITIONED_NAME) -? (pathWithPartition.startsWith("/") ? 1 : 0) -: partition.length() + 1; -String filename = pathWithPartition.substring(offset); +String fileName = FSUtils.getFileName(pathWithPartition, partitionStatName); Review Comment: ok i see. i was confused by the var name `partition`, which should actually be called `partitionIdentifier`
[GitHub] [hudi] yihua commented on a diff in pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
yihua commented on code in PR #6250: URL: https://github.com/apache/hudi/pull/6250#discussion_r933697289 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -325,16 +325,13 @@ public static List convertMetadataToFilesPartitionRecords(HoodieCo return map; } -int offset = partition.equals(NON_PARTITIONED_NAME) -? (pathWithPartition.startsWith("/") ? 1 : 0) -: partition.length() + 1; -String filename = pathWithPartition.substring(offset); +String fileName = FSUtils.getFileName(pathWithPartition, partitionStatName); Review Comment: We cannot use `partition` here, which is generated by `getPartitionIdentifier(partitionStatName)`, changing the empty relative partition path to the `.` partition identifier. `getFileName()` expects the plain relative partition path, instead of the partition identifier used in the metadata table.
[GitHub] [hudi] yihua commented on issue #6236: [SUPPORT] facing an issue on querying Data in Hudi version 0.10.1 using AWS glue
yihua commented on issue #6236: URL: https://github.com/apache/hudi/issues/6236#issuecomment-1200049384 @svaddoriya Have you tried to increase Spark memory settings? @rahil-c @zhedoubushishi @umehrot2 do you have any suggestions or best practices for querying the Hudi table with AWS Glue?
[GitHub] [hudi] xushiyan commented on a diff in pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
xushiyan commented on code in PR #6250: URL: https://github.com/apache/hudi/pull/6250#discussion_r933696374 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -325,16 +325,13 @@ public static List convertMetadataToFilesPartitionRecords(HoodieCo return map; } -int offset = partition.equals(NON_PARTITIONED_NAME) -? (pathWithPartition.startsWith("/") ? 1 : 0) -: partition.length() + 1; -String filename = pathWithPartition.substring(offset); +String fileName = FSUtils.getFileName(pathWithPartition, partitionStatName); Review Comment: the new util getFileName() uses `partition`, right? Why not pass `partition` here, as was previously done?
[GitHub] [hudi] yihua commented on issue #6243: [SUPPORT] sparksql mergeinto sqlstatment 'update set' not effect
yihua commented on issue #6243: URL: https://github.com/apache/hudi/issues/6243#issuecomment-1200048651 @xiarixiaoyao @YannByron @XuQianJin-Stars can any of you help here?
[GitHub] [hudi] yihua commented on issue #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception
yihua commented on issue #6249: URL: https://github.com/apache/hudi/issues/6249#issuecomment-1200047941 @soma1712 could you share how you read the Hudi table in `s3://pythonscripts/hudi_read.py` and the full stacktrace as well? Which Hudi release do you use?
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x and more
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-1200043467 ## CI report: * 745c015e848fde5d7a78c21e828af97705efa0d0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10463) * c771a314e72284d22cd682a48eb0013aaf09b3cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10466)
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x and more
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-1200042181 ## CI report: * 745c015e848fde5d7a78c21e828af97705efa0d0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10463) * c771a314e72284d22cd682a48eb0013aaf09b3cb UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1200016266 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10464)
[GitHub] [hudi] hudi-bot commented on pull request #6251: [MINOR] Fix convertPathWithScheme tests
hudi-bot commented on PR #6251: URL: https://github.com/apache/hudi/pull/6251#issuecomment-1200012764 ## CI report: * c52b9f9ac77b8493cdb0ae012a790b92cd0e4dcc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10465)
[GitHub] [hudi] hudi-bot commented on pull request #6251: [MINOR] Fix convertPathWithScheme tests
hudi-bot commented on PR #6251: URL: https://github.com/apache/hudi/pull/6251#issuecomment-1200011183 ## CI report: * c52b9f9ac77b8493cdb0ae012a790b92cd0e4dcc UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-129286 ## CI report: * 745c015e848fde5d7a78c21e828af97705efa0d0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10463)
[GitHub] [hudi] yihua opened a new pull request, #6251: [MINOR] Fix convertPathWithScheme tests
yihua opened a new pull request, #6251: URL: https://github.com/apache/hudi/pull/6251 ## What is the purpose of the pull request This PR fixes the tests of `HoodieWrapperFileSystem.convertPathWithScheme`. ## Brief change log - Fixes tests in `TestStorageSchemes` ## Verify this pull request This change adds tests in `TestStorageSchemes`. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1199968512 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10464)
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-1199968461 ## CI report: * 745c015e848fde5d7a78c21e828af97705efa0d0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10463)
[GitHub] [hudi] yihua commented on a diff in pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table
yihua commented on code in PR #6113: URL: https://github.com/apache/hudi/pull/6113#discussion_r933637330 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -409,8 +409,11 @@ public static HoodieData convertMetadataToBloomFilterRecords( LOG.error("Failed to find path in write stat to update metadata table " + hoodieWriteStat); return Collections.emptyListIterator(); } - int offset = partition.equals(NON_PARTITIONED_NAME) ? (pathWithPartition.startsWith("/") ? 1 : 0) : - partition.length() + 1; + + // For partitioned table, "partition" contains the relative partition path; + // for non-partitioned table, "partition" is empty + int offset = StringUtils.isNullOrEmpty(partition) Review Comment: Addressed in #6250. `String.replace` could be slow so I still use the current logic. I moved it into a util method.
[jira] [Updated] (HUDI-4496) ORC fails w/ Spark 3.1
[ https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4496: - Labels: pull-request-available (was: ) > ORC fails w/ Spark 3.1 > -- > > Key: HUDI-4496 > URL: https://issues.apache.org/jira/browse/HUDI-4496 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > > After running TestHoodieSparkSqlWriter test for different Spark versions, > discovered that Orc version was incorrectly put as compile time dep on the > classpath, breaking Orc writing in Hudi in Spark 3.1: > https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true
[GitHub] [hudi] hudi-bot commented on pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
hudi-bot commented on PR #6250: URL: https://github.com/apache/hudi/pull/6250#issuecomment-1199966586 ## CI report: * 129007ab3840f01ccafaf5ef73275301fcd6799f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x
hudi-bot commented on PR #6227: URL: https://github.com/apache/hudi/pull/6227#issuecomment-1199966546 ## CI report: * 745c015e848fde5d7a78c21e828af97705efa0d0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] yihua commented on a diff in pull request #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
yihua commented on code in PR #6250: URL: https://github.com/apache/hudi/pull/6250#discussion_r933634375 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -1162,13 +1154,8 @@ private static Stream getColumnStatsRecords(String partitionPath, HoodieTableMetaClient datasetMetaClient, List columnsToIndex, boolean isDeleted) { -String partitionName = getPartitionIdentifier(partitionPath); -// NOTE: We have to chop leading "/" to make sure Hadoop does not treat it like -// absolute path String filePartitionPath = filePath.startsWith("/") ? filePath.substring(1) : filePath; -String fileName = partitionName.equals(NON_PARTITIONED_NAME) -? filePartitionPath -: filePartitionPath.substring(partitionName.length() + 1); +String fileName = FSUtils.getFileName(filePath, partitionPath); Review Comment: The same here, using `partitionPath` directly instead of the partition identifier. ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -325,16 +325,13 @@ public static List convertMetadataToFilesPartitionRecords(HoodieCo return map; } -int offset = partition.equals(NON_PARTITIONED_NAME) -? (pathWithPartition.startsWith("/") ? 1 : 0) -: partition.length() + 1; -String filename = pathWithPartition.substring(offset); +String fileName = FSUtils.getFileName(pathWithPartition, partitionStatName); Review Comment: Before the change, the `partition` identifier was used instead of `partitionStatName`. For a partitioned table, there is no difference; for a non-partitioned table, the `partition` identifier is `.` while `partitionStatName` could be empty or `/`. The new logic depends on `partitionStatName` instead of the `partition` identifier, and the file name extracted is not affected.
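For readers following the `partition` identifier vs `partitionStatName` distinction above: a toy sketch of the mapping described in the comment (inferred from the discussion, not Hudi source):

```python
NON_PARTITIONED_NAME = "."  # marker the comment says identifies a non-partitioned table

def partition_identifier(relative_partition_path: str) -> str:
    # An empty relative path (non-partitioned table) maps to the "." marker;
    # a real partition path passes through unchanged.
    return NON_PARTITIONED_NAME if not relative_partition_path else relative_partition_path

print(partition_identifier(""))            # .
print(partition_identifier("2022/01/01"))  # 2022/01/01
```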
[jira] [Updated] (HUDI-4507) Improve file name extraction logic in metadata utils
[ https://issues.apache.org/jira/browse/HUDI-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4507: - Labels: pull-request-available (was: ) > Improve file name extraction logic in metadata utils > > > Key: HUDI-4507 > URL: https://issues.apache.org/jira/browse/HUDI-4507 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality >Reporter: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > https://github.com/apache/hudi/pull/6113#discussion_r929275152
[GitHub] [hudi] yihua opened a new pull request, #6250: [HUDI-4507] Improve file name extraction logic in metadata utils
yihua opened a new pull request, #6250: URL: https://github.com/apache/hudi/pull/6250 ## What is the purpose of the pull request This PR improves file name extraction logic in metadata utils by adding a new util method. ## Brief change log - Adds a new util method `FSUtils.getFileName` - Refactors the logic of extracting file names in `HoodieTableMetadataUtil` - Adds a unit test for the new util method ## Verify this pull request This change adds a new test as mentioned above. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Commented] (HUDI-4496) ORC fails w/ Spark 3.1
[ https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573171#comment-17573171 ] Alexey Kudinkin commented on HUDI-4496: --- [https://github.com/apache/hudi/pull/6227] > ORC fails w/ Spark 3.1 > -- > > Key: HUDI-4496 > URL: https://issues.apache.org/jira/browse/HUDI-4496 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > > After running TestHoodieSparkSqlWriter test for different Spark versions, > discovered that Orc version was incorrectly put as compile time dep on the > classpath, breaking Orc writing in Hudi in Spark 3.1: > https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true
[jira] [Updated] (HUDI-4496) ORC fails w/ Spark 3.1
[ https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4496: -- Status: Patch Available (was: In Progress) > ORC fails w/ Spark 3.1 > -- > > Key: HUDI-4496 > URL: https://issues.apache.org/jira/browse/HUDI-4496 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > > After running TestHoodieSparkSqlWriter test for different Spark versions, > discovered that Orc version was incorrectly put as compile time dep on the > classpath, breaking Orc writing in Hudi in Spark 3.1: > https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true
[jira] [Updated] (HUDI-4496) ORC fails w/ Spark 3.1
[ https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4496: -- Story Points: 2 (was: 1) > ORC fails w/ Spark 3.1 > -- > > Key: HUDI-4496 > URL: https://issues.apache.org/jira/browse/HUDI-4496 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > > After running TestHoodieSparkSqlWriter test for different Spark versions, > discovered that Orc version was incorrectly put as compile time dep on the > classpath, breaking Orc writing in Hudi in Spark 3.1: > https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true
[jira] [Updated] (HUDI-4496) ORC fails w/ Spark 3.1
[ https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4496: -- Status: In Progress (was: Open) > ORC fails w/ Spark 3.1 > -- > > Key: HUDI-4496 > URL: https://issues.apache.org/jira/browse/HUDI-4496 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > > After running TestHoodieSparkSqlWriter test for different Spark versions, > discovered that Orc version was incorrectly put as compile time dep on the > classpath, breaking Orc writing in Hudi in Spark 3.1: > https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true
[jira] [Updated] (HUDI-4507) Improve file name extraction logic in metadata utils
[ https://issues.apache.org/jira/browse/HUDI-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4507: Summary: Improve file name extraction logic in metadata utils (was: Improve filename extraction logic in metadata utils) > Improve file name extraction logic in metadata utils > > > Key: HUDI-4507 > URL: https://issues.apache.org/jira/browse/HUDI-4507 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality >Reporter: Ethan Guo >Priority: Major > Fix For: 0.12.0 > > > https://github.com/apache/hudi/pull/6113#discussion_r929275152
[jira] [Updated] (HUDI-4507) Improve filename extraction logic in metadata utils
[ https://issues.apache.org/jira/browse/HUDI-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4507: Fix Version/s: 0.12.0 > Improve filename extraction logic in metadata utils > --- > > Key: HUDI-4507 > URL: https://issues.apache.org/jira/browse/HUDI-4507 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Fix For: 0.12.0 > >
[jira] [Updated] (HUDI-4507) Improve filename extraction logic in metadata utils
[ https://issues.apache.org/jira/browse/HUDI-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4507: Component/s: code-quality > Improve filename extraction logic in metadata utils > --- > > Key: HUDI-4507 > URL: https://issues.apache.org/jira/browse/HUDI-4507 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality >Reporter: Ethan Guo >Priority: Major > Fix For: 0.12.0 > >
[jira] [Updated] (HUDI-4507) Improve filename extraction logic in metadata utils
[ https://issues.apache.org/jira/browse/HUDI-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4507: Description: https://github.com/apache/hudi/pull/6113#discussion_r929275152 > Improve filename extraction logic in metadata utils > --- > > Key: HUDI-4507 > URL: https://issues.apache.org/jira/browse/HUDI-4507 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality >Reporter: Ethan Guo >Priority: Major > Fix For: 0.12.0 > > > https://github.com/apache/hudi/pull/6113#discussion_r929275152
[jira] [Created] (HUDI-4507) Improve filename extraction logic in metadata utils
Ethan Guo created HUDI-4507: --- Summary: Improve filename extraction logic in metadata utils Key: HUDI-4507 URL: https://issues.apache.org/jira/browse/HUDI-4507 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo
[GitHub] [hudi] neerajpadarthi commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns
neerajpadarthi commented on issue #6232: URL: https://github.com/apache/hudi/issues/6232#issuecomment-1199914603 @yihua Hey, I have verified the same in Hudi 0.10.1 but no luck; the precision is still getting truncated. Below are the configs, Spark session details, and Spark/Hudi outputs. Could you please verify and let me know if anything is missing here? Thanks

===Environment Details
EMR: emr-6.6.0
Hudi version : 0.10.1
Spark version : Spark 3.2.0
Hive version : Hive 3.1.2
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no

===Spark Configs
def create_spark_session():
    spark = SparkSession \
        .builder \
        .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
        .config("spark.sql.parquet.writeLegacyFormat", "true") \
        .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") \
        .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY") \
        .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY") \
        .enableHiveSupport() \
        .getOrCreate()
    return spark

===Hudi Configs
db_name = <>
tableName = <>
pk = <>
de_dup = <>
commonConfig = {'hoodie.datasource.hive_sync.database': db_name, 'hoodie.table.name': tableName, 'hoodie.datasource.hive_sync.support_timestamp': 'true', 'hoodie.datasource.write.recordkey.field': pk, 'hoodie.datasource.write.precombine.field': de_dup, 'hoodie.datasource.hive_sync.enable': 'true', 'hoodie.datasource.hive_sync.table': tableName}
nonPartitionConfig = {'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor', 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator'}
config = {'hoodie.bulkinsert.shuffle.parallelism': 10, 'hoodie.datasource.write.operation': 'bulk_insert', 'hoodie.parquet.outputtimestamptype': 'TIMESTAMP_MICROS'}  # 'hoodie.datasource.write.row.writer.enable': 'false'

===Spark DF Output
| id         | creation_date              | last_updated               |
| 1340225    | 2017-01-24 00:02:10        | 2022-02-25 07:03:54.000853 |
| 722b232f-e | 2022-02-22 06:02:32.000481 | 2022-02-25 08:54:05.00042  |
| 53773de3-9 | 2022-02-25 07:21:06.37     | 2022-02-25 08:35:57.000877 |

===Hudi V0.10.1 Output
| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | creation_date | last_updated |
| 20220729201157281 | 20220729201157281_1_2 | 53773de3-9 | | 55f7c820-c289-4eb7-aabc-4f079bd44536-0_1-11-10_20220729201157281.parquet | 53773de3-9 | 2022-02-25 07:21:06 | 2022-02-25 08:35:57 |
| 20220729201157281 | 20220729201157281_2_3 | 722b232f-e | | 0dd8d6c2-9d64-40d7-a4db-bf7cf95bd02c-0_2-11-11_20220729201157281.parquet | 722b232f-e | 2022-02-22 06:02:32 | 2022-02-25 08:54:05 |
| 20220729201157281 | 20220729201157281_0_1 | 1340225 | | 2e0cf27b-999d-4d5e-9c4e-52d27c25294e-0_0-9-9_20220729201157281.parquet | 1340225 | 2017-01-24 00:02:10 | 2022-02-25 07:03:54 |
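For clarity on what "precision is getting truncated" means in the outputs above: `2022-02-25 07:03:54.000853` comes back as `2022-02-25 07:03:54`, i.e. the sub-second component is dropped. A minimal plain-Python illustration of the symptom (not Hudi code):

```python
from datetime import datetime

def truncate_to_seconds(ts: datetime) -> datetime:
    # Drop the sub-second component, mimicking the observed read-back values.
    return ts.replace(microsecond=0)

src = datetime(2022, 2, 25, 7, 3, 54, 853)  # one of the sample timestamps above
print(src)                       # 2022-02-25 07:03:54.000853
print(truncate_to_seconds(src))  # 2022-02-25 07:03:54
```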
[GitHub] [hudi] hudi-bot commented on pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
hudi-bot commented on PR #6228: URL: https://github.com/apache/hudi/pull/6228#issuecomment-1199889169 ## CI report: * 0cc2dbb39e432baf741bb3dd94c6d627cb250297 UNKNOWN * 6f055012562507406afe0ab0ec37e4a5388538f2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10462) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] xushiyan commented on a diff in pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
xushiyan commented on code in PR #6228: URL: https://github.com/apache/hudi/pull/6228#discussion_r933550437 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java: ## @@ -172,37 +179,46 @@ public Pair>, String> fetchNextBatch(Option lastCkpt String s3FS = props.getString(Config.S3_FS_PREFIX, "s3").toLowerCase(); String s3Prefix = s3FS + "://"; -// Extract distinct file keys from s3 meta hoodie table -final List cloudMetaDf = source +// Create S3 paths +final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); +SerializableConfiguration serializableConfiguration = new SerializableConfiguration(sparkContext.hadoopConfiguration()); +List cloudFiles = source .filter(filter) .select("s3.bucket.name", "s3.object.key") .distinct() -.collectAsList(); -// Create S3 paths -final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); -List cloudFiles = new ArrayList<>(); -for (Row row : cloudMetaDf) { - // construct file path, row index 0 refers to bucket and 1 refers to key - String bucket = row.getString(0); - String filePath = s3Prefix + bucket + "/" + row.getString(1); - if (checkExists) { -FileSystem fs = FSUtils.getFs(s3Prefix + bucket, sparkSession.sparkContext().hadoopConfiguration()); -try { - if (fs.exists(new Path(filePath))) { -cloudFiles.add(filePath); - } -} catch (IOException e) { - LOG.error(String.format("Error while checking path exists for %s ", filePath), e); -} - } else { -cloudFiles.add(filePath); - } -} +.mapPartitions((MapPartitionsFunction) fileListIterator -> { + List cloudFilesPerPartition = new ArrayList<>(); + final Configuration configuration = serializableConfiguration.newCopy(); + fileListIterator.forEachRemaining(row -> { +String bucket = row.getString(0); +String filePath = s3Prefix + bucket + "/" + row.getString(1); +String decodeUrl = null; +try { + decodeUrl = URLDecoder.decode(filePath, 
StandardCharsets.UTF_8.name()); + if (checkExists) { +FileSystem fs = FSUtils.getFs(s3Prefix + bucket, configuration); +if (fs.exists(new Path(decodeUrl))) { + cloudFilesPerPartition.add(decodeUrl); +} + } else { +cloudFilesPerPartition.add(decodeUrl); + } +} catch (IOException e) { + LOG.error(String.format("Error while checking path exists for %s ", decodeUrl), e); +} catch (Throwable e) { + LOG.warn("Failed to add cloud file ", e); Review Comment: Didn't realize this before: in the original logic, any exception other than IOException will fail the fetch, right? Here it'll silence it.
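Aside from the exception-handling question, the other change in the diff above is URL-decoding the object key before the existence check, since S3 event notifications percent-encode keys. A standalone sketch of that path construction (Python's `unquote_plus` standing in for Java's `URLDecoder.decode`, which likewise decodes `+` as a space; names are illustrative):

```python
from urllib.parse import unquote_plus

def to_s3_path(bucket: str, encoded_key: str, s3_prefix: str = "s3://") -> str:
    # Build the full path, then decode percent-escapes (and '+' as space),
    # matching the application/x-www-form-urlencoded semantics of URLDecoder.
    return unquote_plus(f"{s3_prefix}{bucket}/{encoded_key}")

print(to_s3_path("my-bucket", "logs/2022-07-29/file+name%3D1.parquet"))
# s3://my-bucket/logs/2022-07-29/file name=1.parquet
```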
[GitHub] [hudi] soma1712 opened a new issue, #6249: [SUPPORT] - Hudi Read on a MOR table is failing with ArrayIndexOutOfBound exception
soma1712 opened a new issue, #6249: URL: https://github.com/apache/hudi/issues/6249 Detailed Notes - We have incoming delta transactions from an Oracle-based application that are being pushed into an S3 endpoint using AWS DMS services. These CDC records are applied as upserts onto an already existing Hudi table in a different S3 bucket (initial load data).
The UPSERTS are happening by running the below spark-submit:

spark-submit \
  --deploy-mode client \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.default.parallelism=500 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=3 \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.app.name= \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hive/lib/hbase-client.jar /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type MERGE_ON_READ \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:1 \
  --source-ordering-field dms_seq_no \
  --props s3://bucket/cdc.properties \
  --hoodie-conf hoodie.datasource.hive_sync.database=glue_db \
  --target-base-path s3://bucket/table_1 \
  --target-table table_1 \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/ \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --enable-sync

This table will subsequently be read with Hudi options and joined with other Hudi tables to populate the final enriched layer. While reading a Hudi table we are facing the ArrayIndexOutOfBoundsException. Below are the Hudi props and spark-submit we execute to read and populate the downstream.
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.assume_date_partitioning=false
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
hoodie.parquet.small.file.limit=134217728
hoodie.parquet.max.file.size=1048576000
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=1
hoodie.deltastreamer.transformer.sql=select CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted,* from
hoodie.datasource.hive_sync.support_timestamp=true
hoodie.datasource.compaction.async.enable=true
hoodie.index.type=BLOOM
hoodie.compact.inline=true
hoodiecompactionconfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP=5
hoodie.metadata.compact.max.delta.commits=5
hoodie.clean.automatic=true
hoodie.clean.async=true
hoodie.datasource.hive_sync.table=table_1
hoodie.datasource.write.recordkey.field=table_1_ID

spark-submit --deploy-mode client \
  --conf spark.yarn.appMasterEnv.SPARK_HOME=/prod/null \
  --conf spark.executorEnv.SPARK_HOME=/prod/null \
  --conf spark.shuffle.service.enabled=true \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  s3://pythonscripts/hudi_read.py

TaskSetManager: Lost task 32.2 in stage 6.0 (TID 253) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1]
22/07/21 15:50:26 INFO TaskSetManager: Starting task 32.3 in stage 6.0 (TID 296, ip-172-31-16-236.ec2.internal, executor 1, partition 32, PROCESS_LOCAL, 8887 bytes)
22/07/21 15:50:26 INFO TaskSetManager: Lost task 33.2 in stage 6.0 (TID 256) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2]
[GitHub] [hudi] hudi-bot commented on pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
hudi-bot commented on PR #6228: URL: https://github.com/apache/hudi/pull/6228#issuecomment-1199833744 ## CI report: * 0cc2dbb39e432baf741bb3dd94c6d627cb250297 UNKNOWN * e14bff1ef93f0c1fbbacf384d4fcaa3ef314050c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10434) * 6f055012562507406afe0ab0ec37e4a5388538f2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10462) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] vamshigv commented on a diff in pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
vamshigv commented on code in PR #6228: URL: https://github.com/apache/hudi/pull/6228#discussion_r933517693 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java: ## @@ -172,37 +177,47 @@ public Pair>, String> fetchNextBatch(Option lastCkpt String s3FS = props.getString(Config.S3_FS_PREFIX, "s3").toLowerCase(); String s3Prefix = s3FS + "://"; -// Extract distinct file keys from s3 meta hoodie table -final List cloudMetaDf = source +// Create S3 paths +final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); +SerializableConfiguration serializableConfiguration = new SerializableConfiguration(sparkContext.hadoopConfiguration()); +List cloudFiles = source .filter(filter) .select("s3.bucket.name", "s3.object.key") .distinct() -.collectAsList(); -// Create S3 paths -final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); -List cloudFiles = new ArrayList<>(); -for (Row row : cloudMetaDf) { - // construct file path, row index 0 refers to bucket and 1 refers to key - String bucket = row.getString(0); - String filePath = s3Prefix + bucket + "/" + row.getString(1); - if (checkExists) { -FileSystem fs = FSUtils.getFs(s3Prefix + bucket, sparkSession.sparkContext().hadoopConfiguration()); -try { - if (fs.exists(new Path(filePath))) { -cloudFiles.add(filePath); - } -} catch (IOException e) { - LOG.error(String.format("Error while checking path exists for %s ", filePath), e); -} - } else { -cloudFiles.add(filePath); - } -} +.rdd().toJavaRDD().mapPartitions(fileListIterator -> { + List cloudFilesPerPartition = new ArrayList<>(); + fileListIterator.forEachRemaining(row -> { +final Configuration configuration = serializableConfiguration.newCopy(); +String bucket = row.getString(0); +String filePath = s3Prefix + bucket + "/" + row.getString(1); +try { + String decodeUrl = URLDecoder.decode(filePath, StandardCharsets.UTF_8.name()); + if 
(checkExists) { +FileSystem fs = FSUtils.getFs(s3Prefix + bucket, configuration); +try { Review Comment: @xushiyan Simplified this nesting now. PTAL.
[GitHub] [hudi] hudi-bot commented on pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
hudi-bot commented on PR #6228: URL: https://github.com/apache/hudi/pull/6228#issuecomment-1199828862 ## CI report: * 0cc2dbb39e432baf741bb3dd94c6d627cb250297 UNKNOWN * e14bff1ef93f0c1fbbacf384d4fcaa3ef314050c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10434) * 6f055012562507406afe0ab0ec37e4a5388538f2 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] vamshigv commented on a diff in pull request #6228: [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency
vamshigv commented on code in PR #6228: URL: https://github.com/apache/hudi/pull/6228#discussion_r933517139 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java: ## @@ -172,37 +177,47 @@ public Pair>, String> fetchNextBatch(Option lastCkpt String s3FS = props.getString(Config.S3_FS_PREFIX, "s3").toLowerCase(); String s3Prefix = s3FS + "://"; -// Extract distinct file keys from s3 meta hoodie table -final List cloudMetaDf = source +// Create S3 paths +final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); +SerializableConfiguration serializableConfiguration = new SerializableConfiguration(sparkContext.hadoopConfiguration()); +List cloudFiles = source .filter(filter) .select("s3.bucket.name", "s3.object.key") .distinct() -.collectAsList(); -// Create S3 paths -final boolean checkExists = props.getBoolean(Config.ENABLE_EXISTS_CHECK, Config.DEFAULT_ENABLE_EXISTS_CHECK); -List cloudFiles = new ArrayList<>(); -for (Row row : cloudMetaDf) { - // construct file path, row index 0 refers to bucket and 1 refers to key - String bucket = row.getString(0); - String filePath = s3Prefix + bucket + "/" + row.getString(1); - if (checkExists) { -FileSystem fs = FSUtils.getFs(s3Prefix + bucket, sparkSession.sparkContext().hadoopConfiguration()); -try { - if (fs.exists(new Path(filePath))) { -cloudFiles.add(filePath); - } -} catch (IOException e) { - LOG.error(String.format("Error while checking path exists for %s ", filePath), e); -} - } else { -cloudFiles.add(filePath); - } -} +.rdd().toJavaRDD().mapPartitions(fileListIterator -> { Review Comment: @xushiyan removed conversion to JavaRDD here. Applied mapPartitions on dataset directly.
[GitHub] [hudi] leobiscassi commented on issue #6142: [SUPPORT] column ‘_hoodie_is_deleted’ query by presto exception
leobiscassi commented on issue #6142: URL: https://github.com/apache/hudi/issues/6142#issuecomment-1199808432 Hey @qianchutao, I was able to fix this on my side, and maybe the solution helps you too. Basically, this error happens because of a mismatch in column order between the schema declared inside the Parquet files and the table schema DDL on Athena/Presto. This normally works on Athena because Athena's default method of mapping columns is by name [1]; for Presto the default is by column index [2]. So when you have schema evolution, or the order of columns doesn't match between the Parquet files and the table schema for some reason, this starts to happen; nothing related to Hudi itself. To fix it, add the config `hive.parquet.use-column-names=true` under the EMR config tab or at startup time; this will update the config files and restart the Presto cluster. If you want to do this on a running cluster, you'll need to apply it on the master and worker nodes and restart Presto; without that, the config won't take effect. Let me know if this helps. [1] https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html [2] https://stackoverflow.com/questions/60183579/presto-fails-with-type-mismatch-errors
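The index-vs-name resolution mismatch described above can be reproduced without Presto at all; a toy sketch (the `_hoodie_is_deleted` column name comes from the issue, the rest is hypothetical):

```python
file_columns = ["id", "_hoodie_is_deleted", "name"]   # physical order in the parquet file
table_columns = ["id", "name", "_hoodie_is_deleted"]  # order declared in the table DDL
row = {"id": 1, "_hoodie_is_deleted": False, "name": "a"}  # one record as written

# Name-based resolution (Athena's default; Presto with hive.parquet.use-column-names=true):
by_name = [row[c] for c in table_columns]

# Index-based resolution (Presto's default): table column i reads file column i.
by_index = [row[file_columns[i]] for i in range(len(table_columns))]

print(by_name)   # [1, 'a', False]  -> correct values under each declared column
print(by_index)  # [1, False, 'a']  -> 'name' receives a boolean: type mismatch
```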
[GitHub] [hudi] neerajpadarthi commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns
neerajpadarthi commented on issue #6232: URL: https://github.com/apache/hudi/issues/6232#issuecomment-1199790011 @yihua - I will validate with 0.10.1. @YannByron - Thanks for checking. I tested by passing the below configs to the Spark session, but I still see the same issue. "spark.sql.parquet.outputTimestampType","TIMESTAMP_MICROS" "spark.sql.parquet.writeLegacyFormat", "true"
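For reference, the two settings quoted above can also be placed in `spark-defaults.conf` so they apply to every session; a minimal sketch (the values are the ones from this thread — `TIMESTAMP_MICROS` is one of the accepted values of `spark.sql.parquet.outputTimestampType`, alongside `INT96` and `TIMESTAMP_MILLIS`):

```properties
spark.sql.parquet.outputTimestampType  TIMESTAMP_MICROS
spark.sql.parquet.writeLegacyFormat    true
```

Note that these only affect how Spark writes parquet timestamps; if the truncation happens inside Hudi's write path, they would not be sufficient on their own.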
[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
hudi-bot commented on PR #5629: URL: https://github.com/apache/hudi/pull/5629#issuecomment-1199619090 ## CI report: * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN * 279857485f18875cab94f72b5bf61522bdaecd31 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10458) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-4032) Remove double file-listing in Hudi Relations
[ https://issues.apache.org/jira/browse/HUDI-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573043#comment-17573043 ] Alexey Kudinkin commented on HUDI-4032: --- This has been addressed by: https://github.com/apache/hudi/pull/5722/files# > Remove double file-listing in Hudi Relations > > > Key: HUDI-4032 > URL: https://issues.apache.org/jira/browse/HUDI-4032 > Project: Apache Hudi > Issue Type: Task > Components: index >Reporter: Ethan Guo >Priority: Blocker > Fix For: 0.12.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-4032) Remove double file-listing in Hudi Relations
[ https://issues.apache.org/jira/browse/HUDI-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin closed HUDI-4032. - Resolution: Fixed > Remove double file-listing in Hudi Relations > > > Key: HUDI-4032 > URL: https://issues.apache.org/jira/browse/HUDI-4032 > Project: Apache Hudi > Issue Type: Task > Components: index > Reporter: Ethan Guo > Priority: Blocker > Fix For: 0.12.0
[jira] [Updated] (HUDI-4032) Remove double file-listing in BaseFileOnlyRelation
[ https://issues.apache.org/jira/browse/HUDI-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4032: -- Summary: Remove double file-listing in BaseFileOnlyRelation (was: Remove double file-listing in SparkHoodieFileIndex) > Remove double file-listing in BaseFileOnlyRelation > > > Key: HUDI-4032 > URL: https://issues.apache.org/jira/browse/HUDI-4032 > Project: Apache Hudi > Issue Type: Task > Components: index > Reporter: Ethan Guo > Priority: Blocker > Fix For: 0.13.0
[jira] [Updated] (HUDI-4032) Remove double file-listing in Hudi Relations
[ https://issues.apache.org/jira/browse/HUDI-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4032: -- Fix Version/s: 0.12.0 (was: 0.13.0) > Remove double file-listing in Hudi Relations > > > Key: HUDI-4032 > URL: https://issues.apache.org/jira/browse/HUDI-4032 > Project: Apache Hudi > Issue Type: Task > Components: index > Reporter: Ethan Guo > Priority: Blocker > Fix For: 0.12.0
[jira] [Updated] (HUDI-4032) Remove double file-listing in Hudi Relations
[ https://issues.apache.org/jira/browse/HUDI-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-4032: -- Summary: Remove double file-listing in Hudi Relations (was: Remove double file-listing in BaseFileOnlyRelation) > Remove double file-listing in Hudi Relations > > > Key: HUDI-4032 > URL: https://issues.apache.org/jira/browse/HUDI-4032 > Project: Apache Hudi > Issue Type: Task > Components: index > Reporter: Ethan Guo > Priority: Blocker > Fix For: 0.13.0
[GitHub] [hudi] YannByron commented on pull request #6225: [HUDI-4487] support to create ro/rt table by spark sql
YannByron commented on PR #6225: URL: https://github.com/apache/hudi/pull/6225#issuecomment-1199575340 @xushiyan can you help trigger a new CI run? I executed `run azure` twice, but it just returns the first status.
[hudi] branch master updated: [HUDI-4505] Returns instead of throws if lock file exists for FileSystemBasedLockProvider (#6242)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
new e04b3188e2 [HUDI-4505] Returns instead of throws if lock file exists for FileSystemBasedLockProvider (#6242)

e04b3188e2 is described below

commit e04b3188e465eabed71ba19342cb92d10963
Author: Danny Chan
AuthorDate: Fri Jul 29 23:32:19 2022 +0800

    [HUDI-4505] Returns instead of throws if lock file exists for FileSystemBasedLockProvider (#6242)

    To avoid unnecessary exception throws
---
 .../transaction/lock/FileSystemBasedLockProvider.java | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java
index 96a42e8409..4135ef9acd 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java
@@ -54,8 +54,8 @@ public class FileSystemBasedLockProvider implements LockProvider<String>, Serializable {
   private static final String LOCK_FILE_NAME = "lock";
   private final int lockTimeoutMinutes;
-  private transient FileSystem fs;
-  private transient Path lockFile;
+  private final transient FileSystem fs;
+  private final transient Path lockFile;
   protected LockConfiguration lockConfiguration;

   public FileSystemBasedLockProvider(final LockConfiguration lockConfiguration, final Configuration configuration) {
@@ -87,8 +87,13 @@ public class FileSystemBasedLockProvider implements LockProvider<String>, Serializable {
     try {
       synchronized (LOCK_FILE_NAME) {
         // Check whether lock is already expired, if so try to delete lock file
-        if (fs.exists(this.lockFile) && checkIfExpired()) {
-          fs.delete(this.lockFile, true);
+        if (fs.exists(this.lockFile)) {
+          if (checkIfExpired()) {
+            fs.delete(this.lockFile, true);
+            LOG.warn("Delete expired lock file: " + this.lockFile);
+          } else {
+            return false;
+          }
         }
         acquireLock();
         return fs.exists(this.lockFile);
@@ -123,7 +128,7 @@ public class FileSystemBasedLockProvider implements LockProvider<String>, Serializable {
     }
     try {
       long modificationTime = fs.getFileStatus(this.lockFile).getModificationTime();
-      if (System.currentTimeMillis() - modificationTime > lockTimeoutMinutes * 60 * 1000) {
+      if (System.currentTimeMillis() - modificationTime > lockTimeoutMinutes * 60 * 1000L) {
        return true;
      }
    } catch (IOException | HoodieIOException e) {
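One detail in the last hunk of this commit is easy to miss: the timeout product became `lockTimeoutMinutes * 60 * 1000L`. Without the `L` suffix the whole multiplication happens in `int` arithmetic and can overflow before the result is widened to `long` for the comparison. A minimal standalone demonstration (the class and method names below are ours, not Hudi's):

```java
public class LockTimeoutOverflow {

    // Mirrors the pre-fix expression: all three operands are int, so the
    // product overflows before it is widened to long.
    static long wrongMillis(int lockTimeoutMinutes) {
        return lockTimeoutMinutes * 60 * 1000;
    }

    // Mirrors the fixed expression: the 1000L literal forces long arithmetic.
    static long fixedMillis(int lockTimeoutMinutes) {
        return lockTimeoutMinutes * 60 * 1000L;
    }

    public static void main(String[] args) {
        // 40,000 minutes (about 28 days) pushes the int product past 2^31 - 1.
        System.out.println(wrongMillis(40000)); // -1894967296 (overflowed)
        System.out.println(fixedMillis(40000)); // 2400000000
    }
}
```

With the overflowed (negative) value, `System.currentTimeMillis() - modificationTime > timeout` would be true immediately, so a long-lived lock configured with a very large timeout could be treated as expired right away.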
[GitHub] [hudi] danny0405 merged pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
danny0405 merged PR #6242: URL: https://github.com/apache/hudi/pull/6242
[GitHub] [hudi] hudi-bot commented on pull request #6248: [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value
hudi-bot commented on PR #6248: URL: https://github.com/apache/hudi/pull/6248#issuecomment-1199503241 ## CI report: * e74c3a80c33bfc25cc01514efebc3a2c8ba75eb9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10460)
[GitHub] [hudi] hudi-bot commented on pull request #6247: [MINOR] Add license header
hudi-bot commented on PR #6247: URL: https://github.com/apache/hudi/pull/6247#issuecomment-1199344177 ## CI report: * 1c96edb81b1623c50975c8d3fd81241a81e40445 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10457)
[GitHub] [hudi] hudi-bot commented on pull request #6248: [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value
hudi-bot commented on PR #6248: URL: https://github.com/apache/hudi/pull/6248#issuecomment-1199314567 ## CI report: * e74c3a80c33bfc25cc01514efebc3a2c8ba75eb9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10460)
[GitHub] [hudi] hudi-bot commented on pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
hudi-bot commented on PR #6242: URL: https://github.com/apache/hudi/pull/6242#issuecomment-1199314485 ## CI report: * 4887d5c40a6b62998d6f5e64e06e91a326129ff8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10450) * 0ba12547e49adc6b5c285a51b893242f4d1690f6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10459)
[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
hudi-bot commented on PR #5629: URL: https://github.com/apache/hudi/pull/5629#issuecomment-1199313520 ## CI report: * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN * 652a0d666fe29487d3ce2c2ce1cef70dc443dd61 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10302) * 279857485f18875cab94f72b5bf61522bdaecd31 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10458)
[GitHub] [hudi] hudi-bot commented on pull request #6248: [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value
hudi-bot commented on PR #6248: URL: https://github.com/apache/hudi/pull/6248#issuecomment-1199309745 ## CI report: * e74c3a80c33bfc25cc01514efebc3a2c8ba75eb9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
hudi-bot commented on PR #6242: URL: https://github.com/apache/hudi/pull/6242#issuecomment-1199309652 ## CI report: * 4887d5c40a6b62998d6f5e64e06e91a326129ff8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10450) * 0ba12547e49adc6b5c285a51b893242f4d1690f6 UNKNOWN
[GitHub] [hudi] nsivabalan opened a new pull request, #6248: [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value
nsivabalan opened a new pull request, #6248: URL: https://github.com/apache/hudi/pull/6248 ## What is the purpose of the pull request From 0.12, we are standardizing the default partition value for hudi to "__HIVE_DEFAULT_PARTITION__". Previously, hudi used "default" as the default value (i.e. if the partition column is null, this fallback value is used). The fix was put up so that query engines will not run into a class cast exception if the original partition path fields are non-string types. But after this fix, we might need to migrate older hudi tables, i.e. if a "default" partition exists, we have to rewrite it to "__HIVE_DEFAULT_PARTITION__". This patch adds an upgrade step, where we detect such hudi tables and fail the upgrade, along with instructions on what needs to be done before upgrading. ## Brief change log - Added FourToFiveUpgradeHandler to detect hudi tables with a "default" partition and throw an exception. ## Verify this pull request This change added tests and can be verified as follows: - TestUpgradeDowngrade#testUpgradeFourtoFive - TestUpgradeDowngrade#testUpgradeFourtoFiveWithDefaultPartition ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
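The detection step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the class, the `validatePartitions` method, and operating on an in-memory list of partition paths are ours for the demo — the real FourToFiveUpgradeHandler works against the table's metadata, not a plain list.

```java
import java.util.Arrays;
import java.util.List;

public class FourToFiveUpgradeCheck {

    static final String OLD_DEFAULT = "default";
    static final String NEW_DEFAULT = "__HIVE_DEFAULT_PARTITION__";

    // Hypothetical guard: fail the upgrade if the deprecated partition name
    // is present, mirroring the behavior the PR describes.
    static void validatePartitions(List<String> partitionPaths) {
        if (partitionPaths.contains(OLD_DEFAULT)) {
            throw new IllegalStateException(
                "Found deprecated partition '" + OLD_DEFAULT + "'; rewrite it to '"
                    + NEW_DEFAULT + "' before upgrading.");
        }
    }

    public static void main(String[] args) {
        // A table already using the new sentinel value passes the check.
        validatePartitions(Arrays.asList("2022/07/29", NEW_DEFAULT));

        // A table still holding the old "default" partition blocks the upgrade.
        try {
            validatePartitions(Arrays.asList("2022/07/29", OLD_DEFAULT));
        } catch (IllegalStateException e) {
            System.out.println("upgrade blocked: " + e.getMessage());
        }
    }
}
```

Failing fast here, rather than silently rewriting, matches the PR's stated design: the operator must migrate the "default" partition data before the upgrade proceeds.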
[GitHub] [hudi] pratyakshsharma commented on a diff in pull request #4718: [HUDI-3345][RFC-36] Proposal for hudi metastore server.
pratyakshsharma commented on code in PR #4718: URL: https://github.com/apache/hudi/pull/4718#discussion_r933224773 ## rfc/rfc-36/rfc-36.md: ## @@ -0,0 +1,605 @@ + +# RFC-36: Hudi Metastore Server + +## Proposers + +- @minihippo + +## Approvers + + +## Status + +JIRA: [HUDI-3345](https://issues.apache.org/jira/browse/HUDI-3345) + +> Please keep the status updated in `rfc/README.md`. + +# Hudi Metastore Server + +## Abstract + +Currently, Hudi is widely used as a table format in the data warehouse. There is a lack of central metastore server to manage the metadata of data lake table. Hive metastore as a commonly used catalog service in the data warehouse on Hadoop cannot store the unique metadata like timeline of the hudi table. + +The proposal is to implement an unified metadata management system called hudi metastore server to store the metadata of the hudi table, and be compatible with hive metastore so that other engines can access it without any changes. + +## Backgroud + +**How Hudi metadata is stored** + +The metadata of hudi are table location, configuration and schema, timeline generated by instants, metadata of each commit / instant, which records files created / updated, new records num and so on in this commit. Besides, the information of files in a hudi table is also a part of hudi metadata. Review Comment: This discussion brings me to a high level question. Today column stats are already stored at a file level in metadata table. So do we intend to completely replace metadata table with this new metastore server? Or do we intend to use metastore server only to store table level stats similar to how hive metastore does that? Another possibility I can think of is just exposing endpoints via metastore service to interact with different partitions of metadata table as Vinoth pointed out in another comment. @minihippo -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6247: [MINOR] Add license header
hudi-bot commented on PR #6247: URL: https://github.com/apache/hudi/pull/6247#issuecomment-1199242806 ## CI report: * 1c96edb81b1623c50975c8d3fd81241a81e40445 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10457)
[GitHub] [hudi] pratyakshsharma commented on a diff in pull request #4718: [HUDI-3345][RFC-36] Proposal for hudi metastore server.
pratyakshsharma commented on code in PR #4718: URL: https://github.com/apache/hudi/pull/4718#discussion_r933216702 ## rfc/rfc-36/rfc-36.md: ## @@ -0,0 +1,605 @@ + +# RFC-36: Hudi Metastore Server + +## Proposers + +- @minihippo + +## Approvers + + +## Status + +JIRA: [HUDI-3345](https://issues.apache.org/jira/browse/HUDI-3345) + +> Please keep the status updated in `rfc/README.md`. + +# Hudi Metastore Server + +## Abstract + +Currently, Hudi is widely used as a table format in the data warehouse. There is a lack of central metastore server to manage the metadata of data lake table. Hive metastore as a commonly used catalog service in the data warehouse on Hadoop cannot store the unique metadata like timeline of the hudi table. + +The proposal is to implement an unified metadata management system called hudi metastore server to store the metadata of the hudi table, and be compatible with hive metastore so that other engines can access it without any changes. + +## Backgroud + +**How Hudi metadata is stored** + +The metadata of hudi are table location, configuration and schema, timeline generated by instants, metadata of each commit / instant, which records files created / updated, new records num and so on in this commit. Besides, the information of files in a hudi table is also a part of hudi metadata. + +Different from instant or schema recorded by a separate file that is stored under `${tablelocation}/.hoodie` on the HDFS or object storage, files info are managed by the HDFS directly. Hudi gets all files of a table by file listing. File listing is a costly operation and its performance is limited by namenode. In addition, there will be a few invalid files on the file system, which are created by spark speculative tasks(for example) and are not deleted successfully. Getting files by listing will result in inconsistency, so hudi has to store the valid files from each commit metadata, the metadata about files is usually referred to snapshot. 
+ +RFC-15 metadata table is a proposal that can solve these problems. However, it only manages the metadata of one table. There is a lack of a unified view. + +**The integration of Hive metastore and Hudi metadata lacks a single source of truth.** + +Hive metastore server is widely used as a metadata center in the data warehouse on Hadoop. It stores the metadata for hive tables like their schema, location and partitions. Currently, almost all of the storage or computing engines support registering table information to it, discovering and retrieving metadata from it. Meanwhile, cloud service providers like AWS Glue, HUAWEI Cloud, Google Cloud Dataproc, Alibaba Cloud, ByteDance Volcano Engine all provide Apache Hive metastore compatible catalog. It seems that hive metastore has become a standard in the data warehouse. + +Different from the traditional table format like hive table, the data lake table not only has schema, partitions and other hive metadata, but also has timeline, snapshot which is unconventional. Hence, the metadata of data lake cannot be managed by HMS directly. + +Hudi just syncs the schema and partitions to HMS by now, and other metadata still stores on HDFS or object store. Metadata synchronization between different metadata management systems will result in inconsistency. + +## Overview + +![architecture](architecture.png) + +The hudi metastore server is for metadata management of the data lake table, to support metadata persistency, efficient metadata access and other extensions for data lake. The metadata server managed includes the information of databases and tables, partitions, schemas, instants, instants' meta and files' meta. + +The metastore server has two main components: service and storage. The storage is for metadata persistency and the service is to receive the get / put requests from client and return / store the processing result after doing some logical operations on metadata. 
+ +The hudi metastore server is / has + +- **A metastore server for data lake** +- Different from the traditional table format, the metadata of the data lake has timeline and snapshot concepts, in addition to schema and partitions. + +- The metastore server is a unified metadata management system for data lake tables. + +- **Pluggable storage** +- The storage is only responsible for metadata persistency. Therefore, it doesn't matter which storage engine is used to store the data; it can be an RDBMS, kv system or file system. + +- **Easy to be expanded** +- The service is stateless, so it can be scaled horizontally to support higher QPS. The storage can be split vertically to store more data. + +- **Compatible with multiple computing engines** +- The server has an adapter to be compatible with hive metastore server. + +## Design + +This part has four sections: what the service does, what metadata is stored and how, how the service interacts with the storage when reading
[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback
wzx140 commented on code in PR #6132: URL: https://github.com/apache/hudi/pull/6132#discussion_r933207193

## rfc/rfc-46/rfc-46.md:

@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiation)

 Record Merge API

-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge to one API. Stateless component interface providing for API Combining Records will look like following:

 ```java
-interface HoodieMerge {
-  HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-  Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+  // combineAndGetUpdateValue and precombine
+  Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+
+  // The record type handled by the current merger
+  // SPARK, AVRO, FLINK
+  HoodieRecordType getRecordType();
 }

-/**
- * Spark-specific implementation
- */
-class HoodieSparkRecordMerge implements HoodieMerge {
+/**
+ * Spark-specific implementation
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {

-  @Override
-  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-    // HoodieSparkRecords preCombine
-  }
+  @Override
+  Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+    // HoodieSparkRecord precombine and combineAndGetUpdateValue
+  }

-  @Override
-  public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-    // HoodieSparkRecord combineAndGetUpdateValue
-  }
+  @Override
+  HoodieRecordType getRecordType() {
+    return HoodieRecordType.SPARK;
+  }
 }

-/**
- * Flink-specific implementation
- */
-class HoodieFlinkRecordMerge implements HoodieMerge {
-
-  @Override
-  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-    // HoodieFlinkRecord preCombine
-  }
+/**
+ * Flink-specific implementation
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+  @Override
+  Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+    // HoodieFlinkRecord precombine and combineAndGetUpdateValue
+  }

-  @Override
-  public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-    // HoodieFlinkRecord combineAndGetUpdateValue
-  }
+  @Override
+  HoodieRecordType getRecordType() {
+    return HoodieRecordType.FLINK;
+  }
 }
 ```

 Where user can provide their own subclass implementing such interface for the engines of interest.

- Migration from `HoodieRecordPayload` to `HoodieMerge`
+ Migration from `HoodieRecordPayload` to `HoodieRecordMerger`

 To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
-already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieMerge`, that will
+already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieRecordMerger` called `HoodieAvroRecordMerger`, that will
 be using user-defined subclass of `HoodieRecordPayload` to combine the records.

 Leveraging such bridge will make provide for seamless BWC migration to the 0.11 release, however will be removing the performance
 benefit of this refactoring, since it would unavoidably have to perform conversion to intermediate representation (Avro). To realize
 full-suite of benefits of this refactoring, users will have to migrate their merging logic out of `HoodieRecordPayload` subclass and into
-new `HoodieMerge` implementation.
+new `HoodieRecordMerger` implementation.
+
+Precombine is used to merge records from logs or incoming records; CombineAndGetUpdateValue is used to merge record from log file and record from base file.
+These two merge logics are not exactly the same for some RecordPayload, such as OverwriteWithLatestAvroPayload.
+We add an Enum in HoodieRecord to mark where it comes from (BASE, LOG or WRITE). `HoodieAvroRecordMerger`'s API will look like following:

Review Comment: I think you're right. I've removed the mark (BASE, LOG or WRITE) in HoodieRecord and unified the logic of HoodieSparkRecord.
[GitHub] [hudi] hudi-bot commented on pull request #6247: [MINOR] Add license header
hudi-bot commented on PR #6247: URL: https://github.com/apache/hudi/pull/6247#issuecomment-1199234091 ## CI report: * 1c96edb81b1623c50975c8d3fd81241a81e40445 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
hudi-bot commented on PR #6242: URL: https://github.com/apache/hudi/pull/6242#issuecomment-1199234025 ## CI report: * 4887d5c40a6b62998d6f5e64e06e91a326129ff8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10450) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
hudi-bot commented on PR #5629: URL: https://github.com/apache/hudi/pull/5629#issuecomment-1199233077 ## CI report: * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN * 652a0d666fe29487d3ce2c2ce1cef70dc443dd61 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10302) * 279857485f18875cab94f72b5bf61522bdaecd31 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6246: Be able to disable precombine field when table schema contains a field named ts
hudi-bot commented on PR #6246: URL: https://github.com/apache/hudi/pull/6246#issuecomment-1199228989 ## CI report: * 0c6cfaaeb0512d426753e989b6fcc72c5d79293b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10455) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6245: [HUDI-4506] make BucketIndexPartitioner distribute data more balance
hudi-bot commented on PR #6245: URL: https://github.com/apache/hudi/pull/6245#issuecomment-1199228946 ## CI report: * 003df191ea86c299144f8a577ba817bb52ecd593 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10454) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
hudi-bot commented on PR #6242: URL: https://github.com/apache/hudi/pull/6242#issuecomment-1199228878 ## CI report: * 4887d5c40a6b62998d6f5e64e06e91a326129ff8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10450) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] haripriyarhp commented on issue #6166: [SUPPORT] Missing records when using Kafka Hudi sink to write to S3.
haripriyarhp commented on issue #6166: URL: https://github.com/apache/hudi/issues/6166#issuecomment-1199226564

@rmahindra123: Unfortunately, I am not able to share the .hoodie folder. Just to add, yesterday I tried it out again. I sent messages to a topic in batches. Below are the steps I followed:

1. Sent a batch of 100 records to Kafka. Ran compaction. No. of messages in Kafka and no. of records in Athena matched.
2. Sent a batch of another 100 records to Kafka -> compaction -> no. of messages in Kafka = no. of records in Athena.
3. Sent a batch of another 100 records (here there were some duplicates) -> compaction -> no. of messages in Kafka = no. of records in Athena.
4. Sent another batch of 98 records (some were duplicates) -> compaction -> no. of messages != no. of records in Athena. There were no more files to be compacted. About 24 records were missing.
5. Sent another 100 records -> compaction -> record count did not match; the same 24 were missing.

More or less, I followed the above steps several times before I raised the issue here. Each time, after a few runs the record count does not match even after running compaction.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codope opened a new pull request, #6247: [MINOR] Add license header
codope opened a new pull request, #6247: URL: https://github.com/apache/hudi/pull/6247 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #6242: [HUDI-4505] Returns instead of throws if lock file exists for FileSys…
danny0405 commented on PR #6242: URL: https://github.com/apache/hudi/pull/6242#issuecomment-1199208040 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (HUDI-4504) Disable metadata table by default for flink
[ https://issues.apache.org/jira/browse/HUDI-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen resolved HUDI-4504.
------------------------------

> Disable metadata table by default for flink
> -------------------------------------------
>
>                 Key: HUDI-4504
>                 URL: https://issues.apache.org/jira/browse/HUDI-4504
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: flink
>            Reporter: Danny Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.12.0
>

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (HUDI-4504) Disable metadata table by default for flink
[ https://issues.apache.org/jira/browse/HUDI-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572944#comment-17572944 ]

Danny Chen commented on HUDI-4504:
----------------------------------

Fixed via master branch: a1cf401350ee7f8a66b4e927bce22b45a11260fc

> Disable metadata table by default for flink
> -------------------------------------------
>
>                 Key: HUDI-4504
>                 URL: https://issues.apache.org/jira/browse/HUDI-4504
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: flink
>            Reporter: Danny Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.12.0
>

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
[hudi] branch master updated: [HUDI-4504] Disable metadata table by default for flink (#6241)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new a1cf401350 [HUDI-4504] Disable metadata table by default for flink (#6241)
a1cf401350 is described below

commit a1cf401350ee7f8a66b4e927bce22b45a11260fc
Author: Danny Chan
AuthorDate: Fri Jul 29 20:06:24 2022 +0800

    [HUDI-4504] Disable metadata table by default for flink (#6241)
---
 .../src/main/java/org/apache/hudi/configuration/FlinkOptions.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
index 0984296ee5..933c112312 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
@@ -104,8 +104,8 @@ public class FlinkOptions extends HoodieConfig {
   public static final ConfigOption METADATA_ENABLED = ConfigOptions
       .key("metadata.enabled")
       .booleanType()
-      .defaultValue(true)
-      .withDescription("Enable the internal metadata table which serves table metadata like level file listings, default enabled");
+      .defaultValue(false)
+      .withDescription("Enable the internal metadata table which serves table metadata like level file listings, default disabled");

   public static final ConfigOption METADATA_COMPACTION_DELTA_COMMITS = ConfigOptions
       .key("metadata.compaction.delta_commits")
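The commit above flips the `metadata.enabled` default to `false` for Flink, so users who still want the metadata table must opt back in per table. A hedged sketch of what that could look like in Flink SQL: the table name, schema, and path below are made up for illustration; only the `'metadata.enabled'` option key comes from the diff above, and the `'connector'`/`'path'` options follow the usual hudi-flink table conventions.

```sql
-- Hypothetical table; only the 'metadata.enabled' key is taken from FlinkOptions above.
CREATE TABLE hudi_orders (
  order_id STRING PRIMARY KEY NOT ENFORCED,
  amount   DOUBLE,
  ts       TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_orders',
  'metadata.enabled' = 'true'   -- opt back in after the default change above
);
```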
[GitHub] [hudi] danny0405 merged pull request #6241: [HUDI-4504] Disable metadata table by default for flink
danny0405 merged PR #6241: URL: https://github.com/apache/hudi/pull/6241 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #6241: [HUDI-4504] Disable metadata table by default for flink
danny0405 commented on PR #6241: URL: https://github.com/apache/hudi/pull/6241#issuecomment-1199200078 https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10449&view=results The CI is actually green and I would merge it then ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
svn commit: r56033 - in /dev/hudi/hudi-0.12.0-rc1: ./ hudi-0.12.0-rc1.src.tgz hudi-0.12.0-rc1.src.tgz.asc hudi-0.12.0-rc1.src.tgz.sha512
Author: codope Date: Fri Jul 29 11:50:52 2022 New Revision: 56033 Log: Add source distribution for 0.12.0-rc1 Added: dev/hudi/hudi-0.12.0-rc1/ dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz (with props) dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.asc dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.sha512 Added: dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz == Binary file - no diff available. Propchange: dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz -- svn:mime-type = application/octet-stream Added: dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.asc == --- dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.asc (added) +++ dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.asc Fri Jul 29 11:50:52 2022 @@ -0,0 +1,14 @@ +-BEGIN PGP SIGNATURE- + +iQGzBAABCAAdFiEE/SFTQuMZlBmt+/Qd1GI+OqFtdbAFAmLjyHUACgkQ1GI+OqFt +dbDUCAv+LzOubgFQQ3eDQtXZid+jPHbH1yLxLh9gLDkPRPE3eaUE9tMpl83d8zKU +eY4kmD4Byax4FzQnFbcSSdWniXXh2cj5GVLYGO3EQirQ+evkY+ZSIP5JK2mrlJ9B +ZlbPkC3S4egsxZVKE+ytMz4vvCvVgO3y19VfAmMvWyDq3st3aNDjmF4962RJUXoK +oCr/6/6A56/q94qniLJR4XOAK49VZdsuuumBi8ldoSU5KraNtuCs8MLd13EyxcW4 +gYGtLp1qmvt21NT9YG5NI4XKIT9+/LAoX9P7q9DSkib7iyFn3wnZsvwiVhIsh94A +UxNni1mjRGIerJkD3ZpHQZWdUsgpaqnQ9qROwIunsUMzr4stYOhEMzcq27Orl+uX +rasgdjCGD6MV/TWbGpmU2qrd1CO976BkCJC0o5+2rSrmCl68atlNGT6XnCzg4Mkg +zuGeSTNAyPNbDhhQHGHZyH9PV5HbMGq3Q4Vk4dk7Ke04AzT3Ru5NQ9kA8dlR4dYq +qJ89SlsH +=SXT4 +-END PGP SIGNATURE- Added: dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.sha512 == --- dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.sha512 (added) +++ dev/hudi/hudi-0.12.0-rc1/hudi-0.12.0-rc1.src.tgz.sha512 Fri Jul 29 11:50:52 2022 @@ -0,0 +1 @@ +ee289f7b0c26211e8b8d9d4f645c9ce01d8f4e75c71d998141626df2fd4adf8cef957a2959feaa06505a815ad582a039d7a67ccfff3b71a6c61733918c520486 hudi-0.12.0-rc1.src.tgz
[hudi] annotated tag release-0.12.0-rc1 updated (3383b2388b -> 170eb40a62)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to annotated tag release-0.12.0-rc1 in repository https://gitbox.apache.org/repos/asf/hudi.git *** WARNING: tag release-0.12.0-rc1 was modified! *** from 3383b2388b (commit) to 170eb40a62 (tag) tagging 3383b2388bd0b107646edf38e98bdb5ee88281bc (commit) replaces hoodie-0.4.7 by Sagar Sumit on Fri Jul 29 17:17:54 2022 +0530 - Log - 0.12.0 -BEGIN PGP SIGNATURE- iI0EABYKADUWIQQ7EyGPQog2tHUAvYP7nRNrCtiI9gUCYuPI6hccc2FnYXJzdW1p dDA5QGdtYWlsLmNvbQAKCRD7nRNrCtiI9pBQAP9uEKBKPDKzcNtC6qtGnNT08q0t eMT5jCwNffY9+ztiSQEA45MDutsbEhWq+MVbE3tF9W+KzEj/Um7t4vaJ9tV49A8= =4zJ1 -END PGP SIGNATURE- --- No new revisions were added by this update. Summary of changes:
[hudi] annotated tag release-0.12.0-rc1 updated (3383b2388b -> 26d7199091)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to annotated tag release-0.12.0-rc1 in repository https://gitbox.apache.org/repos/asf/hudi.git *** WARNING: tag release-0.12.0-rc1 was modified! *** from 3383b2388b (commit) to 26d7199091 (tag) tagging 3383b2388bd0b107646edf38e98bdb5ee88281bc (commit) replaces hoodie-0.4.7 by Sagar Sumit on Fri Jul 29 17:06:56 2022 +0530 - Log - 0.12.0 -BEGIN PGP SIGNATURE- iI0EABYKADUWIQQ7EyGPQog2tHUAvYP7nRNrCtiI9gUCYuPGWBccc2FnYXJzdW1p dDA5QGdtYWlsLmNvbQAKCRD7nRNrCtiI9tXnAP9gc/O/gDz0YztHMPtZ1wQDBZDG HrbYlapQE/AFa39ZwQD/ZzEZbrKd70JkUD+QEKW0Dt/NGDzyClTZ4Q9Ridr/dwA= =ipMK -END PGP SIGNATURE- --- No new revisions were added by this update. Summary of changes:
[hudi] tag release-0.12.0-rc1 created (now 3383b2388b)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to tag release-0.12.0-rc1 in repository https://gitbox.apache.org/repos/asf/hudi.git at 3383b2388b (commit) No new revisions were added by this update.
[GitHub] [hudi] hudi-bot commented on pull request #6246: Be able to disable precombine field when table schema contains a field named ts
hudi-bot commented on PR #6246: URL: https://github.com/apache/hudi/pull/6246#issuecomment-1199131031 ## CI report: * 0c6cfaaeb0512d426753e989b6fcc72c5d79293b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10455) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org