Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-07 Thread via GitHub


danny0405 merged PR #11150:
URL: https://github.com/apache/hudi/pull/11150


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-07 Thread via GitHub


danny0405 commented on code in PR #11150:
URL: https://github.com/apache/hudi/pull/11150#discussion_r1593352816


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##
@@ -457,9 +457,10 @@ private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
 
 String readPathString =
     String.join(",", Arrays.stream(paths).map(StoragePath::toString).toArray(String[]::new));
+String globPathString = String.join(",", Arrays.stream(paths).map(StoragePath::getParent).map(StoragePath::toString).distinct().toArray(String[]::new));
 params.put("hoodie.datasource.read.paths", readPathString);
 // Building HoodieFileIndex needs this param to decide query path
-params.put("glob.paths", readPathString);
+params.put("glob.paths", globPathString);
 

Review Comment:
   Fine, let's merge it first.
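
   For readers following the hunk above, here is a minimal standalone sketch of the two strings it builds. It assumes `java.nio.file.Path` as a stand-in for Hudi's `StoragePath` (only `getParent` and `toString` are used); the class name `GlobPathSketch` and the sample file paths are made up for illustration. It is not the code from the PR.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical illustration class; java.nio.file.Path stands in for StoragePath.
public class GlobPathSketch {

    // Every full file path, comma-joined (mirrors readPathString in the hunk).
    static String readPaths(Path[] paths) {
        return String.join(",",
            Arrays.stream(paths).map(Path::toString).toArray(String[]::new));
    }

    // Parent directory of every file, de-duplicated in encounter order,
    // comma-joined (mirrors globPathString in the hunk).
    static String globPaths(Path[] paths) {
        return String.join(",",
            Arrays.stream(paths)
                .map(Path::getParent)
                .map(Path::toString)
                .distinct()
                .toArray(String[]::new));
    }

    public static void main(String[] args) {
        // Made-up sample paths: two files share one partition directory.
        Path[] paths = {
            Paths.get("/tbl/p1/f1.parquet"),
            Paths.get("/tbl/p1/f2.parquet"),
            Paths.get("/tbl/p2/f3.parquet")
        };
        System.out.println("read paths: " + readPaths(paths));
        System.out.println("glob paths: " + globPaths(paths));
    }
}
```

   When several files sit in the same directory, the glob string names that directory once, while the read string still lists every file.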






Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-06 Thread via GitHub


the-other-tim-brown commented on code in PR #11150:
URL: https://github.com/apache/hudi/pull/11150#discussion_r1591567763



Review Comment:
   Looks like it is already covered there






Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-05 Thread via GitHub


danny0405 commented on code in PR #11150:
URL: https://github.com/apache/hudi/pull/11150#discussion_r1590454993



Review Comment:
   Not sure whether `TestHoodieSparkMergeOnReadTableClustering` is the candidate.






Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-04 Thread via GitHub


the-other-tim-brown commented on code in PR #11150:
URL: https://github.com/apache/hudi/pull/11150#discussion_r1590187272



Review Comment:
   I can't find a test class matching this class name. Is there a clustering 
test suite I should look in?






Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-04 Thread via GitHub


danny0405 commented on code in PR #11150:
URL: https://github.com/apache/hudi/pull/11150#discussion_r1590186996



Review Comment:
   Do we have any test cases?






Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-04 Thread via GitHub


hudi-bot commented on PR #11150:
URL: https://github.com/apache/hudi/pull/11150#issuecomment-2094365713

   
   ## CI report:
   
   * 353708c54b454bf3749596f74267970f1c332b7b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23660)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-04 Thread via GitHub


hudi-bot commented on PR #11150:
URL: https://github.com/apache/hudi/pull/11150#issuecomment-2094339138

   
   ## CI report:
   
   * 11abd3eb1b9418d9013f820e3779f56c50810dfd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23658)
   * 353708c54b454bf3749596f74267970f1c332b7b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23660)
 
   
   
   





Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-04 Thread via GitHub


hudi-bot commented on PR #11150:
URL: https://github.com/apache/hudi/pull/11150#issuecomment-2094337137

   
   ## CI report:
   
   * 11abd3eb1b9418d9013f820e3779f56c50810dfd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23658)
 
   * 353708c54b454bf3749596f74267970f1c332b7b UNKNOWN
   
   
   





Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-04 Thread via GitHub


hudi-bot commented on PR #11150:
URL: https://github.com/apache/hudi/pull/11150#issuecomment-2094335218

   
   ## CI report:
   
   * 11abd3eb1b9418d9013f820e3779f56c50810dfd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23658)
 
   
   
   





Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-04 Thread via GitHub


hudi-bot commented on PR #11150:
URL: https://github.com/apache/hudi/pull/11150#issuecomment-2094313377

   
   ## CI report:
   
   * 11abd3eb1b9418d9013f820e3779f56c50810dfd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23658)
 
   
   
   





Re: [PR] [MINOR] Use parent as the glob path when full file path specified [hudi]

2024-05-04 Thread via GitHub


hudi-bot commented on PR #11150:
URL: https://github.com/apache/hudi/pull/11150#issuecomment-2094309158

   
   ## CI report:
   
   * 11abd3eb1b9418d9013f820e3779f56c50810dfd UNKNOWN
   
   
   

