Re: [PR] fix(spark): Add options for archive procedure [hudi]

via GitHub Thu, 21 May 2026 22:10:49 -0700


hudi-agent commented on code in PR #18437:
URL: https://github.com/apache/hudi/pull/18437#discussion_r3285956482



##########
hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/ArchiveExecutorUtils.java:
##########
@@ -53,12 +54,26 @@ public static int archive(JavaSparkContext jsc,
                             int maxCommits,
                             int commitsRetained,
                             boolean enableMetadata,
-                            String basePath) throws IOException {
+                            String basePath,
+                            Map<String, String> conf) throws IOException {

Review Comment:
   🤖 nit: the parameter is a map of override options, not a single config — 
could you rename `conf` to something like `extraConfigs` or `options` to match 
how the caller in `ArchiveCommitsProcedure` describes it? Singular `conf` reads 
as a single value.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/ArchiveExecutorUtils.java:
##########
@@ -73,4 +88,4 @@ public static int archive(JavaSparkContext jsc,
     }
     return 0;
   }
-}
+}

Review Comment:
   🤖 nit: looks like the trailing newline got dropped from this file (`\ No 
newline at end of file` in the diff). Could you add it back?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala:
##########
@@ -107,11 +107,56 @@ object HoodieCLIUtils extends Logging {
     }
   }
 
+  /**
+   * Parse a comma-separated string of key=value pairs into a Map.
+   *
+   * Notes:
+   *  - Whitespace surrounding keys/values is trimmed; empty tokens (e.g. from a
+   *    trailing comma or `", ,"`) are silently ignored.
+   *  - The delimiter is the first `=` in a token, so values may themselves
+   *    contain `=` (e.g. `k=a=b` parses to `k -> "a=b"`).
+   *  - Values cannot contain literal commas; the parser does not support
+   *    escaping. Configs that need commas should be set via Spark conf instead.
+   *  - If the same key appears more than once, a WARN is logged and the last
+   *    occurrence wins (consistent with `toMap`'s last-write-wins semantics).
+   *
+   * @throws IllegalArgumentException if a non-empty token does not contain `=`
+   *                                  or has an empty key.
+   */
   def extractOptions(s: String): Map[String, String] = {
-    StringUtils.split(s, ",").asScala
-      .map(split => StringUtils.split(split, "="))
-      .map(pair => pair.get(0) -> pair.get(1))
-      .toMap
+    if (s == null) {
+      Map.empty
+    } else {
+      val pairs = StringUtils.split(s, ",").asScala
+        .map(_.trim)
+        .filter(_.nonEmpty)
+        .map(token => {
+          val delimiterIndex = token.indexOf('=')
+          if (delimiterIndex <= 0) {
+            throw new IllegalArgumentException(
+              s"Invalid options format: '$token'. Expected 'key=value' pairs separated by commas, "
+                + "for example: 'k1=v1,k2=v2'.")
+          }
+
+          val key = token.substring(0, delimiterIndex).trim
+          if (key.isEmpty) {
+            throw new IllegalArgumentException(
+              s"Invalid options format: '$token'. Option key must not be empty and options should "
+                + "follow 'key=value' format.")
+          }
+
+          val value = token.substring(delimiterIndex + 1).trim
+          key -> value
+        })
+        .toSeq
+
+      val duplicates = pairs.groupBy(_._1).collect { case (k, vs) if vs.size > 1 => k }
+      if (duplicates.nonEmpty) {
+        logWarning(s"Duplicate option keys detected: ${duplicates.mkString(", ")}. "
+          + "The last occurrence will take effect.")

Review Comment:
   🤖 nit: `pairs` is built as a `Seq`, then traversed once for the duplicate 
check and again for `.toMap`. Could you compute duplicates inline while 
folding, or just use a single pass (e.g. `groupBy` once and derive both)? 
Minor, but it avoids the double traversal and the slightly awkward 
`toSeq`/`toMap` pair.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>
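
   For illustration, the single-pass idea in the comment above could look
   roughly like the sketch below. This is not the PR's code: the object name
   `ExtractOptionsSketch` and the method `parse` are hypothetical, plain
   `String.split` stands in for Hudi's `StringUtils.split`, and the two
   validation errors are merged into one message for brevity. The fold carries
   both the resulting map and the set of keys seen more than once, so no
   intermediate `Seq` or second traversal is needed.

```scala
object ExtractOptionsSketch {
  // Hypothetical stand-in for extractOptions: returns the parsed options
  // together with the keys that occurred more than once (last value wins).
  def parse(s: String): (Map[String, String], Set[String]) = {
    if (s == null) {
      (Map.empty, Set.empty)
    } else {
      // Assumption: String.split replaces Hudi's StringUtils.split here.
      s.split(",").iterator
        .map(_.trim)
        .filter(_.nonEmpty)
        .foldLeft((Map.empty[String, String], Set.empty[String])) {
          case ((opts, dups), token) =>
            // The first '=' is the delimiter, so values may contain '='.
            val idx = token.indexOf('=')
            val key = if (idx > 0) token.substring(0, idx).trim else ""
            if (key.isEmpty) {
              throw new IllegalArgumentException(
                s"Invalid options format: '$token'. Expected 'key=value'.")
            }
            val value = token.substring(idx + 1).trim
            // Record the key as a duplicate if it was already seen.
            val newDups = if (opts.contains(key)) dups + key else dups
            (opts.updated(key, value), newDups)
        }
    }
  }
}
```

   A caller would then log the warning once, only if the returned duplicate
   set is non-empty, and use the returned map directly.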



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
