[GitHub] [hudi] nsivabalan commented on a change in pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

GitBox Wed, 19 Jan 2022 16:52:14 -0800


nsivabalan commented on a change in pull request #4606:
URL: https://github.com/apache/hudi/pull/4606#discussion_r788247581




##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
##########
@@ -207,41 +199,71 @@
       .withDocumentation("Enable use z-ordering/space-filling curves to 
optimize the layout of table to boost query performance. "
           + "This parameter takes precedence over clustering strategy set 
using " + EXECUTION_STRATEGY_CLASS_NAME.key());
 
-  public static final ConfigProperty LAYOUT_OPTIMIZE_STRATEGY = ConfigProperty
+  /**
+   * Determines ordering strategy in for records layout optimization.
+   * Currently, following strategies are supported
+   * <ul>
+   *   <li>Linear: simply orders records lexicographically</li>
+   *   <li>Z-order: orders records along Z-order spatial-curve</li>
+   *   <li>Hilbert: orders records along Hilbert's spatial-curve</li>
+   * </ul>
+   *
+   * NOTE: "z-order", "hilbert" strategies may consume considerably more 
compute, than "linear".
+   *       Make sure to perform small-scale local testing for your dataset 
before applying globally.
+   */
+  public static final ConfigProperty<String> LAYOUT_OPTIMIZE_STRATEGY = 
ConfigProperty
       .key(LAYOUT_OPTIMIZE_PARAM_PREFIX + "strategy")
       .defaultValue("z-order")
       .sinceVersion("0.10.0")
-      .withDocumentation("Type of layout optimization to be applied, current 
only supports `z-order` and `hilbert` curves.");
+      .withDocumentation("Determines ordering strategy used in records layout 
optimization. "
+          + "Currently supported strategies are \"linear\", \"z-order\" and 
\"hilbert\" values are supported.");
 
   /**
-   * There exists two method to build z-curve.
-   * one is directly mapping sort cols to z-value to build z-curve;
-   * we can find this method in Amazon DynamoDB 
https://aws.amazon.com/cn/blogs/database/tag/z-order/
-   * the other one is Boundary-based Interleaved Index method which we 
proposed. simply call it sample method.
-   * Refer to rfc-28 for specific algorithm flow.
-   * Boundary-based Interleaved Index method has better generalization, but 
the build speed is slower than direct method.
+   * NOTE: This setting only has effect if {@link #LAYOUT_OPTIMIZE_STRATEGY} 
value is set to
+   *       either "z-order" or "hilbert" (ie leveraging space-filling curves)
+   *
+   * Currently, two methods to order records along the curve are supported 
"build" and "sample":

Review comment:
       is it "Direct" instead of "build" ?

##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -134,16 +134,28 @@ public MultipleSparkJobExecutionStrategy(HoodieTable 
table, HoodieEngineContext
    * @return {@link RDDCustomColumnsSortPartitioner} if sort columns are 
provided, otherwise empty.
    */
   protected Option<BulkInsertPartitioner<T>> getPartitioner(Map<String, 
String> strategyParams, Schema schema) {
-    if (getWriteConfig().isLayoutOptimizationEnabled()) {
-      // sort input records by z-order/hilbert
-      return Option.of(new 
RDDSpatialCurveOptimizationSortPartitioner((HoodieSparkEngineContext) 
getEngineContext(),
-          getWriteConfig(), HoodieAvroUtils.addMetadataFields(schema)));
-    } else if (strategyParams.containsKey(PLAN_STRATEGY_SORT_COLUMNS.key())) {
-      return Option.of(new 
RDDCustomColumnsSortPartitioner(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()).split(","),
-          HoodieAvroUtils.addMetadataFields(schema), 
getWriteConfig().isConsistentLogicalTimestampEnabled()));
-    } else {
-      return Option.empty();
-    }
+    Option<String[]> orderByColumnsOpt =
+        Option.ofNullable(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()))
+            .map(listStr -> listStr.split(","));
+
+    return orderByColumnsOpt.map(orderByColumns -> {

Review comment:
       what happens is sort columns config is null or set to empty string




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [hudi] nsivabalan commented on a change in pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering

Reply via email to