alexeykudinkin commented on a change in pull request #4606: URL: https://github.com/apache/hudi/pull/4606#discussion_r788291798
########## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java ########## @@ -207,41 +199,71 @@ .withDocumentation("Enable use z-ordering/space-filling curves to optimize the layout of table to boost query performance. " + "This parameter takes precedence over clustering strategy set using " + EXECUTION_STRATEGY_CLASS_NAME.key()); - public static final ConfigProperty LAYOUT_OPTIMIZE_STRATEGY = ConfigProperty + /** + * Determines ordering strategy in for records layout optimization. + * Currently, following strategies are supported + * <ul> + * <li>Linear: simply orders records lexicographically</li> + * <li>Z-order: orders records along Z-order spatial-curve</li> + * <li>Hilbert: orders records along Hilbert's spatial-curve</li> + * </ul> + * + * NOTE: "z-order", "hilbert" strategies may consume considerably more compute, than "linear". + * Make sure to perform small-scale local testing for your dataset before applying globally. + */ + public static final ConfigProperty<String> LAYOUT_OPTIMIZE_STRATEGY = ConfigProperty .key(LAYOUT_OPTIMIZE_PARAM_PREFIX + "strategy") .defaultValue("z-order") .sinceVersion("0.10.0") - .withDocumentation("Type of layout optimization to be applied, current only supports `z-order` and `hilbert` curves."); + .withDocumentation("Determines ordering strategy used in records layout optimization. " + + "Currently supported strategies are \"linear\", \"z-order\" and \"hilbert\" values are supported."); /** - * There exists two method to build z-curve. - * one is directly mapping sort cols to z-value to build z-curve; - * we can find this method in Amazon DynamoDB https://aws.amazon.com/cn/blogs/database/tag/z-order/ - * the other one is Boundary-based Interleaved Index method which we proposed. simply call it sample method. - * Refer to rfc-28 for specific algorithm flow. - * Boundary-based Interleaved Index method has better generalization, but the build speed is slower than direct method. + * NOTE: This setting only has effect if {@link #LAYOUT_OPTIMIZE_STRATEGY} value is set to + * either "z-order" or "hilbert" (ie leveraging space-filling curves) + * + * Currently, two methods to order records along the curve are supported "build" and "sample": Review comment: Good catch! ########## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java ########## @@ -134,16 +134,28 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext * @return {@link RDDCustomColumnsSortPartitioner} if sort columns are provided, otherwise empty. */ protected Option<BulkInsertPartitioner<T>> getPartitioner(Map<String, String> strategyParams, Schema schema) { - if (getWriteConfig().isLayoutOptimizationEnabled()) { - // sort input records by z-order/hilbert - return Option.of(new RDDSpatialCurveOptimizationSortPartitioner((HoodieSparkEngineContext) getEngineContext(), - getWriteConfig(), HoodieAvroUtils.addMetadataFields(schema))); - } else if (strategyParams.containsKey(PLAN_STRATEGY_SORT_COLUMNS.key())) { - return Option.of(new RDDCustomColumnsSortPartitioner(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()).split(","), - HoodieAvroUtils.addMetadataFields(schema), getWriteConfig().isConsistentLogicalTimestampEnabled())); - } else { - return Option.empty(); - } + Option<String[]> orderByColumnsOpt = + Option.ofNullable(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key())) + .map(listStr -> listStr.split(",")); + + return orderByColumnsOpt.map(orderByColumns -> { Review comment: It will fallback to no-op in that case -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org