[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r386218656 ## File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java ## @@ -381,20 +384,30 @@ public static SparkConf registerClasses(SparkConf conf) { } } + private BulkInsertMapFunction getBulkInsertMapFunction( Review comment: Lets do whats easier and gets us moving faster .. :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r376236746 ## File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java ## @@ -381,20 +384,30 @@ public static SparkConf registerClasses(SparkConf conf) { } } + private BulkInsertMapFunction getBulkInsertMapFunction( Review comment: I was looking at this for a different JIRA. but having a class for BulkInsertMapfunction seems like an overkill? Can we just replace with lambdas and kill the class ? eg, in the code on master, just change to below ``` JavaRDD writeStatusRDD = repartitionedRecords .mapPartitionsWithIndex((partition, recordItr) -> new CopyOnWriteLazyInsertIterable<>(recordItr, config, commitTime, table, fileIDPrefixes.get(partition)), true) .flatMap(List::iterator); ``` if you agree, just do it in the PR.. that way I wont introduce an unnecessary rebase for you :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r376236783 ## File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java ## @@ -381,20 +384,30 @@ public static SparkConf registerClasses(SparkConf conf) { } } + private BulkInsertMapFunction getBulkInsertMapFunction( Review comment: cc @yihua This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r374512864 ## File path: hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/BulkInsertInternalPartitioner.java ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.func.bulkinsert; + +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.table.UserDefinedBulkInsertPartitioner; + +public abstract class BulkInsertInternalPartitioner implements +UserDefinedBulkInsertPartitioner { Review comment: I was trying to generalize the existing interface, since what we are doing here is not userdefined anyway.. >but it might be best to avoid a breaking change since it's simple in this case ? this is also fair.. so we can leave the current interface as is and you can just extend/implement it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362366288 ## File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java ## @@ -367,20 +370,30 @@ public static SparkConf registerClasses(SparkConf conf) { } } + private BulkInsertMapFunction getBulkInsertMapFunction( + boolean isSorted, String commitTime, HoodieWriteConfig config, HoodieTable hoodieTable, + List fileIDPrefixes) { +if (isSorted) { + return new BulkInsertMapFunctionForSortedRecords( + commitTime, config, hoodieTable, fileIDPrefixes); +} +return new BulkInsertMapFunctionForNonSortedRecords( +commitTime, config, hoodieTable, fileIDPrefixes); + } + private JavaRDD bulkInsertInternal(JavaRDD> dedupedRecords, String commitTime, HoodieTable table, Option bulkInsertPartitioner) { Review comment: This was added by @ovj and uber/marmaray project passes in an implementation using the RDD api This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362295734 ## File path: hudi-client/src/main/java/org/apache/hudi/func/CopyOnWriteInsertHandler.java ## @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.func; + +import java.util.ArrayList; +import java.util.List; +import org.apache.hudi.WriteStatus; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.func.CopyOnWriteLazyInsertIterable.HoodieInsertValueGenResult; +import org.apache.hudi.io.HoodieCreateHandle; +import org.apache.hudi.io.HoodieWriteHandle; +import org.apache.hudi.table.HoodieTable; + +/** + * Consumes stream of hoodie records from in-memory queue and writes to one or more create-handles. Review comment: Improve the javadocs? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362295936 ## File path: hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/NonSortPartitioner.java ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.func.bulkinsert; + +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.spark.api.java.JavaRDD; + +public class NonSortPartitioner Review comment: More like `NonShufflingPartitioner`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362295839 ## File path: hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/BulkInsertInternalPartitioner.java ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.func.bulkinsert; + +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.table.UserDefinedBulkInsertPartitioner; + +public abstract class BulkInsertInternalPartitioner implements +UserDefinedBulkInsertPartitioner { Review comment: RDD API clients i.e Uber/marmaray may need to change the name of the interface implemented. It should be fine IMO. cc @bvaradar @n3nash This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362277905 ## File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java ## @@ -77,6 +77,10 @@ private static final String DEFAULT_HOODIE_WRITE_STATUS_CLASS = WriteStatus.class.getName(); private static final String FINALIZE_WRITE_PARALLELISM = "hoodie.finalize.write.parallelism"; private static final String DEFAULT_FINALIZE_WRITE_PARALLELISM = DEFAULT_PARALLELISM; + private static final String BULKINSERT_SORT_ENABLED = "hoodie.bulkinsert.sort.enable"; Review comment: instead of sorting vs non-sorting.. and having two configs.. Can we just have `hoodie.bulkinsert.write.mode` as the single config, which supports following values - `NONE` (No sorting/shuffling) (default) - `GLOBALLY_SORTED` - `PARTITION_SORTED` might be much easier to reason about for end user This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362295790 ## File path: hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/BulkInsertInternalPartitioner.java ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.func.bulkinsert; + +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.table.UserDefinedBulkInsertPartitioner; + +public abstract class BulkInsertInternalPartitioner implements +UserDefinedBulkInsertPartitioner { Review comment: We can rename UserDefinedBulkInsertPartitioner to just `BulkInsertPartitioner` and reuse it I think? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362278533 ## File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java ## @@ -367,20 +370,30 @@ public static SparkConf registerClasses(SparkConf conf) { } } + private BulkInsertMapFunction getBulkInsertMapFunction( + boolean isSorted, String commitTime, HoodieWriteConfig config, HoodieTable hoodieTable, + List fileIDPrefixes) { +if (isSorted) { + return new BulkInsertMapFunctionForSortedRecords( + commitTime, config, hoodieTable, fileIDPrefixes); +} +return new BulkInsertMapFunctionForNonSortedRecords( +commitTime, config, hoodieTable, fileIDPrefixes); + } + private JavaRDD bulkInsertInternal(JavaRDD> dedupedRecords, String commitTime, HoodieTable table, Option bulkInsertPartitioner) { Review comment: does it make sense to still have a separate `UserDefinedBulkInsertPartitioner` ? could we just combine this into `BulkInsertInternalPartitioner` ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362295974 ## File path: hudi-client/src/main/java/org/apache/hudi/table/UserDefinedBulkInsertPartitioner.java ## @@ -31,4 +31,6 @@ public interface UserDefinedBulkInsertPartitioner { JavaRDD> repartitionRecords(JavaRDD> records, int outputSparkPartitions); + + boolean arePartitionRecordsSorted(); Review comment: please improve javadocs of this interface while you are here :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services