[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-03-01 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r386218656
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -381,20 +384,30 @@ public static SparkConf registerClasses(SparkConf conf) {
 }
   }
 
+  private BulkInsertMapFunction getBulkInsertMapFunction(
 
 Review comment:
   Lets do whats easier and gets us moving faster .. :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-02-06 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r376236746
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -381,20 +384,30 @@ public static SparkConf registerClasses(SparkConf conf) {
 }
   }
 
+  private BulkInsertMapFunction getBulkInsertMapFunction(
 
 Review comment:
   I was looking at this for a different JIRA. but having a class for 
BulkInsertMapfunction seems like an overkill? Can we just replace with lambdas 
and kill the class ?
   
   eg, in the code on master, just change to below
   
   ```
JavaRDD writeStatusRDD = repartitionedRecords
   .mapPartitionsWithIndex((partition, recordItr) ->
   new CopyOnWriteLazyInsertIterable<>(recordItr, config, 
commitTime, table, fileIDPrefixes.get(partition)), true)
   .flatMap(List::iterator);
   
   ```
   
   if you agree, just do it in the PR.. that way I wont introduce an 
unnecessary rebase for you :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-02-06 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r376236783
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -381,20 +384,30 @@ public static SparkConf registerClasses(SparkConf conf) {
 }
   }
 
+  private BulkInsertMapFunction getBulkInsertMapFunction(
 
 Review comment:
   cc @yihua  


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-02-03 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r374512864
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/BulkInsertInternalPartitioner.java
 ##
 @@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.func.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.table.UserDefinedBulkInsertPartitioner;
+
+public abstract class BulkInsertInternalPartitioner implements
+UserDefinedBulkInsertPartitioner {
 
 Review comment:
   I was trying to generalize the existing interface, since what we are doing 
here is not userdefined anyway.. 
   
   >but it might be best to avoid a breaking change since it's simple in this 
case ?
   
   this is also fair.. so we can leave the current interface as is and you can 
just extend/implement it.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-01-01 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362366288
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -367,20 +370,30 @@ public static SparkConf registerClasses(SparkConf conf) {
 }
   }
 
+  private BulkInsertMapFunction getBulkInsertMapFunction(
+  boolean isSorted, String commitTime, HoodieWriteConfig config, 
HoodieTable hoodieTable,
+  List fileIDPrefixes) {
+if (isSorted) {
+  return new BulkInsertMapFunctionForSortedRecords(
+  commitTime, config, hoodieTable, fileIDPrefixes);
+}
+return new BulkInsertMapFunctionForNonSortedRecords(
+commitTime, config, hoodieTable, fileIDPrefixes);
+  }
+
   private JavaRDD bulkInsertInternal(JavaRDD> 
dedupedRecords, String commitTime,
   HoodieTable table, Option 
bulkInsertPartitioner) {
 
 Review comment:
   This was added by @ovj and uber/marmaray project passes in an implementation 
using the RDD api 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2019-12-31 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362295734
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/func/CopyOnWriteInsertHandler.java
 ##
 @@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.func;
+
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer;
+import org.apache.hudi.config.HoodieWriteConfig;
+import 
org.apache.hudi.func.CopyOnWriteLazyInsertIterable.HoodieInsertValueGenResult;
+import org.apache.hudi.io.HoodieCreateHandle;
+import org.apache.hudi.io.HoodieWriteHandle;
+import org.apache.hudi.table.HoodieTable;
+
+/**
+ * Consumes stream of hoodie records from in-memory queue and writes to one or 
more create-handles.
 
 Review comment:
   Improve the javadocs? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2019-12-31 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362295936
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/NonSortPartitioner.java
 ##
 @@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.func.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.spark.api.java.JavaRDD;
+
+public class NonSortPartitioner
 
 Review comment:
   More like `NonShufflingPartitioner`? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2019-12-31 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362295839
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/BulkInsertInternalPartitioner.java
 ##
 @@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.func.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.table.UserDefinedBulkInsertPartitioner;
+
+public abstract class BulkInsertInternalPartitioner implements
+UserDefinedBulkInsertPartitioner {
 
 Review comment:
   RDD API clients i.e Uber/marmaray  may need to change the name of the 
interface implemented. It should be fine IMO. cc @bvaradar @n3nash 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2019-12-31 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362277905
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##
 @@ -77,6 +77,10 @@
   private static final String DEFAULT_HOODIE_WRITE_STATUS_CLASS = 
WriteStatus.class.getName();
   private static final String FINALIZE_WRITE_PARALLELISM = 
"hoodie.finalize.write.parallelism";
   private static final String DEFAULT_FINALIZE_WRITE_PARALLELISM = 
DEFAULT_PARALLELISM;
+  private static final String BULKINSERT_SORT_ENABLED = 
"hoodie.bulkinsert.sort.enable";
 
 Review comment:
   instead of sorting vs non-sorting.. and having two configs.. Can we just 
have `hoodie.bulkinsert.write.mode` as the single config, which supports 
following values 
   
   - `NONE` (No sorting/shuffling) (default)
   - `GLOBALLY_SORTED` 
   - `PARTITION_SORTED`
   
   might be much easier to reason about for end user


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2019-12-31 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362295790
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/func/bulkinsert/BulkInsertInternalPartitioner.java
 ##
 @@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.func.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.table.UserDefinedBulkInsertPartitioner;
+
+public abstract class BulkInsertInternalPartitioner implements
+UserDefinedBulkInsertPartitioner {
 
 Review comment:
   We can rename UserDefinedBulkInsertPartitioner to just 
`BulkInsertPartitioner` and reuse it I think?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2019-12-31 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362278533
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -367,20 +370,30 @@ public static SparkConf registerClasses(SparkConf conf) {
 }
   }
 
+  private BulkInsertMapFunction getBulkInsertMapFunction(
+  boolean isSorted, String commitTime, HoodieWriteConfig config, 
HoodieTable hoodieTable,
+  List fileIDPrefixes) {
+if (isSorted) {
+  return new BulkInsertMapFunctionForSortedRecords(
+  commitTime, config, hoodieTable, fileIDPrefixes);
+}
+return new BulkInsertMapFunctionForNonSortedRecords(
+commitTime, config, hoodieTable, fileIDPrefixes);
+  }
+
   private JavaRDD bulkInsertInternal(JavaRDD> 
dedupedRecords, String commitTime,
   HoodieTable table, Option 
bulkInsertPartitioner) {
 
 Review comment:
   does it make sense to still have a separate 
`UserDefinedBulkInsertPartitioner` ? could we just combine this into 
`BulkInsertInternalPartitioner` ? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2019-12-31 Thread GitBox
vinothchandar commented on a change in pull request #1149: [WIP] [HUDI-472] 
Introduce configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#discussion_r362295974
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/UserDefinedBulkInsertPartitioner.java
 ##
 @@ -31,4 +31,6 @@
 public interface UserDefinedBulkInsertPartitioner {
 
   JavaRDD> repartitionRecords(JavaRDD> 
records, int outputSparkPartitions);
+
+  boolean arePartitionRecordsSorted();
 
 Review comment:
   please improve javadocs of this interface while you are here :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services