[GitHub] [hudi] nsivabalan commented on a diff in pull request #5462: [HUDI-3995] Making pref optimizations for bulk insert row writer path

2022-05-06 Thread GitBox


nsivabalan commented on code in PR #5462:
URL: https://github.com/apache/hudi/pull/5462#discussion_r866736485


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecord.java:
##
@@ -42,7 +42,9 @@
   public static final String OPERATION_METADATA_FIELD = "_hoodie_operation";
   public static final String HOODIE_IS_DELETED = "_hoodie_is_deleted";
 
-  public static int FILENAME_METADATA_FIELD_POS = 4;
+  public static int RECORD_KEY_METAD_FIELD_POS = 2;

Review Comment:
   ```
   public static int RECORD_KEY_META_FIELD_POS = HOODIE_META_COLUMNS_NAME_TO_POS.get(RECORD_KEY_METADATA_FIELD);
   public static int PARTITION_PATH_META_FIELD_POS = HOODIE_META_COLUMNS_NAME_TO_POS.get(PARTITION_PATH_METADATA_FIELD);
   public static int FILENAME_META_FIELD_POS = HOODIE_META_COLUMNS_NAME_TO_POS.get(FILENAME_METADATA_FIELD);
   ```
   was this your proposal?
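
   For context, a self-contained sketch of how such positions can be derived from an ordered name-to-position map, so they can never drift out of sync with the column ordering. The map and class below are stand-ins for `HoodieRecord.HOODIE_META_COLUMNS_NAME_TO_POS`, not Hudi's actual code; the field names and their order mirror Hudi's documented meta columns:

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Hypothetical stand-in for HoodieRecord: derive positional constants
   // from an insertion-ordered map instead of hardcoding integers.
   public class MetaFieldPositions {
     public static final String COMMIT_TIME_METADATA_FIELD = "_hoodie_commit_time";
     public static final String COMMIT_SEQNO_METADATA_FIELD = "_hoodie_commit_seqno";
     public static final String RECORD_KEY_METADATA_FIELD = "_hoodie_record_key";
     public static final String PARTITION_PATH_METADATA_FIELD = "_hoodie_partition_path";
     public static final String FILENAME_METADATA_FIELD = "_hoodie_file_name";

     // LinkedHashMap preserves insertion order, so position == insertion index.
     public static final Map<String, Integer> HOODIE_META_COLUMNS_NAME_TO_POS = new LinkedHashMap<>();
     static {
       String[] cols = {COMMIT_TIME_METADATA_FIELD, COMMIT_SEQNO_METADATA_FIELD,
           RECORD_KEY_METADATA_FIELD, PARTITION_PATH_METADATA_FIELD, FILENAME_METADATA_FIELD};
       for (int i = 0; i < cols.length; i++) {
         HOODIE_META_COLUMNS_NAME_TO_POS.put(cols[i], i);
       }
     }

     // Positions looked up from the map, as the review comment proposes.
     public static final int RECORD_KEY_META_FIELD_POS = HOODIE_META_COLUMNS_NAME_TO_POS.get(RECORD_KEY_METADATA_FIELD);
     public static final int PARTITION_PATH_META_FIELD_POS = HOODIE_META_COLUMNS_NAME_TO_POS.get(PARTITION_PATH_METADATA_FIELD);
     public static final int FILENAME_META_FIELD_POS = HOODIE_META_COLUMNS_NAME_TO_POS.get(FILENAME_METADATA_FIELD);

     public static void main(String[] args) {
       System.out.println(RECORD_KEY_META_FIELD_POS);     // 2
       System.out.println(PARTITION_PATH_META_FIELD_POS); // 3
       System.out.println(FILENAME_META_FIELD_POS);       // 4
     }
   }
   ```

   This keeps the `2`/`3`/`4` literals seen elsewhere in the diff consistent with a single source of truth.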



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #5462: [HUDI-3995] Making pref optimizations for bulk insert row writer path

2022-04-28 Thread GitBox


nsivabalan commented on code in PR #5462:
URL: https://github.com/apache/hudi/pull/5462#discussion_r861390575


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/RowKeyGeneratorHelper.java:
##
@@ -234,13 +237,14 @@ public static Object getNestedFieldVal(Row row, List positions) {
    * @param structType  schema of interest
    * @param field   field of interest for which the positions are requested for
    * @param isRecordKey {@code true} if the field requested for is a record key. {@code false} in case of a partition path.
-   * @return the positions of the field as per the struct type.
+   * @return the positions of the field as per the struct type and the root fields datatype.
    */
-  public static List getNestedFieldIndices(StructType structType, String field, boolean isRecordKey) {
+  public static Pair, DataType> getNestedFieldSchemaInfo(StructType structType, String field, boolean isRecordKey) {

Review Comment:
   we don't allow nested fields with the row writer code path when meta fields are disabled. This is the only code path where we fetch data directly from InternalRow; otherwise, we use Spark SQL Row to fetch data. For these reasons, this method has certain additional constraints for record keys that are not applicable to partition path fields.
   
   I am open to discussing whether we can relax the constraint on meta field disabling for nested fields too, but we can take that up in a follow-up PR.
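   
   To illustrate the constraint, here is a toy model (not Hudi's actual `RowKeyGeneratorHelper`; the `Field` type below is a hypothetical stand-in for Spark's `StructType`): nested paths resolve to one position per level, but record keys are rejected if nested, since they are read positionally from InternalRow.
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;
   
   public class NestedFieldConstraint {
   
     // Minimal struct node: a named field with optional child fields.
     static final class Field {
       final String name;
       final List<Field> children;
       Field(String name, Field... children) {
         this.name = name;
         this.children = Arrays.asList(children);
       }
     }
   
     // Resolve a (possibly dot-separated) field into per-level positions.
     static List<Integer> getFieldPositions(List<Field> schema, String field, boolean isRecordKey) {
       String[] parts = field.split("\\.");
       if (isRecordKey && parts.length > 1) {
         // Record keys are fetched directly from InternalRow, so nesting is rejected.
         throw new IllegalArgumentException("Nested fields not supported for record keys: " + field);
       }
       List<Integer> positions = new ArrayList<>();
       List<Field> level = schema;
       for (String part : parts) {
         int idx = -1;
         for (int i = 0; i < level.size(); i++) {
           if (level.get(i).name.equals(part)) {
             idx = i;
             break;
           }
         }
         if (idx < 0) {
           throw new IllegalArgumentException("Field not found: " + part);
         }
         positions.add(idx);
         level = level.get(idx).children; // descend one level
       }
       return positions;
     }
   
     public static void main(String[] args) {
       List<Field> schema = Arrays.asList(
           new Field("id"),
           new Field("info", new Field("city"), new Field("zip")));
       System.out.println(getFieldPositions(schema, "id", true));        // [0]
       System.out.println(getFieldPositions(schema, "info.zip", false)); // [1, 1]
     }
   }
   ```
   
   With this model, `getFieldPositions(schema, "info.city", true)` throws, while the same path is fine for a partition path field.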
   
   



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##
@@ -97,88 +98,69 @@ public String getPartitionPath(Row row) {
   @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
   public String getPartitionPath(InternalRow internalRow, StructType structType) {
     try {
-      initDeserializer(structType);
-      Row row = sparkRowSerDe.deserializeRow(internalRow);
-      return getPartitionPath(row);
+      buildFieldSchemaInfoIfNeeded(structType);
+      return RowKeyGeneratorHelper.getPartitionPathFromInternalRow(internalRow, getPartitionPathFields(),
+          hiveStylePartitioning, partitionPathSchemaInfo);
     } catch (Exception e) {
       throw new HoodieIOException("Conversion of InternalRow to Row failed with exception " + e);
     }
   }
 
-  private void initDeserializer(StructType structType) {
-    if (sparkRowSerDe == null) {
-      sparkRowSerDe = HoodieSparkUtils.getDeserializer(structType);
-    }
-  }
-
-  void buildFieldPositionMapIfNeeded(StructType structType) {
+  void buildFieldSchemaInfoIfNeeded(StructType structType) {
     if (this.structType == null) {
       // parse simple fields
       getRecordKeyFields().stream()
          .filter(f -> !(f.contains(".")))
          .forEach(f -> {
            if (structType.getFieldIndex(f).isDefined()) {
-              recordKeyPositions.put(f, Collections.singletonList((Integer) (structType.getFieldIndex(f).get())));
+              int fieldIndex = (int) structType.getFieldIndex(f).get();

Review Comment:
   That's due to how the code evolved; we can fix it now.



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:
##
@@ -107,16 +107,15 @@ public HoodieRowCreateHandle(HoodieTable table, HoodieWriteConfig writeConfig, S
   /**
    * Writes an {@link InternalRow} to the underlying HoodieInternalRowFileWriter. Before writing, value for meta columns are computed as required
    * and wrapped in {@link HoodieInternalRow}. {@link HoodieInternalRow} is what gets written to HoodieInternalRowFileWriter.
+   *
    * @param record instance of {@link InternalRow} that needs to be written to the fileWriter.
    * @throws IOException
    */
   public void write(InternalRow record) throws IOException {
     try {
-      String partitionPath = record.getUTF8String(HoodieRecord.HOODIE_META_COLUMNS_NAME_TO_POS.get(
-          HoodieRecord.PARTITION_PATH_METADATA_FIELD)).toString();
+      String partitionPath = record.getUTF8String(3).toString();

Review Comment:
   Sorry, do you mean `final`? These are local variables, right?



##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##
@@ -386,12 +386,14 @@ public void validateTableProperties(Properties properties, WriteOperationType op
       throw new HoodieException(HoodieTableConfig.POPULATE_META_FIELDS.key() + " already disabled for the table. Can't be re-enabled back");
     }
 
-    // meta fields can be disabled only with SimpleKeyGenerator
-    if (!getTableConfig().populateMetaFields()
-        && !properties.getProperty(HoodieTableConfig.KEY_GENERATOR_CLASS_NAME.key(), "org.apache.hudi.keygen.SimpleKeyGenerator")
-            .equals("org.apache.hudi.keygen.SimpleKeyGenerator")) {
-      throw new HoodieException("Only simple key generator is supported when meta fields are disabled. KeyGenerator used : "
-          + properties.getProperty(HoodieTableConfig.KEY_GENERATOR_CLASS_NAME.key()));

[GitHub] [hudi] nsivabalan commented on a diff in pull request #5462: [HUDI-3995] Making pref optimizations for bulk insert row writer path

2022-04-28 Thread GitBox


nsivabalan commented on code in PR #5462:
URL: https://github.com/apache/hudi/pull/5462#discussion_r861359728


##
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java:
##
@@ -57,18 +61,18 @@ public class HoodieDatasetBulkInsertHelper {
 
   /**
    * Prepares input hoodie spark dataset for bulk insert. It does the following steps.
-   *  1. Uses KeyGenerator to generate hoodie record keys and partition path.
-   *  2. Add hoodie columns to input spark dataset.
-   *  3. Reorders input dataset columns so that hoodie columns appear in the beginning.
-   *  4. Sorts input dataset by hoodie partition path and record key
+   * 1. Uses KeyGenerator to generate hoodie record keys and partition path.
+   * 2. Add hoodie columns to input spark dataset.
+   * 3. Reorders input dataset columns so that hoodie columns appear in the beginning.
+   * 4. Sorts input dataset by hoodie partition path and record key
    *
    * @param sqlContext SQL Context
-   * @param config Hoodie Write Config
-   * @param rows Spark Input dataset
+   * @param config     Hoodie Write Config
+   * @param rows       Spark Input dataset
    * @return hoodie dataset which is ready for bulk insert.
    */
   public static Dataset prepareHoodieDatasetForBulkInsert(SQLContext sqlContext,
-      HoodieWriteConfig config, Dataset rows, String structName, String recordNamespace,
+                                                          HoodieWriteConfig config, Dataset rows, String structName, String recordNamespace,

Review Comment:
   Yeah, I get it. I want to share my patch with a user who was having a perf issue, and hence I've included everything I had.
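
   For readers following along, the four preparation steps listed in the javadoc above can be sketched in plain Java, using `List<Map<String, Object>>` as a stand-in for a Spark `Dataset<Row>`. All names here are illustrative, not Hudi's API:

   ```java
   import java.util.ArrayList;
   import java.util.Comparator;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   public class BulkInsertPrepSketch {
     public static List<Map<String, Object>> prepare(List<Map<String, Object>> rows,
                                                     String keyField, String partitionField) {
       List<Map<String, Object>> out = new ArrayList<>();
       for (Map<String, Object> row : rows) {
         // 1. "Key generation": derive record key and partition path from the row.
         String recordKey = String.valueOf(row.get(keyField));
         String partitionPath = String.valueOf(row.get(partitionField));
         // 2-3. Add hoodie meta columns, placed first via an insertion-ordered map.
         Map<String, Object> withMeta = new LinkedHashMap<>();
         withMeta.put("_hoodie_record_key", recordKey);
         withMeta.put("_hoodie_partition_path", partitionPath);
         withMeta.putAll(row);
         out.add(withMeta);
       }
       // 4. Sort by partition path, then record key.
       out.sort(Comparator
           .comparing((Map<String, Object> r) -> (String) r.get("_hoodie_partition_path"))
           .thenComparing(r -> (String) r.get("_hoodie_record_key")));
       return out;
     }

     public static void main(String[] args) {
       List<Map<String, Object>> rows = new ArrayList<>();
       Map<String, Object> r1 = new LinkedHashMap<>();
       r1.put("id", "b"); r1.put("region", "us"); rows.add(r1);
       Map<String, Object> r2 = new LinkedHashMap<>();
       r2.put("id", "a"); r2.put("region", "eu"); rows.add(r2);
       List<Map<String, Object>> prepared = prepare(rows, "id", "region");
       System.out.println(prepared.get(0).get("_hoodie_partition_path")); // "eu" sorts first
     }
   }
   ```

   The sort in step 4 is what gives bulk insert its file-sizing benefit: rows for the same partition land contiguously, so each parquet writer sees its partition's rows together.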






[GitHub] [hudi] nsivabalan commented on a diff in pull request #5462: [HUDI-3995] Making pref optimizations for bulk insert row writer path

2022-04-28 Thread GitBox


nsivabalan commented on code in PR #5462:
URL: https://github.com/apache/hudi/pull/5462#discussion_r861359423


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##
@@ -386,12 +386,14 @@ public void validateTableProperties(Properties properties, WriteOperationType op
       throw new HoodieException(HoodieTableConfig.POPULATE_META_FIELDS.key() + " already disabled for the table. Can't be re-enabled back");
     }
 
-    // meta fields can be disabled only with SimpleKeyGenerator
-    if (!getTableConfig().populateMetaFields()
-        && !properties.getProperty(HoodieTableConfig.KEY_GENERATOR_CLASS_NAME.key(), "org.apache.hudi.keygen.SimpleKeyGenerator")
-            .equals("org.apache.hudi.keygen.SimpleKeyGenerator")) {
-      throw new HoodieException("Only simple key generator is supported when meta fields are disabled. KeyGenerator used : "
-          + properties.getProperty(HoodieTableConfig.KEY_GENERATOR_CLASS_NAME.key()));
+    // meta fields can be disabled only with SimpleKeyGenerator, NonPartitioned and ComplexKeyGen.
+    if (!getTableConfig().populateMetaFields()) {

Review Comment:
   In spark-sql, the default key generator is ComplexKeyGenerator; that's why I had to make this fix. We need to revisit that and make SimpleKeyGenerator the default (when the user does not explicitly set one).
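
   A minimal sketch of the relaxed validation under discussion. The property key and `validate` method are simplified stand-ins, not Hudi's actual `HoodieTableMetaClient` API; only the key generator class names mirror Hudi's:

   ```java
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.Properties;
   import java.util.Set;

   public class KeyGenValidation {
     // Hypothetical config key; Hudi's real key lives in HoodieTableConfig.
     static final String KEY_GENERATOR_CLASS_NAME = "hoodie.table.keygenerator.class";
     static final String SIMPLE = "org.apache.hudi.keygen.SimpleKeyGenerator";

     // Allow-list replacing the old SimpleKeyGenerator-only check.
     static final Set<String> ALLOWED_WITHOUT_META_FIELDS = new HashSet<>(Arrays.asList(
         SIMPLE,
         "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
         "org.apache.hudi.keygen.ComplexKeyGenerator"));

     static void validate(boolean populateMetaFields, Properties props) {
       if (!populateMetaFields) {
         // Default to SimpleKeyGenerator when the user sets nothing.
         String keyGen = props.getProperty(KEY_GENERATOR_CLASS_NAME, SIMPLE);
         if (!ALLOWED_WITHOUT_META_FIELDS.contains(keyGen)) {
           throw new IllegalStateException(
               "Key generator not supported when meta fields are disabled: " + keyGen);
         }
       }
     }

     public static void main(String[] args) {
       Properties props = new Properties();
       props.setProperty(KEY_GENERATOR_CLASS_NAME, "org.apache.hudi.keygen.ComplexKeyGenerator");
       validate(false, props); // passes under the relaxed allow-list
       System.out.println("ok");
     }
   }
   ```

   With the old check, the spark-sql default (ComplexKeyGenerator) would fail here even for tables it works fine with; the allow-list keeps the guard while admitting the known-good generators.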


