[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-22 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r599288753



##
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##
@@ -112,12 +112,15 @@ private[hudi] object HoodieSparkSqlWriter {
 val archiveLogFolder = parameters.getOrElse(
   HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, "archived")
 
+val partitionColumns = parameters.getOrElse(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, null)
+
 val tableMetaClient = HoodieTableMetaClient.withPropertyBuilder()
   .setTableType(tableType)
   .setTableName(tblName)
   .setArchiveLogFolder(archiveLogFolder)
   .setPayloadClassName(parameters(PAYLOAD_CLASS_OPT_KEY))
   .setPreCombineField(parameters.getOrDefault(PRECOMBINE_FIELD_OPT_KEY, null))
+  .setPartitionColumns(partitionColumns)

Review comment:
   Thank you for reminding me about this.








[GitHub] [hudi] Sugamber commented on issue #2637: [SUPPORT] - Partial Update : update few columns of a table

2021-03-22 Thread GitBox


Sugamber commented on issue #2637:
URL: https://github.com/apache/hudi/issues/2637#issuecomment-804639531


   import java.io.IOException;

   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.generic.IndexedRecord;
   import org.apache.hudi.avro.HoodieAvroUtils;
   import org.apache.hudi.common.model.HoodieRecordPayload;
   import org.apache.hudi.common.util.Option;
   import org.apache.log4j.Logger;

   public class PartialColumnUpdate implements HoodieRecordPayload<PartialColumnUpdate> {

       private static final Logger logger = Logger.getLogger(PartialColumnUpdate.class);
       private byte[] recordBytes;
       private Schema schema;
       private Comparable orderingVal;

       public PartialColumnUpdate(GenericRecord genericRecord, Comparable orderingVal) {
           logger.info("Inside two parameter cons");
           try {
               if (genericRecord != null) {
                   this.recordBytes = HoodieAvroUtils.avroToBytes(genericRecord);
                   this.schema = genericRecord.getSchema();
                   this.orderingVal = orderingVal;
               } else {
                   this.recordBytes = new byte[0];
               }
           } catch (Exception io) {
               throw new RuntimeException("Cannot convert record to bytes ", io);
           }
       }

       public PartialColumnUpdate(Option<GenericRecord> record) {
           this(record.isPresent() ? record.get() : null, 0);
       }

       @Override
       public PartialColumnUpdate preCombine(PartialColumnUpdate anotherRecord) {
           logger.info("Inside PreCombine");
           logger.info("preCombine => " + anotherRecord);
           logger.info("another_ordering value " + anotherRecord.orderingVal);
           logger.info("another_schema value " + anotherRecord.schema);
           logger.info("another_record bytes value " + anotherRecord.recordBytes);
           // Keep the record with the higher ordering value.
           if (anotherRecord.orderingVal.compareTo(orderingVal) > 0) {
               return anotherRecord;
           } else {
               return this;
           }
       }

       @Override
       public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord indexedRecord,
               Schema currentSchema) throws IOException {
           logger.info("Inside combineAndGetUpdateValue");
           logger.info("current schema " + currentSchema);
           logger.info("combineUpdate -> " + Option.of(indexedRecord));
           getInsertValue(currentSchema);
           return Option.empty();
       }

       @Override
       public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
           logger.info("Inside getInsertValue");
           if (recordBytes.length == 0) {
               return Option.empty();
           }
           IndexedRecord indexedRecord = HoodieAvroUtils.bytesToAvro(recordBytes, schema);
           if (isDeleteRecord((GenericRecord) indexedRecord)) {
               return Option.empty();
           } else {
               return Option.of(indexedRecord);
           }
       }

       protected boolean isDeleteRecord(GenericRecord genericRecord) {
           final String isDeleteKey = "_hoodie_is_deleted";
           if (genericRecord.getSchema().getField(isDeleteKey) == null) {
               return false;
           }
           Object deleteMarker = genericRecord.get(isDeleteKey);
           return (deleteMarker instanceof Boolean && (boolean) deleteMarker);
       }
   }






[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-22 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r599287887



##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -179,6 +179,9 @@
  public static final String EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION = AVRO_SCHEMA + ".externalTransformation";
  public static final String DEFAULT_EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION = "false";

+  public static final String MAX_LISTING_PARALLELISM = "hoodie.max.list.file.parallelism";
+  public static final Integer DEFAULT_MAX_LISTING_PARALLELISM = 200;

Review comment:
   Good suggestions!








[GitHub] [hudi] cdmikechen commented on issue #2705: [SUPPORT] Can not read data schema using Spark3.0.2 on k8s with hudi-utilities (build in 2.12 and spark3)

2021-03-22 Thread GitBox


cdmikechen commented on issue #2705:
URL: https://github.com/apache/hudi/issues/2705#issuecomment-804636641


   I've found the problem:
   There is a new configuration named `hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable`, and it is `true` by default. If I use my custom transformer and set the `target schema` to null, Hudi will not work because of the null schema.
   I had set the `target schema` to the same as the `source schema` for testing, so Spark did not work and reported the above errors. If I set `hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable` to false, Hudi successfully deals with the Kafka messages and writes them to HDFS.
   
   However, when synchronizing Hive, I encountered the same problem as https://github.com/apache/hudi/issues/1751#issuecomment-648460431. I think Hudi is still missing the related packages for Hive 3.
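   For anyone hitting the same thing, the workaround described above boils down to one line in the DeltaStreamer properties file (a sketch; the property name is exactly as quoted above):
   
   hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable=false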






[GitHub] [hudi] Sugamber commented on issue #2637: [SUPPORT] - Partial Update : update few columns of a table

2021-03-22 Thread GitBox


Sugamber commented on issue #2637:
URL: https://github.com/apache/hudi/issues/2637#issuecomment-804636424


   I have created a class implementing HoodieRecordPayload. There are three methods for which we have to write our logic:
   1. preCombine
   2. combineAndGetUpdateValue
   3. getInsertValue
   @n3nash  As per your explanation above, preCombine provides the current record coming in the incremental load, and combineAndGetUpdateValue provides the latest records from the Hudi table. Please correct me if my understanding is incorrect.
   
   In my use case, I'm only getting a few columns out of 20 in the incremental data, and the preCombine method does not have any schema details.
   For example: a Hudi table is built with 20 columns. Now the requirement is to update only 3 columns, and only those columns' data comes in the incremental data feed, along with the RECORDKEY_FIELD_OPT_KEY, PARTITIONPATH_FIELD_OPT_KEY and PRECOMBINE_FIELD_OPT_KEY columns.
   I have implemented the class as below. Please let me know in which method I'll be getting the full schema of the table.
   
   import java.io.IOException;

   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.generic.IndexedRecord;
   import org.apache.hudi.avro.HoodieAvroUtils;
   import org.apache.hudi.common.model.HoodieRecordPayload;
   import org.apache.hudi.common.util.Option;
   import org.apache.log4j.Logger;

   public class PartialColumnUpdate implements HoodieRecordPayload<PartialColumnUpdate> {

       private static final Logger logger = Logger.getLogger(PartialColumnUpdate.class);
       private byte[] recordBytes;
       private Schema schema;
       private Comparable orderingVal;

       public PartialColumnUpdate(GenericRecord genericRecord, Comparable orderingVal) {
           logger.info("Inside two parameter cons");
           try {
               if (genericRecord != null) {
                   this.recordBytes = HoodieAvroUtils.avroToBytes(genericRecord);
                   this.schema = genericRecord.getSchema();
                   this.orderingVal = orderingVal;
               } else {
                   this.recordBytes = new byte[0];
               }
           } catch (Exception io) {
               throw new RuntimeException("Cannot convert record to bytes ", io);
           }
       }

       public PartialColumnUpdate(Option<GenericRecord> record) {
           this(record.isPresent() ? record.get() : null, 0);
       }

       @Override
       public PartialColumnUpdate preCombine(PartialColumnUpdate anotherRecord) {
           logger.info("Inside PreCombine");
           logger.info("preCombine => " + anotherRecord);
           logger.info("another_ordering value " + anotherRecord.orderingVal);
           logger.info("another_schema value " + anotherRecord.schema);
           logger.info("another_record bytes value " + anotherRecord.recordBytes);
           // Keep the record with the higher ordering value.
           if (anotherRecord.orderingVal.compareTo(orderingVal) > 0) {
               return anotherRecord;
           } else {
               return this;
           }
       }

       @Override
       public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord indexedRecord,
               Schema currentSchema) throws IOException {
           logger.info("Inside combineAndGetUpdateValue");
           logger.info("current schema " + currentSchema);
           logger.info("combineUpdate -> " + Option.of(indexedRecord));
           getInsertValue(currentSchema);
           return Option.empty();
       }

       @Override
       public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
           logger.info("Inside getInsertValue");
           if (recordBytes.length == 0) {
               return Option.empty();
           }
           IndexedRecord indexedRecord = HoodieAvroUtils.bytesToAvro(recordBytes, schema);
           if (isDeleteRecord((GenericRecord) indexedRecord)) {
               return Option.empty();
           } else {
               return Option.of(indexedRecord);
           }
       }

       protected boolean isDeleteRecord(GenericRecord genericRecord) {
           final String isDeleteKey = "_hoodie_is_deleted";
           if (genericRecord.getSchema().getField(isDeleteKey) == null) {
               return false;
           }
           Object deleteMarker = genericRecord.get(isDeleteKey);
           return (deleteMarker instanceof Boolean && (boolean) deleteMarker);
       }
   }
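   For context, a minimal sketch of how a custom payload like this gets wired into a Spark datasource write; `hoodie.datasource.write.payload.class` selects the payload implementation, and the field names and base path below are hypothetical:
   
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;

   class PayloadWiringSketch {
       // Upserts df through the custom payload; "id"/"updated_at" are placeholder columns.
       static void upsertWithPartialUpdate(Dataset<Row> df, String basePath) {
           df.write().format("hudi")
               .option("hoodie.datasource.write.operation", "upsert")
               .option("hoodie.datasource.write.payload.class", PartialColumnUpdate.class.getName())
               .option("hoodie.datasource.write.recordkey.field", "id")
               .option("hoodie.datasource.write.precombine.field", "updated_at")
               .option("hoodie.table.name", "my_table")
               .mode(SaveMode.Append)
               .save(basePath);
       }
   }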
   






[jira] [Commented] (HUDI-57) [UMBRELLA] Support ORC Storage

2021-03-22 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306779#comment-17306779
 ] 

Vinoth Chandar commented on HUDI-57:


[~pwason] can you please update this JIRA? Also should we assign this to the 
intern? 

> [UMBRELLA] Support ORC Storage
> --
>
> Key: HUDI-57
> URL: https://issues.apache.org/jira/browse/HUDI-57
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Mani Jindal
>Priority: Major
>  Labels: hudi-umbrellas, pull-request-available
> Fix For: 0.8.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/uber/hudi/issues/68]
> https://github.com/uber/hudi/issues/155





svn commit: r46705 - /dev/hudi/KEYS

2021-03-22 Thread sivabalan
Author: sivabalan
Date: Tue Mar 23 04:50:02 2021
New Revision: 46705

Log:
Updating Gary's gpg key

Modified:
dev/hudi/KEYS

Modified: dev/hudi/KEYS
==
--- dev/hudi/KEYS (original)
+++ dev/hudi/KEYS Tue Mar 23 04:50:02 2021
@@ -601,3 +601,64 @@ ECppJfvmGRuNapsZ+KCXiY2wjnM9/EopD5Nsr3E7
 9ELkv7No+gWT7/64sox1Zo03duuWYR8bGpCJIcd6Qn99dPZSr59o8TGkrPU=
 =gJ2E
 -END PGP PUBLIC KEY BLOCK-
+
+pub   rsa4096 2021-03-22 [SC] [expires: 2025-03-22]
+  E2A9714E0FBA3A087BDEE655E72873D765D6C406
+uid   [ultimate] YanJia Li 
+sig 3E72873D765D6C406 2021-03-22  YanJia Li 
+sub   rsa4096 2021-03-22 [E] [expires: 2025-03-22]
+sig  E72873D765D6C406 2021-03-22  YanJia Li 
+
+-BEGIN PGP PUBLIC KEY BLOCK-
+
+mQINBGBYnFQBEADJfxkjdOufvAOu7yP1Q1wiM+FGQIcaFb7mydFc3/PpQwqxAPoS
+GlcorwkTMCdqKSxR4+p5B9xnfux9qXOydoKof0srhMLudD7lxZa6xAn0OeC2jeqk
+mFhXhw2/r+iuon9x7Rzts0HY7XvM3juQpTNa1cOi2jTsALpOyo2qDhPwNc7MNasC
+0OKuE0UwGfcDpd9TILIvOlssTNyHcYumavcDZBW9eZMpGF4jASPQzQ0iXnEAHyEr
+I55z9q760qNfAW72SO6vKBJZZVWUoCepGzOaB9VaX3fcYdfuOEm4bfKi4qEEEUaF
+aOeAo5jMbu+fhSDPBqfvthRyJitmit4rq49ijXJlwU8++mAEDUcLZ7SNMfnMht/N
+NazDmz5wXjFcbyKmaYAkQ/Q+7M161QsGLFq3WGmFej1Yv/nCo3tfM3j3aEc74jzR
+ylUQQQE+alJwVdN4CJ5SkyBtjBWMTbSHHagRlFoxnLUSktCOTM31vGVIoi/DrSdD
+Opxy6BatTIcUcrEW+XRkqeApmiBS6Oss6H0I0qBQJZL+o5F0wT8lrrwioy/qEzR1
+pgmtccHm4TBfa21CJDyNp8+VqM99fteM57dxBwHerR7vGlRfBjNY/s9SeUwKiNZw
+L7pmyQfhWXAN3m88xutpKoGpKwSL5S1rnvJl8N0dqeThSzZOB4i6zjUhQQARAQAB
+tB1ZYW5KaWEgTGkgPGdhcnlsaUBhcGFjaGUub3JnPokCVAQTAQgAPhYhBOKpcU4P
+ujoIe97mVecoc9dl1sQGBQJgWJxUAhsDBQkHhh8LBQsJCAcCBhUKCQgLAgQWAgMB
+Ah4BAheAAAoJEOcoc9dl1sQG9BIQAMJCp6lS5ycQXDE83XL/VaVO8iPIWiZySd7P
+Hf/XKab/kFIsXbAPrR5pPkcL8DzlarvklY7tTWfgzgY3yhh5L42eAdgH10Na1JWg
+x/JbBGea4I89v8lRMqAcslSmts9TyCZv4aRwwV9bwf9Y7b3WGXrd4gv8fd2XZtfH
+7pNNPg/B5XiWfTOQkV0S6I5lnpvgrNed3+BRJn+jYZrLLIlhPck4vShLtCnjm7TR
+XNrDilRxpSzs0d8Fzgp+paWuMX+W47CzKnRZGyISQ/KJfBlacEirNEyDy+j4P3er
+Pyn77QSFoBVM3SbM4wY40P+SW6bTblY+3ntO4Shb/2USb3J+w8jmwzkUXwmljgMD
+ojfvQa3rO5rPfPItdaRRtEH9YQvcYdZtnG7NwRRCc8SoqeJfsqYYEo6Iw8JVJqw9
++CIBQKie5z7/iS2/DEG4lQx57VzMURdZOoFUOvw6MEdqBqlMmwJyqG3caIXW9f4i
+T4TSQCr9M0ziJELCZSHBcJ5W6fB+bhYeRsZer32tcIQTnODuNrgZic7gngLsTC4E
+nl1Z6lVNczb0aQ0oGRBVb5dNROdKdSrk8hCyP4MQa2rJ4KenwrL0eyhUiIC8Bg5Q
+lLGErxnNP/cTPufTgzMcvVQ0PYliyOlGEUndOw0pFORg/xi1RzCNJQIi/k8bNoEj
+6u/gRkIRuQINBGBYnFQBEACzmYLb2UhAnG+Q059H1iTQbLSWektYx1WQY5Q9YjCG
+hwwimY1D2ePqVI2OfSwY/aAyM14t70LOeZFG3JtjE7wzuIltGTIPBiVIJRDdIeJv
+kWImZw4vbN+kMBhBnQwr7U3KNdwuD70MxHjCuQ2LFmP+Jb/Sv/6/kRr+s42PQAbL
+qH0FAA8Qcv99gg/dEC0uOOHB4UE7jEEDqIbedkk6GUy9YtgplbDlk+L1I0sqh7vu
+UI2bO42C1jDrNqgKJ4yQNNswriU3iri/i+0kwEFA8/oNIxdVpGiZrfBuwxmNTE4X
+A9dCrDpPGIs/gKS/vaykEqHdgi33D1DGtUHuCbUNalb7Er/22PhPbeZkK0sqpuKl
+u7AYdJpE1PNIiR6qMdBpQlY9F76BAwNq6gWx3eSrsgXWJ5ar5pLlhbF5j6sPjF/i
+DKwBKxY0fSLBoCv2abEgzvkjUwYxUsTXkKrEx0rYUbia2WuP95VjiTrrPs0LQrgp
+KG3Zn1FEOHqEN6kY3GeeBIMAnepYiCNS3WaD7RIlRKAkzVC82pYG71tP1/meDXTV
+LsGOjQrIDQpLDxtiG1rwRin6tzjKcUDBfIi4y1czAHjzEnx9uHCWvbEZ6KlTI35I
+SWFdLoFf2QDkmA6dgC4i0+emP4bZHRMLd7JRo+1ozTZ+hDq+z3QJJZacIH4a02vl
+eQARAQABiQI8BBgBCAAmFiEE4qlxTg+6Ogh73uZV5yhz12XWxAYFAmBYnFQCGwwF
+CQeGHwsACgkQ5yhz12XWxAbb3w//ai7iGR7WL2Wh6OvXICtS2WxAnXHu8XOsl91f
+tf0gx6oTWI0u2VbSqJDKJG5rbUPXyCmJbG32eq3PjTYWS0jT2kQFqkWQ5wX6AqZp
+lVNkT0GmmBuHRA71sp1PUHK2DaVDmHaTDncSvcdzDra8d0+/ANZ8licZlXF8D9rz
+9zGnxU/mbZ98xUJcVK3w8yea98bTV2cQLlTgYjLfmFoA/a8zyeuIotTUCELIA0Wq
+sAs8b0ORVm9Hk4G1q5eBem1FY8CzQQvVngrMUTOZdj1f0KXmM1Vii+T8eU8ukT82
+bScU7YRmO/XdMNaijmrsHmdP1ybW2KuP16m3ZIxXUu/mD6HIYCIFrIuin425E2kT
+hSh7xyZQMGRyJ+HlzUKm4d8Mg05SmErDaA+4APN5F6lP47ED0kT8RkRGmGBWWaZU
+sHbsjj6WYsEAVUcxErn+DelSS31j9P+8sCyI4Yi9/1IAr5VvYQXrrH3veCRHVjQZ
+KK/zA7lkrn26yldsuZXq4DArTmFUhCwRSNDEQgcfh/HOpmT8r7WZEGRBb99xXLyY
+HJzrVHbpIPxUpvBFld41Eyepuoij+pY7zyb/mCk5KMPEVK4XyYG9PpuPdER7EFzo
+K5pVT2a4wL0e0/ekCsGfbEn+2xubSrfWZ+M3YoIlX6uVykQrKH+NjoUlLuqv7PVF
+wV4zPZQ=
+=0iU3
+-END PGP PUBLIC KEY BLOCK-
+




[GitHub] [hudi] Sugamber commented on issue #2637: [SUPPORT] - Partial Update : update few columns of a table

2021-03-22 Thread GitBox


Sugamber commented on issue #2637:
URL: https://github.com/apache/hudi/issues/2637#issuecomment-804608999


   @nsivabalan, I had created a shaded jar, and it was causing the issues, as a few dependency versions were conflicting.






[GitHub] [hudi] umehrot2 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-22 Thread GitBox


umehrot2 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r599160889



##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -179,6 +179,9 @@
  public static final String EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION = AVRO_SCHEMA + ".externalTransformation";
  public static final String DEFAULT_EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION = "false";

+  public static final String MAX_LISTING_PARALLELISM = "hoodie.max.list.file.parallelism";
+  public static final Integer DEFAULT_MAX_LISTING_PARALLELISM = 200;

Review comment:
   - I think it's fine to use `DEFAULT_PARALLELISM`, i.e. `1500`, as the default. It is what we use in `FileSystemBackedTableMetadata` as well.
   - We should add a method here to get this configuration, just like all other configurations.
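   A sketch of the accessor being suggested, assuming the usual HoodieWriteConfig pattern of parsing values out of `props` (the method name is hypothetical):
   
   public int getMaxListingParallelism() {
       // Fall back to the default when the property is not set explicitly.
       return Integer.parseInt(props.getProperty(MAX_LISTING_PARALLELISM,
           String.valueOf(DEFAULT_MAX_LISTING_PARALLELISM)));
   }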

##
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##
@@ -0,0 +1,317 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.{HoodieMetadataConfig, SerializableConfiguration}
+import org.apache.hudi.common.engine.HoodieLocalEngineContext
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.{InternalRow, expressions}
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.avro.SchemaConverters
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BoundReference, Expression, InterpretedPredicate}
+import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils}
+import org.apache.spark.sql.execution.datasources.{FileIndex, PartitionDirectory, PartitionUtils}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+  * A FileIndex which supports partition pruning for hoodie snapshot and read-optimized
+  * queries.
+  * Main steps to get the file list for a query:
+  * 1. Load all files and partition values from the table path.
+  * 2. Do the partition pruning with the partition filter condition.
+  *
+  * There are 3 cases for this:
+  * 1. If the number of partition columns equals the actual partition path level, we
+  * read it as a partitioned table. (e.g. the partition column is "dt" and the partition
+  * path is "2021-03-10")
+  *
+  * 2. If the number of partition columns does not equal the partition path level, but
+  * there is exactly one partition column (e.g. the partition column is "dt", but the
+  * partition path is "2021/03/10", whose directory level is 3), we can still read it as
+  * a partitioned table. We map the partition path (e.g. 2021/03/10) to the only
+  * partition column (e.g. "dt").
+  *
+  * 3. Otherwise, when the number of partition columns does not equal the partition
+  * directory level and there is more than one partition column (e.g. the partition
+  * columns are "dt,hh" and the partition path is "2021/03/10/12"), we read it as a
+  * non-partitioned table, because we cannot know how to map the partition path to the
+  * partition columns in this case.
+  */
+case class HoodieFileIndex(
+    spark: SparkSession,
+    basePath: String,
+    schemaSpec: Option[StructType],
+    options: Map[String, String])
+  extends FileIndex with Logging {
+
+  @transient private val hadoopConf = spark.sessionState.newHadoopConf()
+  private lazy val metaClient = HoodieTableMetaClient
+    .builder().setConf(hadoopConf).setBasePath(basePath).build()
+
+  @transient private val queryPath = new Path(options.getOrElse("path", "'path' option required"))
+  /**
+    * Get the schema of the

[GitHub] [hudi] shenbinglife commented on issue #2689: [SUPPORT] Does a cow table support being writing by mor type? and a mor table support being writing cow type?

2021-03-22 Thread GitBox


shenbinglife commented on issue #2689:
URL: https://github.com/apache/hudi/issues/2689#issuecomment-804514259


   Thanks






[GitHub] [hudi] shenbinglife closed issue #2689: [SUPPORT] Does a cow table support being writing by mor type? and a mor table support being writing cow type?

2021-03-22 Thread GitBox


shenbinglife closed issue #2689:
URL: https://github.com/apache/hudi/issues/2689


   






[GitHub] [hudi] garyli1019 commented on issue #2657: [SUPPORT] SparkSQL/Hive query fails if there are two or more record array fields in MOR table.

2021-03-22 Thread GitBox


garyli1019 commented on issue #2657:
URL: https://github.com/apache/hudi/issues/2657#issuecomment-804510311


   Sorry for the delay. I will try to reproduce this once I finish the release.






[GitHub] [hudi] bvaradar commented on issue #2689: [SUPPORT] Does a cow table support being writing by mor type? and a mor table support being writing cow type?

2021-03-22 Thread GitBox


bvaradar commented on issue #2689:
URL: https://github.com/apache/hudi/issues/2689#issuecomment-804501991


   You can think of MOR as a superset of COW in terms of functionality. So, if you have an existing COW table, it should be straightforward to make it MOR by setting the table type in hoodie.properties. But the opposite migration is not straightforward, as we need to ensure there are no pending compactions before an MOR table can be converted to COW.
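   As a concrete sketch of the switch described above, only the table-type line in hoodie.properties changes (the file lives under the table base path; everything else in it stays untouched):
   
   # <base-path>/.hoodie/hoodie.properties
   hoodie.table.type=MERGE_ON_READ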
   






[jira] [Commented] (HUDI-57) [UMBRELLA] Support ORC Storage

2021-03-22 Thread mithalee mohapatra (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306671#comment-17306671
 ] 

mithalee mohapatra commented on HUDI-57:


Hi. I am planning to generate ORC files from Hudi. Is this task still under development?

> [UMBRELLA] Support ORC Storage
> --
>
> Key: HUDI-57
> URL: https://issues.apache.org/jira/browse/HUDI-57
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Mani Jindal
>Priority: Major
>  Labels: hudi-umbrellas, pull-request-available
> Fix For: 0.8.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/uber/hudi/issues/68]
> https://github.com/uber/hudi/issues/155





[GitHub] [hudi] vinothchandar commented on issue #2672: [SUPPORT] Hang during MOR Upsert after a billion records

2021-03-22 Thread GitBox


vinothchandar commented on issue #2672:
URL: https://github.com/apache/hudi/issues/2672#issuecomment-804481423


   @stackfun The JVM defaults to the stop-the-world collector, so without much memory you could also be facing high GC times.
   Some of these tips are listed here:
   https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide
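   For example, one common knob is moving the executors off the default collector (a sketch; the exact GC flags depend on the workload):
   
   --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails"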
   
   Is the job happy now? 






[GitHub] [hudi] n3nash commented on pull request #2701: [HUDI 1623] New Hoodie Instant on disk format with end time and milliseconds granularity

2021-03-22 Thread GitBox


n3nash commented on pull request #2701:
URL: https://github.com/apache/hudi/pull/2701#issuecomment-804432752


   @vinothchandar Can you take an early, cursory look at this PR?






[GitHub] [hudi] nsivabalan commented on issue #2656: HUDI insert operation is working same as upsert

2021-03-22 Thread GitBox


nsivabalan commented on issue #2656:
URL: https://github.com/apache/hudi/issues/2656#issuecomment-804379000


   I am not sure about case sensitivity of the operation type. Can you try "insert" as the operation type instead of "INSERT"? From what I know, the "insert" operation should not update existing records, but just add incoming records as new records.
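   For what it's worth, a minimal sketch of passing the lower-case value through the datasource write option (the field names, table name and path are hypothetical):
   
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;

   class InsertOpSketch {
       // Writes df with the "insert" operation so incoming records land as new records.
       static void writeInsert(Dataset<Row> df, String basePath) {
           df.write().format("hudi")
               .option("hoodie.datasource.write.operation", "insert")  // lower-case value
               .option("hoodie.datasource.write.recordkey.field", "id")
               .option("hoodie.datasource.write.precombine.field", "ts")
               .option("hoodie.table.name", "test_table")
               .mode(SaveMode.Append)
               .save(basePath);
       }
   }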






[GitHub] [hudi] vburenin commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage

2021-03-22 Thread GitBox


vburenin commented on issue #2692:
URL: https://github.com/apache/hudi/issues/2692#issuecomment-804279383


   Hm, you are using 1.x; I am on 2.1.x, and 2.2 seems a little broken. In my case the gcs-connector is baked into the Spark image.






[GitHub] [hudi] stackfun commented on issue #2672: [SUPPORT] Hang during MOR Upsert after a billion records

2021-03-22 Thread GitBox


stackfun commented on issue #2672:
URL: https://github.com/apache/hudi/issues/2672#issuecomment-804263668


   Seems like I mitigated the hangs by changing the following Spark config:
   `"spark.executor.memory": "6g"`
   
   And I changed the following Hudi config:
   `"hoodie.memory.merge.fraction": "0.75"`






[GitHub] [hudi] stackfun commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage

2021-03-22 Thread GitBox


stackfun commented on issue #2692:
URL: https://github.com/apache/hudi/issues/2692#issuecomment-804261225


   Yes, I am using the GCS connector. It seems to be configured automatically when submitting Spark jobs through Dataproc on GCP.
   
   When using hudi-cli.sh on the master node of a dataproc cluster, I export 
the following before running the script. `export 
CLIENT_JAR=/usr/local/share/google/dataproc/lib/gcs-connector.jar:/usr/local/share/gogle/dataproc/lib/gcs-connector-hadoop2-1.9.17.jar`
   
   






[GitHub] [hudi] vburenin commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage

2021-03-22 Thread GitBox


vburenin commented on issue #2692:
URL: https://github.com/apache/hudi/issues/2692#issuecomment-804242670


   @stackfun What's your config setup to connect to GCS? In my case I use the gcs connector.






[GitHub] [hudi] vburenin edited a comment on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage

2021-03-22 Thread GitBox


vburenin edited a comment on issue #2692:
URL: https://github.com/apache/hudi/issues/2692#issuecomment-804220408


   @nsivabalan Nope, but there were huge data losses with Hudi 0.5.0 with MoR. I haven't tried a MoR table with 0.7.0, only CoW.






[GitHub] [hudi] gopi-t2s commented on issue #2406: [SUPPORT] HoodieMultiTableDeltastreamer - Bypassing SchemaProvider-Class requirement for ParquetDFS

2021-03-22 Thread GitBox


gopi-t2s commented on issue #2406:
URL: https://github.com/apache/hudi/issues/2406#issuecomment-804192510


   Thanks @nsivabalan for confirming.






[GitHub] [hudi] nsivabalan commented on issue #2657: [SUPPORT] SparkSQL/Hive query fails if there are two or more record array fields in MOR table.

2021-03-22 Thread GitBox


nsivabalan commented on issue #2657:
URL: https://github.com/apache/hudi/issues/2657#issuecomment-804179974


   @garyli1019: when you get time, I would appreciate it if you could follow up on this.






[GitHub] [hudi] nsivabalan closed issue #2664: [SUPPORT] Spark empty dataframe problem

2021-03-22 Thread GitBox


nsivabalan closed issue #2664:
URL: https://github.com/apache/hudi/issues/2664


   






[GitHub] [hudi] nsivabalan commented on issue #2675: [SUPPORT] Unable to query MOR table after schema evolution

2021-03-22 Thread GitBox


nsivabalan commented on issue #2675:
URL: https://github.com/apache/hudi/issues/2675#issuecomment-804178486


   You can add null as the default value for your new field, if that would work for you.






[GitHub] [hudi] nsivabalan commented on issue #2675: [SUPPORT] Unable to query MOR table after schema evolution

2021-03-22 Thread GitBox


nsivabalan commented on issue #2675:
URL: https://github.com/apache/hudi/issues/2675#issuecomment-804177670


   Yeah, Hudi just relies on Avro's schema compatibility in general. From the [specification](http://avro.apache.org/docs/current/spec.html#Schema+Resolution), it looks like adding a new field w/o a default will error out:
   ```
   if the reader's record schema has a field with no default value, and writer's schema does not have a field with the same name, an error is signalled.
   ```
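   As a minimal sketch of the rule quoted above, using plain Avro APIs (record and field names are hypothetical): declaring the new field as a nullable union with default null keeps reader and writer schemas compatible.
   
   import org.apache.avro.Schema;
   import org.apache.avro.SchemaCompatibility;

   class SchemaEvolutionSketch {
       public static void main(String[] args) {
           Schema writer = new Schema.Parser().parse(
               "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
                   + "{\"name\":\"id\",\"type\":\"string\"}]}");
           // The new field is a nullable union with default null, so records written
           // with the old schema still resolve against the new reader schema.
           Schema reader = new Schema.Parser().parse(
               "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
                   + "{\"name\":\"id\",\"type\":\"string\"},"
                   + "{\"name\":\"newField\",\"type\":[\"null\",\"string\"],\"default\":null}]}");
           System.out.println(SchemaCompatibility
               .checkReaderWriterCompatibility(reader, writer).getType()); // COMPATIBLE
       }
   }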






[GitHub] [hudi] nsivabalan commented on issue #2664: [SUPPORT] Spark empty dataframe problem

2021-03-22 Thread GitBox


nsivabalan commented on issue #2664:
URL: https://github.com/apache/hudi/issues/2664#issuecomment-804178816


   thanks!






[GitHub] [hudi] nsivabalan commented on issue #2680: [SUPPORT]Hive sync error by using run_sync_tool.sh

2021-03-22 Thread GitBox


nsivabalan commented on issue #2680:
URL: https://github.com/apache/hudi/issues/2680#issuecomment-804134521


   @n3nash: Can you help here?






[GitHub] [hudi] nsivabalan commented on issue #2688: [SUPPORT] Sync to Hive using Metastore

2021-03-22 Thread GitBox


nsivabalan commented on issue #2688:
URL: https://github.com/apache/hudi/issues/2688#issuecomment-804132641


   @n3nash: Can you help here, or loop in someone who has experience with the Hive metastore?






[GitHub] [hudi] nsivabalan commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage

2021-03-22 Thread GitBox


nsivabalan commented on issue #2692:
URL: https://github.com/apache/hudi/issues/2692#issuecomment-804132173


   @vburenin: Do you know of any such issues?






[GitHub] [hudi] nsivabalan commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage

2021-03-22 Thread GitBox


nsivabalan commented on issue #2692:
URL: https://github.com/apache/hudi/issues/2692#issuecomment-804131933


   Interesting. I will have to try it out locally to reproduce. Will keep you posted; thanks for reporting.






[GitHub] [hudi] liujinhui1994 closed pull request #2706: [RFC-20][HUDI-648]ERROR TABLE TEST CI

2021-03-22 Thread GitBox


liujinhui1994 closed pull request #2706:
URL: https://github.com/apache/hudi/pull/2706


   






[GitHub] [hudi] garyli1019 closed pull request #2706: [RFC-20][HUDI-648]ERROR TABLE TEST CI

2021-03-22 Thread GitBox


garyli1019 closed pull request #2706:
URL: https://github.com/apache/hudi/pull/2706


   






[GitHub] [hudi] nsivabalan commented on issue #2637: [SUPPORT] - Partial Update : update few columns of a table

2021-03-22 Thread GitBox


nsivabalan commented on issue #2637:
URL: https://github.com/apache/hudi/issues/2637#issuecomment-804036829


   What was the issue or fix? Do you mind updating it here?






[GitHub] [hudi] nsivabalan commented on issue #2406: [SUPPORT] HoodieMultiTableDeltastreamer - Bypassing SchemaProvider-Class requirement for ParquetDFS

2021-03-22 Thread GitBox


nsivabalan commented on issue #2406:
URL: https://github.com/apache/hudi/issues/2406#issuecomment-804036242


   Yes, as you can see from the commit, it was merged 2 to 3 weeks back. We have an upcoming release in a week or two, so you should have it in 0.8.0. If you want to verify the fix, you can pull in the latest master and try it out. 0.7.0 does not have this fix.
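   For reference, a sketch of verifying against master with the standard Maven build (the utilities bundle then lands under packaging/hudi-utilities-bundle/target/):
   
   git clone https://github.com/apache/hudi.git && cd hudi
   mvn clean package -DskipTests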






[GitHub] [hudi] gopi-t2s commented on issue #2406: [SUPPORT] HoodieMultiTableDeltastreamer - Bypassing SchemaProvider-Class requirement for ParquetDFS

2021-03-22 Thread GitBox


gopi-t2s commented on issue #2406:
URL: https://github.com/apache/hudi/issues/2406#issuecomment-804020555


   Hi @nsivabalan,
   @SureshK-T2S and I are working together to set up the multi-table delta streamer.
   
   I downloaded the latest version (0.7.0) of hudi-utilities-bundle.jar from the Maven repository (https://mvnrepository.com/artifact/org.apache.hudi/hudi-utilities-bundle_2.11/0.7.0) and tried to run the spark-submit multi-table delta streamer command without providing the schema provider class (hope this is not mandatory now after this fix #2577).
   
   But I am still receiving the same error mentioned above by Suresh.
   **ERROR LOG:**
   `Exception in thread "main" java.lang.NullPointerException
   at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.populateSchemaProviderProps(HoodieMultiTableDeltaStreamer.java:150)
   at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.populateTableExecutionContextList(HoodieMultiTableDeltaStreamer.java:130)
   at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.<init>(HoodieMultiTableDeltaStreamer.java:80)
   at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.main(HoodieMultiTableDeltaStreamer.java:203)`
   
   **SPARK SUBMIT COMMAND**
   `spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer ~/hudi/hudi-utilities-bundle_2.11-0.7.0.jar \
    --table-type COPY_ON_WRITE \
    --props s3://path/s3_source.properties \
    --config-folder s3://folder-path \
    --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
    --source-ordering-field updated_at \
    --base-path-prefix s3://object --target-table dummy_table --op UPSERT`
   
   Do I miss anything here, or is the above PR not merged into the 0.7.0 Maven jar?
   Could you share your valuable thoughts here?
   
   Thank you.
   






[GitHub] [hudi] Sugamber commented on issue #2637: [SUPPORT] - Partial Update : update few columns of a table

2021-03-22 Thread GitBox


Sugamber commented on issue #2637:
URL: https://github.com/apache/hudi/issues/2637#issuecomment-804002453


   I'm able to resolve the class-not-found exception.






[GitHub] [hudi] liujinhui1994 opened a new pull request #2706: [RFC-20][HUDI-648]WIP

2021-03-22 Thread GitBox


liujinhui1994 opened a new pull request #2706:
URL: https://github.com/apache/hudi/pull/2706


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.






[GitHub] [hudi] danny0405 closed pull request #2702: [HUDI-1710] Read optimized query type for Flink batch reader

2021-03-22 Thread GitBox


danny0405 closed pull request #2702:
URL: https://github.com/apache/hudi/pull/2702


   






[GitHub] [hudi] danny0405 commented on pull request #2702: [HUDI-1710] Read optimized query type for Flink batch reader

2021-03-22 Thread GitBox


danny0405 commented on pull request #2702:
URL: https://github.com/apache/hudi/pull/2702#issuecomment-803993688


   Close and reopen to re-trigger the CI tests.






[GitHub] [hudi] codecov-io edited a comment on pull request #2702: [HUDI-1710] Read optimized query type for Flink batch reader

2021-03-22 Thread GitBox


codecov-io edited a comment on pull request #2702:
URL: https://github.com/apache/hudi/pull/2702#issuecomment-803982515


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2702?src=pr=h1) Report
   > Merging 
[#2702](https://codecov.io/gh/apache/hudi/pull/2702?src=pr=desc) (701c84d) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/ce3e8ec87083ef4cd4f33de39b6697f66ff3f277?el=desc)
 (ce3e8ec) will **decrease** coverage by `0.03%`.
   > The diff coverage is `50.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2702/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2702?src=pr=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2702      +/-   ##
   ============================================
   - Coverage     51.76%   51.73%    -0.04%
   + Complexity     3602     3601        -1
   ============================================
     Files           476      476
     Lines         22579    22592       +13
     Branches       2408     2409        +1
   ============================================
   - Hits          11688    11687        -1
   - Misses         9874     9886       +12
   - Partials       1017     1019        +2
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.92% <ø> (-0.01%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `54.13% <50.00%> (-0.15%)` | `0.00 <0.00> (ø)` | |
   | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `70.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `45.58% <ø> (-0.12%)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.73% <ø> (-0.06%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2702?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../java/org/apache/hudi/table/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNvdXJjZS5qYXZh) | `61.44% <50.00%> (-3.53%)` | `28.00 <0.00> (ø)` | |
   | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `71.37% <0.00%> (-0.35%)` | `55.00% <0.00%> (-1.00%)` | |
   | [...g/apache/hudi/common/config/LockConfiguration.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Mb2NrQ29uZmlndXJhdGlvbi5qYXZh) | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
   | [...rg/apache/hudi/hive/HiveMetastoreLockProvider.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZU1ldGFzdG9yZUxvY2tQcm92aWRlci5qYXZh) | | | |
   | [...ache/hudi/hive/HiveMetastoreBasedLockProvider.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZU1ldGFzdG9yZUJhc2VkTG9ja1Byb3ZpZGVyLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (?%)` | |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io commented on pull request #2702: [HUDI-1710] Read optimized query type for Flink batch reader

2021-03-22 Thread GitBox


codecov-io commented on pull request #2702:
URL: https://github.com/apache/hudi/pull/2702#issuecomment-803982515


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2702?src=pr&el=h1) Report
   > Merging [#2702](https://codecov.io/gh/apache/hudi/pull/2702?src=pr&el=desc) (701c84d) into [master](https://codecov.io/gh/apache/hudi/commit/ce3e8ec87083ef4cd4f33de39b6697f66ff3f277?el=desc) (ce3e8ec) will **decrease** coverage by `1.51%`.
   > The diff coverage is `50.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2702/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2702?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2702      +/-   ##
   ============================================
   - Coverage     51.76%   50.24%   -1.52%
   + Complexity     3602     3216     -386
   ============================================
     Files           476      418      -58
     Lines         22579    19411    -3168
     Branches       2408     2050     -358
   ============================================
   - Hits          11688     9754    -1934
   + Misses         9874     8831    -1043
   + Partials       1017      826     -191
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.92% <ø> (-0.01%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `54.13% <50.00%> (-0.15%)` | `0.00 <0.00> (ø)` | |
   | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.73% <ø> (-0.06%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2702?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../java/org/apache/hudi/table/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNvdXJjZS5qYXZh) | `61.44% <50.00%> (-3.53%)` | `28.00 <0.00> (ø)` | |
   | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `71.37% <0.00%> (-0.35%)` | `55.00% <0.00%> (-1.00%)` | |
   | [...g/apache/hudi/common/config/LockConfiguration.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Mb2NrQ29uZmlndXJhdGlvbi5qYXZh) | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
   | [.../hive/SlashEncodedHourPartitionValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvU2xhc2hFbmNvZGVkSG91clBhcnRpdGlvblZhbHVlRXh0cmFjdG9yLmphdmE=) | | | |
   | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh) | | | |
   | [...va/org/apache/hudi/hive/util/ColumnNameXLator.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvdXRpbC9Db2x1bW5OYW1lWExhdG9yLmphdmE=) | | | |
   | [...rg/apache/hudi/hive/HiveMetastoreLockProvider.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZU1ldGFzdG9yZUxvY2tQcm92aWRlci5qYXZh) | | | |
   | [...g/apache/hudi/MergeOnReadIncrementalRelation.scala](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL01lcmdlT25SZWFkSW5jcmVtZW50YWxSZWxhdGlvbi5zY2FsYQ==) | | | |
   | [...in/scala/org/apache/hudi/HoodieEmptyRelation.scala](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZUVtcHR5UmVsYXRpb24uc2NhbGE=) | | | |
   | [...in/scala/org/apache/hudi/IncrementalRelation.scala](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0luY3JlbWVudGFsUmVsYXRpb24uc2NhbGE=) | | | |
   | ... and [51 more](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr&el=tree-more) | |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [hudi] cdmikechen opened a new issue #2705: [SUPPORT] Can not read data schema using Spark3.0.2 on k8s with hudi-utilities (build in 2.12 and spark3)

2021-03-22 Thread GitBox


cdmikechen opened a new issue #2705:
URL: https://github.com/apache/hudi/issues/2705


   **Describe the problem you faced**
   
   I use the Spark operator on OpenShift 4.6 to receive Kafka data and insert it 
into a Hudi table. I use `hudi-utilities_2.12` (a Maven build for Scala 2.12 and 
Spark 3), with Debezium reading the MySQL binlog.
   When Spark reads the Kafka data, it fails with the error shown under 
*Stacktrace* below.
   I don't know whether this is a bug in Hudi 0.7.0 with Spark 3, or whether 
Spark 3 has a problem with the structure of the Avro schema; the same program 
runs fine with Hudi 0.6.0 on Spark on YARN.
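   For reference, a minimal sketch of this kind of hudi-utilities job, written 
against the `HoodieDeltaStreamer` API. The base path, ordering field, and 
properties file below are hypothetical stand-ins, not values from the actual job:
   
   ```java
   import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;
   import org.apache.spark.SparkConf;
   import org.apache.spark.api.java.JavaSparkContext;
   
   public class DebeziumToHudi {
     public static void main(String[] args) throws Exception {
       HoodieDeltaStreamer.Config cfg = new HoodieDeltaStreamer.Config();
       cfg.targetBasePath = "s3a://warehouse/hoodie_source";    // hypothetical path
       cfg.targetTableName = "hoodie_source";
       cfg.tableType = "COPY_ON_WRITE";
       // Read Debezium's Avro records from Kafka; schemas come from a registry.
       cfg.sourceClassName = "org.apache.hudi.utilities.sources.AvroKafkaSource";
       cfg.schemaProviderClassName = "org.apache.hudi.utilities.schema.SchemaRegistryProvider";
       cfg.sourceOrderingField = "ts_ms";                       // hypothetical field
       cfg.propsFilePath = "/opt/hudi/kafka-source.properties"; // broker/registry/topic settings
   
       JavaSparkContext jssc =
           new JavaSparkContext(new SparkConf().setAppName("debezium-to-hudi"));
       new HoodieDeltaStreamer(cfg, jssc).sync();
     }
   }
   ```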
   The Debezium Avro schema is:
   ```json
   [{
       "type": "record",
       "name": "hoodie_source",
       "namespace": "hoodie.source",
       "fields": [{
           "name": "before",
           "type": [{
               "type": "record",
               "name": "before",
               "namespace": "hoodie.source.hoodie_source",
               "fields": [{
                   "name": "id",
                   "type": "int"
               }, {
                   "name": "name",
                   "type": ["string", "null"]
               }, {
                   "name": "type",
                   "type": ["string", "null"]
               }, {
                   "name": "url",
                   "type": ["string", "null"]
               }, {
                   "name": "user",
                   "type": ["string", "null"]
               }, {
                   "name": "password",
                   "type": ["string", "null"]
               }, {
                   "name": "create_time",
                   "type": ["string", "null"]
               }, {
                   "name": "create_user",
                   "type": ["string", "null"]
               }, {
                   "name": "update_time",
                   "type": ["string", "null"]
               }, {
                   "name": "update_user",
                   "type": ["string", "null"]
               }, {
                   "name": "del_flag",
                   "type": ["int", "null"]
               }]
           }, "null"]
       }, {
           "name": "after",
           "type": [{
               "type": "record",
               "name": "after",
               "namespace": "hoodie.source.hoodie_source",
               "fields": [{
                   "name": "id",
                   "type": "int"
               }, {
                   "name": "name",
                   "type": ["string", "null"]
               }, {
                   "name": "type",
                   "type": ["string", "null"]
               }, {
                   "name": "url",
                   "type": ["string", "null"]
               }, {
                   "name": "user",
                   "type": ["string", "null"]
               }, {
                   "name": "password",
                   "type": ["string", "null"]
               }, {
                   "name": "create_time",
                   "type": ["string", "null"]
               }, {
                   "name": "create_user",
                   "type": ["string", "null"]
               }, {
                   "name": "update_time",
                   "type": ["string", "null"]
               }, {
                   "name": "update_user",
                   "type": ["string", "null"]
               }, {
                   "name": "del_flag",
                   "type": ["int", "null"]
               }]
           }, "null"]
       }, {
           "name": "source",
           "type": {
               "type": "record",
               "name": "source",
               "namespace": "hoodie.source.hoodie_source",
               "fields": [{
                   "name": "version",
                   "type": "string"
               }, {
                   "name": "connector",
                   "type": "string"
               }, {
                   "name": "name",
                   "type": "string"
               }, {

[GitHub] [hudi] liujinhui1994 closed pull request #2704: [RFC-20] WIP

2021-03-22 Thread GitBox


liujinhui1994 closed pull request #2704:
URL: https://github.com/apache/hudi/pull/2704


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liujinhui1994 opened a new pull request #2704: [RFC-20] WIP

2021-03-22 Thread GitBox


liujinhui1994 opened a new pull request #2704:
URL: https://github.com/apache/hudi/pull/2704


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash merged pull request #2699: [HUDI-1709] Improving lock config names and adding hive metastore uri config

2021-03-22 Thread GitBox


n3nash merged pull request #2699:
URL: https://github.com/apache/hudi/pull/2699


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1709] Improving config names and adding hive metastore uri config (#2699)

2021-03-22 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new d7b1878  [HUDI-1709] Improving config names and adding hive metastore 
uri config (#2699)
d7b1878 is described below

commit d7b18783bdd6edd6355ee68714982401d3321f86
Author: n3nash 
AuthorDate: Mon Mar 22 01:22:06 2021 -0700

[HUDI-1709] Improving config names and adding hive metastore uri config 
(#2699)
---
 .../transaction/lock/ZookeeperBasedLockProvider.java |  3 ++-
 .../java/org/apache/hudi/config/HoodieLockConfig.java| 15 +++
 .../org/apache/hudi/common/config/LockConfiguration.java | 11 +++
 .../hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java |  2 +-
 ...Provider.java => HiveMetastoreBasedLockProvider.java} | 16 +++-
 ...ider.java => TestHiveMetastoreBasedLockProvider.java} | 10 +-
 .../utilities/functional/TestHoodieDeltaStreamer.java|  2 +-
 7 files changed, 42 insertions(+), 17 deletions(-)
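
For context, a hypothetical wiring of the renamed provider, using the 
`LockConfiguration` constants referenced in the diff below and assuming 
`LockConfiguration` wraps a `TypedProperties` (its `getConfig()` usage below 
suggests as much). The database, table, URI, and retry values are placeholders, 
and constructing the provider assumes a reachable metastore:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.config.LockConfiguration;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.hive.HiveMetastoreBasedLockProvider;

public class MetastoreLockExample {
  public static void main(String[] args) {
    TypedProperties props = new TypedProperties();
    props.setProperty(LockConfiguration.HIVE_DATABASE_NAME_PROP, "default");
    props.setProperty(LockConfiguration.HIVE_TABLE_NAME_PROP, "my_hudi_table");
    // New in this commit: point the provider at a remote Hive metastore.
    props.setProperty(LockConfiguration.HIVE_METASTORE_URI_PROP, "thrift://metastore:9083");
    props.setProperty(LockConfiguration.LOCK_ACQUIRE_NUM_RETRIES_PROP, "3");
    props.setProperty(LockConfiguration.LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS_PROP, "5000");

    // The constructor builds a HiveConf and opens a metastore client.
    HiveMetastoreBasedLockProvider provider =
        new HiveMetastoreBasedLockProvider(new LockConfiguration(props), new Configuration());
  }
}
```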

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/ZookeeperBasedLockProvider.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/ZookeeperBasedLockProvider.java
index 60336c5..8a80685 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/ZookeeperBasedLockProvider.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/ZookeeperBasedLockProvider.java
@@ -39,6 +39,7 @@ import java.util.concurrent.TimeUnit;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_ZK_CONNECTION_TIMEOUT_MS;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_ZK_SESSION_TIMEOUT_MS;
 import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_NUM_RETRIES_PROP;
+import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_RETRY_MAX_WAIT_TIME_IN_MILLIS_PROP;
 import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS_PROP;
 import static 
org.apache.hudi.common.config.LockConfiguration.ZK_BASE_PATH_PROP;
 import static 
org.apache.hudi.common.config.LockConfiguration.ZK_CONNECTION_TIMEOUT_MS_PROP;
@@ -65,7 +66,7 @@ public class ZookeeperBasedLockProvider implements 
LockProvider {
+public class HiveMetastoreBasedLockProvider implements 
LockProvider<LockResponse> {
 
-  private static final Logger LOG = 
LogManager.getLogger(HiveMetastoreLockProvider.class);
+  private static final Logger LOG = 
LogManager.getLogger(HiveMetastoreBasedLockProvider.class);
 
   private final String databaseName;
   private final String tableName;
+  private final String hiveMetastoreUris;
   private IMetaStoreClient hiveClient;
   private volatile LockResponse lock = null;
   protected LockConfiguration lockConfiguration;
   ExecutorService executor = Executors.newSingleThreadExecutor();
 
-  public HiveMetastoreLockProvider(final LockConfiguration lockConfiguration, 
final Configuration conf) {
+  public HiveMetastoreBasedLockProvider(final LockConfiguration 
lockConfiguration, final Configuration conf) {
 this(lockConfiguration);
 try {
   HiveConf hiveConf = new HiveConf();
@@ -91,16 +93,17 @@ public class HiveMetastoreLockProvider implements 
LockProvider<LockResponse> {
 }
   }
 
-  public HiveMetastoreLockProvider(final LockConfiguration lockConfiguration, 
final IMetaStoreClient metaStoreClient) {
+  public HiveMetastoreBasedLockProvider(final LockConfiguration 
lockConfiguration, final IMetaStoreClient metaStoreClient) {
 this(lockConfiguration);
 this.hiveClient = metaStoreClient;
   }
 
-  HiveMetastoreLockProvider(final LockConfiguration lockConfiguration) {
+  HiveMetastoreBasedLockProvider(final LockConfiguration lockConfiguration) {
 checkRequiredProps(lockConfiguration);
 this.lockConfiguration = lockConfiguration;
 this.databaseName = 
this.lockConfiguration.getConfig().getString(HIVE_DATABASE_NAME_PROP);
 this.tableName = 
this.lockConfiguration.getConfig().getString(HIVE_TABLE_NAME_PROP);
+this.hiveMetastoreUris = 
this.lockConfiguration.getConfig().getOrDefault(HIVE_METASTORE_URI_PROP, 
"").toString();
   }
 
   @Override
@@ -206,6 +209,9 @@ public class HiveMetastoreLockProvider implements 
LockProvider<LockResponse> {
   }
 
   private void setHiveLockConfs(HiveConf hiveConf) {
+if (!StringUtils.isNullOrEmpty(this.hiveMetastoreUris)) {
+  hiveConf.setVar(HiveConf.ConfVars.METASTOREURIS, this.hiveMetastoreUris);
+}
 hiveConf.set("hive.support.concurrency", "true");
 hiveConf.set("hive.lock.manager", 
"org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager");
 hiveConf.set("hive.lock.numretries", 
lockConfiguration.getConfig().getString(LOCK_ACQUIRE_NUM_RETRIES_PROP));
diff --git 

[hudi] branch asf-site updated: [HUDI-1679] Concurrency Control in Hudi (#2698)

2021-03-22 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new e5caa41  [HUDI-1679] Concurrency Control in Hudi (#2698)
e5caa41 is described below

commit e5caa41de16cfe3213e638237e0dff46ddf1bf96
Author: n3nash 
AuthorDate: Mon Mar 22 01:20:15 2021 -0700

[HUDI-1679] Concurrency Control in Hudi (#2698)
---
 docs/_data/navigation.yml |   2 +
 docs/_docs/2_4_configurations.md  |  63 ++
 docs/_docs/2_9_concurrency_control.md | 151 ++
 3 files changed, 216 insertions(+)

diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml
index 5803a43..114bed3 100644
--- a/docs/_data/navigation.yml
+++ b/docs/_data/navigation.yml
@@ -28,6 +28,8 @@ docs:
 url: /docs/use_cases.html
   - title: "Writing Data"
 url: /docs/writing_data.html
+  - title: "Concurrency Control"
+url: /docs/concurrency_control.html
   - title: "Querying Data"
 url: /docs/querying_data.html
   - title: "Configuration"
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index ec35e64..e176550 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -824,3 +824,66 @@ Property: `hoodie.write.commit.callback.kafka.acks` 
 # CALLBACK_KAFKA_RETRIES
 Property: `hoodie.write.commit.callback.kafka.retries` 
 Times to retry. 3 by default
+
+### Locking configs
+Configs that control locking mechanisms if 
[WriteConcurrencyMode=optimistic_concurrency_control](#WriteConcurrencyMode) is 
enabled
+[withLockConfig](#withLockConfig) (HoodieLockConfig) 
+
+ withLockProvider(lockProvider = 
org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider) 
{#withLockProvider}
+Property: `hoodie.writer.lock.provider` 
+Lock provider class name, user can provide their own 
implementation of LockProvider which should be subclass of 
org.apache.hudi.common.lock.LockProvider
+
+ withZkQuorum(zkQuorum) {#withZkQuorum}
+Property: `hoodie.writer.lock.zookeeper.url` 
+Set the list of comma separated servers to connect 
to
+
+ withZkBasePath(zkBasePath) {#withZkBasePath}
+Property: `hoodie.writer.lock.zookeeper.base_path` [Required] 
+The base path on Zookeeper under which to create a 
ZNode to acquire the lock. This should be common for all jobs writing to the 
same table
+
+ withZkPort(zkPort) {#withZkPort}
+Property: `hoodie.writer.lock.zookeeper.port` [Required] 
+The connection port to be used for Zookeeper
+
+ withZkLockKey(zkLockKey) {#withZkLockKey}
+Property: `hoodie.writer.lock.zookeeper.lock_key` [Required] 
+Key name under base_path at which to create a ZNode 
and acquire lock. Final path on zk will look like base_path/lock_key. We 
recommend setting this to the table name
+
+ withZkConnectionTimeoutInMs(connectionTimeoutInMs = 15000) 
{#withZkConnectionTimeoutInMs}
+Property: `hoodie.writer.lock.zookeeper.connection_timeout_ms` 
+How long to wait when connecting to ZooKeeper before 
considering the connection a failure
+
+ withZkSessionTimeoutInMs(sessionTimeoutInMs = 6) 
{#withZkSessionTimeoutInMs}
+Property: `hoodie.writer.lock.zookeeper.session_timeout_ms` 
+How long to wait after losing a connection to 
ZooKeeper before the session is expired
+
+ withNumRetries(num_retries = 3) {#withNumRetries}
+Property: `hoodie.writer.lock.num_retries` 
+Maximum number of times to retry by lock provider 
client
+
+ withRetryWaitTimeInMillis(retryWaitTimeInMillis = 5000) 
{#withRetryWaitTimeInMillis}
+Property: `hoodie.writer.lock.wait_time_ms_between_retry` 
+Initial amount of time to wait between retries by 
lock provider client
+
+ withHiveDatabaseName(hiveDatabaseName) {#withHiveDatabaseName}
+Property: `hoodie.writer.lock.hivemetastore.database` [Required] 
+The Hive database to acquire lock against
+
+ withHiveTableName(hiveTableName) {#withHiveTableName}
+Property: `hoodie.writer.lock.hivemetastore.table` [Required] 
+The Hive table under the hive database to acquire 
lock against
+
+ withClientNumRetries(clientNumRetries = 0) {#withClientNumRetries}
+Property: `hoodie.writer.lock.client.num_retries` 
+Maximum number of times to retry to acquire lock 
additionally from the hudi client
+
+ withRetryWaitTimeInMillis(retryWaitTimeInMillis = 1) 
{#withRetryWaitTimeInMillis}
+Property: `hoodie.writer.lock.client.wait_time_ms_between_retry` 
+Amount of time to wait between retries from the hudi 
client
+
+ withConflictResolutionStrategy(lockProvider = 
org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy)
 {#withConflictResolutionStrategy}
+Property: `hoodie.writer.lock.conflict.resolution.strategy` 
+Conflict resolution strategy class name; this should be a subclass of 
org.apache.hudi.client.transaction.ConflictResolutionStrategy
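
Putting the properties above together, a minimal sketch in Java of a writer's 
lock configuration. The property keys are exactly as documented; the quorum, 
port, base path, and lock key values are placeholders:

```java
import java.util.Properties;

public class LockConfigExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("hoodie.writer.lock.provider",
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
    props.setProperty("hoodie.writer.lock.zookeeper.url", "zk1.example.com,zk2.example.com");
    props.setProperty("hoodie.writer.lock.zookeeper.port", "2181");
    props.setProperty("hoodie.writer.lock.zookeeper.base_path", "/hudi/locks");
    // The docs recommend the table name here, so all writers of one table
    // contend on the same ZNode (final path: base_path/lock_key).
    props.setProperty("hoodie.writer.lock.zookeeper.lock_key", "my_table");
    props.setProperty("hoodie.writer.lock.num_retries", "3");
    props.setProperty("hoodie.writer.lock.wait_time_ms_between_retry", "5000");
  }
}
```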

[GitHub] [hudi] n3nash merged pull request #2698: [HUDI-1679] Concurrency Control in Hudi

2021-03-22 Thread GitBox


n3nash merged pull request #2698:
URL: https://github.com/apache/hudi/pull/2698


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on pull request #2698: [HUDI-1679] Concurrency Control in Hudi

2021-03-22 Thread GitBox


n3nash commented on pull request #2698:
URL: https://github.com/apache/hudi/pull/2698#issuecomment-803862184


   @nsivabalan Yes, I already built it locally to confirm rendering. Attaching 
pictures for future reference.
   
   https://user-images.githubusercontent.com/2722167/111960187-a5775000-8aac-11eb-8c65-91dea58f4f7f.png
   
   https://user-images.githubusercontent.com/2722167/111960199-a9a36d80-8aac-11eb-8265-5a3edcecf44a.png
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] youngyangp closed issue #2700: [SUPPORT] Hive sync error by using DataFrameWriter

2021-03-22 Thread GitBox


youngyangp closed issue #2700:
URL: https://github.com/apache/hudi/issues/2700


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] legendtkl opened a new pull request #2703: [DOCUMENT] update README doc for integ test

2021-03-22 Thread GitBox


legendtkl opened a new pull request #2703:
URL: https://github.com/apache/hudi/pull/2703


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org