[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
pengzhiwei2018 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r599288753

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala

@@ -112,12 +112,15 @@ private[hudi] object HoodieSparkSqlWriter {
     val archiveLogFolder = parameters.getOrElse(
       HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, "archived")
+    val partitionColumns = parameters.getOrElse(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, null)
+
     val tableMetaClient = HoodieTableMetaClient.withPropertyBuilder()
       .setTableType(tableType)
       .setTableName(tblName)
       .setArchiveLogFolder(archiveLogFolder)
       .setPayloadClassName(parameters(PAYLOAD_CLASS_OPT_KEY))
       .setPreCombineField(parameters.getOrDefault(PRECOMBINE_FIELD_OPT_KEY, null))
+      .setPartitionColumns(partitionColumns)

Review comment: Thank you for reminding me about this.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Sugamber commented on issue #2637: [SUPPORT] - Partial Update : update few columns of a table
Sugamber commented on issue #2637: URL: https://github.com/apache/hudi/issues/2637#issuecomment-804639531

public class PartialColumnUpdate implements HoodieRecordPayload {

  private static final Logger logger = Logger.getLogger(PartialColumnUpdate.class);
  private byte[] recordBytes;
  private Schema schema;
  private Comparable orderingVal;

  public PartialColumnUpdate(GenericRecord genericRecord, Comparable orderingVal) {
    logger.info("Inside two parameter cons");
    try {
      if (genericRecord != null) {
        this.recordBytes = HoodieAvroUtils.avroToBytes(genericRecord);
        this.schema = genericRecord.getSchema();
        this.orderingVal = orderingVal;
      } else {
        this.recordBytes = new byte[0];
      }
    } catch (Exception io) {
      throw new RuntimeException("Cannot convert record to bytes ", io);
    }
  }

  public PartialColumnUpdate(Option record) {
    this(record.isPresent() ? record.get() : null, 0);
  }

  @Override
  public PartialColumnUpdate preCombine(PartialColumnUpdate anotherRecord) {
    logger.info("Inside PreCombine");
    logger.info("preCombine => " + anotherRecord);
    logger.info("another_ordering value" + anotherRecord.orderingVal);
    logger.info("another_ schema value" + anotherRecord.schema);
    logger.info("another_ record bytes value" + anotherRecord.recordBytes);
    if (anotherRecord.orderingVal.compareTo(orderingVal) > 0) {
      return anotherRecord;
    } else {
      return this;
    }
  }

  @Override
  public Option combineAndGetUpdateValue(IndexedRecord indexedRecord, Schema currentSchema) throws IOException {
    logger.info("Inside combineAndGetUpdateValue");
    logger.info("current schema" + currentSchema);
    logger.info("combineUpdate - >" + Option.of(indexedRecord));
    getInsertValue(currentSchema);
    return Option.empty();
  }

  @Override
  public Option getInsertValue(Schema schema) throws IOException {
    logger.info("Inside getInsertValue");
    if (recordBytes.length == 0) {
      return Option.empty();
    }
    IndexedRecord indexedRecord = HoodieAvroUtils.bytesToAvro(recordBytes, schema);
    if (isDeleteRecord((GenericRecord) indexedRecord)) {
      return Option.empty();
    } else {
      return Option.of(indexedRecord);
    }
  }

  protected boolean isDeleteRecord(GenericRecord genericRecord) {
    final String isDeleteKey = "_hoodie_is_deleted";
    if (genericRecord.getSchema().getField(isDeleteKey) == null) {
      return false;
    }
    Object deleteMarker = genericRecord.get(isDeleteKey);
    return (deleteMarker instanceof Boolean && (boolean) deleteMarker);
  }
}
[GitHub] [hudi] Sugamber edited a comment on issue #2637: [SUPPORT] - Partial Update : update few columns of a table
Sugamber edited a comment on issue #2637: URL: https://github.com/apache/hudi/issues/2637#issuecomment-804636424

I have created one class implementing HoodieRecordPayload. We have three methods for which we have to write our logic:
1. preCombine
2. combineAndGetUpdateValue
3. getInsertValue

@n3nash As per your above explanation, preCombine provides the current record coming in the incremental load, and combineAndGetUpdateValue provides the latest record from the hoodie table. Please correct me if my understanding is incorrect. In my use case, I'm only getting a few columns out of 20 in the incremental data, and the preCombine method does not have any schema details. For example: the Hudi table is built with 20 columns. Now the requirement is to update only 3 columns, and only these columns' data comes from the incremental data feeds, along with the RECORDKEY_FIELD_OPT_KEY, PARTITIONPATH_FIELD_OPT_KEY and PRECOMBINE_FIELD_OPT_KEY columns. I have implemented the class as shown in the comment above. Please let me know in which method I'll be getting the full schema of the table.
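For illustration only (this is not Hudi's API): the partial-update behavior being discussed boils down to overlaying the incoming record's present (non-null) fields onto the stored record and keeping the stored values for everything else. A minimal sketch of that merge using plain maps in place of Avro records; the class name, field names, and the helper `mergePartial` are all hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class PartialMergeSketch {

    // Overlay non-null incoming fields onto the current stored row.
    // In a real HoodieRecordPayload implementation this would walk the
    // table schema's fields and copy values between Avro GenericRecords
    // inside combineAndGetUpdateValue instead of using maps.
    static Map<String, Object> mergePartial(Map<String, Object> current,
                                            Map<String, Object> incoming) {
        Map<String, Object> merged = new HashMap<>(current);
        for (Map.Entry<String, Object> e : incoming.entrySet()) {
            if (e.getValue() != null) { // only fields present in the delta win
                merged.put(e.getKey(), e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> current = new HashMap<>();
        current.put("id", 1);
        current.put("name", "a");
        current.put("city", "x");

        Map<String, Object> incoming = new HashMap<>();
        incoming.put("id", 1);
        incoming.put("city", "y");
        incoming.put("name", null); // column absent from the incremental feed

        // "name" keeps the stored value, "city" takes the incoming value
        System.out.println(mergePartial(current, incoming));
    }
}
```

The key design point is that the delete/keep decision is made per field, not per record, which is why the merge needs access to the full table schema (available in combineAndGetUpdateValue, not in preCombine).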
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
pengzhiwei2018 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r599287887

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java

@@ -179,6 +179,9 @@
   public static final String EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION = AVRO_SCHEMA + ".externalTransformation";
   public static final String DEFAULT_EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION = "false";
+  public static final String MAX_LISTING_PARALLELISM = "hoodie.max.list.file.parallelism";
+  public static final Integer DEFAULT_MAX_LISTING_PARALLELISM = 200;

Review comment: Good suggestions!
[GitHub] [hudi] cdmikechen commented on issue #2705: [SUPPORT] Can not read data schema using Spark3.0.2 on k8s with hudi-utilities (build in 2.12 and spark3)
cdmikechen commented on issue #2705: URL: https://github.com/apache/hudi/issues/2705#issuecomment-804636641

I've found the problem: there is a new configuration named `hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable`, and it is `true` by default. If I use my custom transformer and leave the `target schema` null, hudi will not work with a null schema. I had set the `target schema` to the same as the `source schema` for testing, so spark did not work and reported the above errors. If I set `hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable` to false, hudi successfully deals with the Kafka messages and writes them to hdfs. However, when synchronizing hive, I encountered the same problem as https://github.com/apache/hudi/issues/1751#issuecomment-648460431. I think hudi is still missing the related packages for hive3.
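Based on the finding above, the workaround amounts to flipping that one flag in the DeltaStreamer properties. Shown here as a properties-file fragment; the key is exactly the one named in the comment, and where it is set (props file vs. `--hoodie-conf`) depends on how the job is launched:

```properties
# Disable the Spark-Avro schema post processor so a custom transformer
# with no separate target schema can proceed (default is true)
hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable=false
```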
[jira] [Commented] (HUDI-57) [UMBRELLA] Support ORC Storage
[ https://issues.apache.org/jira/browse/HUDI-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306779#comment-17306779 ]

Vinoth Chandar commented on HUDI-57:

[~pwason] can you please update this JIRA? Also, should we assign this to the intern?

> [UMBRELLA] Support ORC Storage
> ------------------------------
>
>                 Key: HUDI-57
>                 URL: https://issues.apache.org/jira/browse/HUDI-57
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Hive Integration, Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Mani Jindal
>            Priority: Major
>              Labels: hudi-umbrellas, pull-request-available
>             Fix For: 0.8.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/uber/hudi/issues/68]
> https://github.com/uber/hudi/issues/155

-- This message was sent by Atlassian Jira (v8.3.4#803005)
svn commit: r46705 - /dev/hudi/KEYS
Author: sivabalan Date: Tue Mar 23 04:50:02 2021 New Revision: 46705 Log: Updating Gary's gpg key Modified: dev/hudi/KEYS Modified: dev/hudi/KEYS == --- dev/hudi/KEYS (original) +++ dev/hudi/KEYS Tue Mar 23 04:50:02 2021 @@ -601,3 +601,64 @@ ECppJfvmGRuNapsZ+KCXiY2wjnM9/EopD5Nsr3E7 9ELkv7No+gWT7/64sox1Zo03duuWYR8bGpCJIcd6Qn99dPZSr59o8TGkrPU= =gJ2E -END PGP PUBLIC KEY BLOCK- + +pub rsa4096 2021-03-22 [SC] [expires: 2025-03-22] + E2A9714E0FBA3A087BDEE655E72873D765D6C406 +uid [ultimate] YanJia Li +sig 3E72873D765D6C406 2021-03-22 YanJia Li +sub rsa4096 2021-03-22 [E] [expires: 2025-03-22] +sig E72873D765D6C406 2021-03-22 YanJia Li + +-BEGIN PGP PUBLIC KEY BLOCK- + +mQINBGBYnFQBEADJfxkjdOufvAOu7yP1Q1wiM+FGQIcaFb7mydFc3/PpQwqxAPoS +GlcorwkTMCdqKSxR4+p5B9xnfux9qXOydoKof0srhMLudD7lxZa6xAn0OeC2jeqk +mFhXhw2/r+iuon9x7Rzts0HY7XvM3juQpTNa1cOi2jTsALpOyo2qDhPwNc7MNasC +0OKuE0UwGfcDpd9TILIvOlssTNyHcYumavcDZBW9eZMpGF4jASPQzQ0iXnEAHyEr +I55z9q760qNfAW72SO6vKBJZZVWUoCepGzOaB9VaX3fcYdfuOEm4bfKi4qEEEUaF +aOeAo5jMbu+fhSDPBqfvthRyJitmit4rq49ijXJlwU8++mAEDUcLZ7SNMfnMht/N +NazDmz5wXjFcbyKmaYAkQ/Q+7M161QsGLFq3WGmFej1Yv/nCo3tfM3j3aEc74jzR +ylUQQQE+alJwVdN4CJ5SkyBtjBWMTbSHHagRlFoxnLUSktCOTM31vGVIoi/DrSdD +Opxy6BatTIcUcrEW+XRkqeApmiBS6Oss6H0I0qBQJZL+o5F0wT8lrrwioy/qEzR1 +pgmtccHm4TBfa21CJDyNp8+VqM99fteM57dxBwHerR7vGlRfBjNY/s9SeUwKiNZw +L7pmyQfhWXAN3m88xutpKoGpKwSL5S1rnvJl8N0dqeThSzZOB4i6zjUhQQARAQAB +tB1ZYW5KaWEgTGkgPGdhcnlsaUBhcGFjaGUub3JnPokCVAQTAQgAPhYhBOKpcU4P +ujoIe97mVecoc9dl1sQGBQJgWJxUAhsDBQkHhh8LBQsJCAcCBhUKCQgLAgQWAgMB +Ah4BAheAAAoJEOcoc9dl1sQG9BIQAMJCp6lS5ycQXDE83XL/VaVO8iPIWiZySd7P +Hf/XKab/kFIsXbAPrR5pPkcL8DzlarvklY7tTWfgzgY3yhh5L42eAdgH10Na1JWg +x/JbBGea4I89v8lRMqAcslSmts9TyCZv4aRwwV9bwf9Y7b3WGXrd4gv8fd2XZtfH +7pNNPg/B5XiWfTOQkV0S6I5lnpvgrNed3+BRJn+jYZrLLIlhPck4vShLtCnjm7TR +XNrDilRxpSzs0d8Fzgp+paWuMX+W47CzKnRZGyISQ/KJfBlacEirNEyDy+j4P3er +Pyn77QSFoBVM3SbM4wY40P+SW6bTblY+3ntO4Shb/2USb3J+w8jmwzkUXwmljgMD +ojfvQa3rO5rPfPItdaRRtEH9YQvcYdZtnG7NwRRCc8SoqeJfsqYYEo6Iw8JVJqw9 
++CIBQKie5z7/iS2/DEG4lQx57VzMURdZOoFUOvw6MEdqBqlMmwJyqG3caIXW9f4i +T4TSQCr9M0ziJELCZSHBcJ5W6fB+bhYeRsZer32tcIQTnODuNrgZic7gngLsTC4E +nl1Z6lVNczb0aQ0oGRBVb5dNROdKdSrk8hCyP4MQa2rJ4KenwrL0eyhUiIC8Bg5Q +lLGErxnNP/cTPufTgzMcvVQ0PYliyOlGEUndOw0pFORg/xi1RzCNJQIi/k8bNoEj +6u/gRkIRuQINBGBYnFQBEACzmYLb2UhAnG+Q059H1iTQbLSWektYx1WQY5Q9YjCG +hwwimY1D2ePqVI2OfSwY/aAyM14t70LOeZFG3JtjE7wzuIltGTIPBiVIJRDdIeJv +kWImZw4vbN+kMBhBnQwr7U3KNdwuD70MxHjCuQ2LFmP+Jb/Sv/6/kRr+s42PQAbL +qH0FAA8Qcv99gg/dEC0uOOHB4UE7jEEDqIbedkk6GUy9YtgplbDlk+L1I0sqh7vu +UI2bO42C1jDrNqgKJ4yQNNswriU3iri/i+0kwEFA8/oNIxdVpGiZrfBuwxmNTE4X +A9dCrDpPGIs/gKS/vaykEqHdgi33D1DGtUHuCbUNalb7Er/22PhPbeZkK0sqpuKl +u7AYdJpE1PNIiR6qMdBpQlY9F76BAwNq6gWx3eSrsgXWJ5ar5pLlhbF5j6sPjF/i +DKwBKxY0fSLBoCv2abEgzvkjUwYxUsTXkKrEx0rYUbia2WuP95VjiTrrPs0LQrgp +KG3Zn1FEOHqEN6kY3GeeBIMAnepYiCNS3WaD7RIlRKAkzVC82pYG71tP1/meDXTV +LsGOjQrIDQpLDxtiG1rwRin6tzjKcUDBfIi4y1czAHjzEnx9uHCWvbEZ6KlTI35I +SWFdLoFf2QDkmA6dgC4i0+emP4bZHRMLd7JRo+1ozTZ+hDq+z3QJJZacIH4a02vl +eQARAQABiQI8BBgBCAAmFiEE4qlxTg+6Ogh73uZV5yhz12XWxAYFAmBYnFQCGwwF +CQeGHwsACgkQ5yhz12XWxAbb3w//ai7iGR7WL2Wh6OvXICtS2WxAnXHu8XOsl91f +tf0gx6oTWI0u2VbSqJDKJG5rbUPXyCmJbG32eq3PjTYWS0jT2kQFqkWQ5wX6AqZp +lVNkT0GmmBuHRA71sp1PUHK2DaVDmHaTDncSvcdzDra8d0+/ANZ8licZlXF8D9rz +9zGnxU/mbZ98xUJcVK3w8yea98bTV2cQLlTgYjLfmFoA/a8zyeuIotTUCELIA0Wq +sAs8b0ORVm9Hk4G1q5eBem1FY8CzQQvVngrMUTOZdj1f0KXmM1Vii+T8eU8ukT82 +bScU7YRmO/XdMNaijmrsHmdP1ybW2KuP16m3ZIxXUu/mD6HIYCIFrIuin425E2kT +hSh7xyZQMGRyJ+HlzUKm4d8Mg05SmErDaA+4APN5F6lP47ED0kT8RkRGmGBWWaZU +sHbsjj6WYsEAVUcxErn+DelSS31j9P+8sCyI4Yi9/1IAr5VvYQXrrH3veCRHVjQZ +KK/zA7lkrn26yldsuZXq4DArTmFUhCwRSNDEQgcfh/HOpmT8r7WZEGRBb99xXLyY +HJzrVHbpIPxUpvBFld41Eyepuoij+pY7zyb/mCk5KMPEVK4XyYG9PpuPdER7EFzo +K5pVT2a4wL0e0/ekCsGfbEn+2xubSrfWZ+M3YoIlX6uVykQrKH+NjoUlLuqv7PVF +wV4zPZQ= +=0iU3 +-END PGP PUBLIC KEY BLOCK- +
[GitHub] [hudi] Sugamber commented on issue #2637: [SUPPORT] - Partial Update : update few columns of a table
Sugamber commented on issue #2637: URL: https://github.com/apache/hudi/issues/2637#issuecomment-804608999

@nsivabalan, I had created a shaded jar and it was causing the issues, as a few dependency versions were conflicting.
[GitHub] [hudi] umehrot2 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
umehrot2 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r599160889

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java

@@ -179,6 +179,9 @@
   public static final String EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION = AVRO_SCHEMA + ".externalTransformation";
   public static final String DEFAULT_EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION = "false";
+  public static final String MAX_LISTING_PARALLELISM = "hoodie.max.list.file.parallelism";
+  public static final Integer DEFAULT_MAX_LISTING_PARALLELISM = 200;

Review comment:
- I think it's fine to use the `DEFAULT_PARALLELISM`, i.e. `1500`, as the default. It is what we use in `FileSystemBackedTableMetadata` as well.
- We should add a method here to get this configuration, just like all other configurations.

## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala

@@ -0,0 +1,317 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.{HoodieMetadataConfig, SerializableConfiguration}
+import org.apache.hudi.common.engine.HoodieLocalEngineContext
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.{InternalRow, expressions}
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.avro.SchemaConverters
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BoundReference, Expression, InterpretedPredicate}
+import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils}
+import org.apache.spark.sql.execution.datasources.{FileIndex, PartitionDirectory, PartitionUtils}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * A file index which supports partition pruning for hoodie snapshot and read-optimized queries.
+ * Main steps to get the file list for a query:
+ * 1. Load all files and partition values from the table path.
+ * 2. Do the partition pruning using the partition filter condition.
+ *
+ * There are 3 cases for this:
+ * 1. If the number of partition columns equals the partition path depth, we read it as a
+ * partitioned table (e.g. the partition column is "dt" and the partition path is "2021-03-10").
+ *
+ * 2. If the number of partition columns does not equal the partition path depth, but there is
+ * exactly one partition column (e.g. the partition column is "dt" but the partition path is
+ * "2021/03/10", which is 3 directory levels deep), we can still read it as a partitioned table:
+ * we map the partition path (e.g. 2021/03/10) to the only partition column (e.g. "dt").
+ *
+ * 3. Otherwise, when the number of partition columns does not equal the partition directory depth
+ * and there is more than one partition column (e.g. the partition columns are "dt,hh" and the
+ * partition path is "2021/03/10/12"), we read it as a non-partitioned table, because we cannot
+ * know how to map the partition path to the partition columns in this case.
+ */
+case class HoodieFileIndex(
+    spark: SparkSession,
+    basePath: String,
+    schemaSpec: Option[StructType],
+    options: Map[String, String])
+  extends FileIndex with Logging {
+
+  @transient private val hadoopConf = spark.sessionState.newHadoopConf()
+  private lazy val metaClient = HoodieTableMetaClient
+    .builder().setConf(hadoopConf).setBasePath(basePath).build()
+
+  @transient private val queryPath = new Path(options.getOrElse("path", "'path' option required"))
+  /**
+   * Get the schema of the
[GitHub] [hudi] shenbinglife commented on issue #2689: [SUPPORT] Does a cow table support being writing by mor type? and a mor table support being writing cow type?
shenbinglife commented on issue #2689: URL: https://github.com/apache/hudi/issues/2689#issuecomment-804514259 Thanks
[GitHub] [hudi] shenbinglife closed issue #2689: [SUPPORT] Does a cow table support being writing by mor type? and a mor table support being writing cow type?
shenbinglife closed issue #2689: URL: https://github.com/apache/hudi/issues/2689
[GitHub] [hudi] garyli1019 commented on issue #2657: [SUPPORT] SparkSQL/Hive query fails if there are two or more record array fields in MOR table.
garyli1019 commented on issue #2657: URL: https://github.com/apache/hudi/issues/2657#issuecomment-804510311 Sorry for the delay, I will try to reproduce this once I finish the release.
[GitHub] [hudi] bvaradar commented on issue #2689: [SUPPORT] Does a cow table support being writing by mor type? and a mor table support being writing cow type?
bvaradar commented on issue #2689: URL: https://github.com/apache/hudi/issues/2689#issuecomment-804501991 You can think of MOR as a functional superset of a COW table. So, if you have an existing COW table, it should be straightforward to make it MOR by setting the table type in hoodie.properties. But the opposite migration is not straightforward, as we need to ensure there are no pending compactions before an MOR table can be converted to COW.
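As a sketch of the COW-to-MOR direction described above — assuming the standard `hoodie.table.type` key in the table's `.hoodie/hoodie.properties` file — the change amounts to editing one property:

```properties
# .hoodie/hoodie.properties — before
hoodie.table.type=COPY_ON_WRITE

# after (only safe in the COW -> MOR direction; going MOR -> COW
# first requires that no compactions are pending)
hoodie.table.type=MERGE_ON_READ
```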
[jira] [Commented] (HUDI-57) [UMBRELLA] Support ORC Storage
[ https://issues.apache.org/jira/browse/HUDI-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306671#comment-17306671 ]

mithalee mohapatra commented on HUDI-57:

Hi. I am planning to generate orc files from Hudi. Is this task still under development?

> [UMBRELLA] Support ORC Storage
> ------------------------------
>
>                 Key: HUDI-57
>                 URL: https://issues.apache.org/jira/browse/HUDI-57
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Hive Integration, Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Mani Jindal
>            Priority: Major
>              Labels: hudi-umbrellas, pull-request-available
>             Fix For: 0.8.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/uber/hudi/issues/68]
> https://github.com/uber/hudi/issues/155
[GitHub] [hudi] vinothchandar commented on issue #2672: [SUPPORT] Hang during MOR Upsert after a billion records
vinothchandar commented on issue #2672: URL: https://github.com/apache/hudi/issues/2672#issuecomment-804481423 @stackfun The JVM defaults to a stop-the-world collector, so without much memory you could also be facing high GC times. Some tuning options are listed here: https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide Is the job happy now?
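For readers hitting the same symptom, a hedged sketch of the GC angle mentioned above: swap in G1 and enable GC logging so you can see whether long pauses line up with the hangs. The flag values are illustrative assumptions rather than tuned recommendations, and the class/jar names are placeholders.

```shell
# Illustrative only: more executor memory, G1 instead of the default
# collector, and GC logs to confirm whether GC pauses are the real problem.
# com.example.MyHudiUpsertJob and my-hudi-job.jar are hypothetical names.
spark-submit \
  --conf spark.executor.memory=6g \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails" \
  --class com.example.MyHudiUpsertJob \
  my-hudi-job.jar
```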
[GitHub] [hudi] n3nash commented on pull request #2701: [HUDI 1623] New Hoodie Instant on disk format with end time and milliseconds granularity
n3nash commented on pull request #2701: URL: https://github.com/apache/hudi/pull/2701#issuecomment-804432752 @vinothchandar Can you take an early cursory look at this PR?
[GitHub] [hudi] nsivabalan commented on issue #2656: HUDI insert operation is working same as upsert
nsivabalan commented on issue #2656: URL: https://github.com/apache/hudi/issues/2656#issuecomment-804379000 I am not sure about the case sensitivity of the operation type. Can you try "insert" as the operation type instead of "INSERT"? From what I know, the "insert" operation should not update, but just add incoming records as new records.
[GitHub] [hudi] vburenin commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage
vburenin commented on issue #2692: URL: https://github.com/apache/hudi/issues/2692#issuecomment-804279383 Hm, you are using 1.x. I am on 2.1.x; 2.2 seems a little borked. In my case the gcs-connector is baked into the spark image.
[GitHub] [hudi] stackfun commented on issue #2672: [SUPPORT] Hang during MOR Upsert after a billion records
stackfun commented on issue #2672: URL: https://github.com/apache/hudi/issues/2672#issuecomment-804263668 Seems like I mitigated the hangs by changing the following spark config: `"spark.executor.memory": "6g"` And I changed the following hudi config: `"hoodie.memory.merge.fraction": "0.75"`
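For anyone landing here later, the two settings above can be combined into a single launch; a rough sketch under stated assumptions (the `--hoodie-conf` flag applies only if you launch through DeltaStreamer; in plain datasource code you would pass `hoodie.memory.merge.fraction` as a write option instead, and the jar name here is a placeholder):

```shell
# Sketch only: the executor memory bump plus the hudi merge-memory cap
# reported above, passed via DeltaStreamer's --hoodie-conf flag.
spark-submit \
  --conf spark.executor.memory=6g \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --hoodie-conf hoodie.memory.merge.fraction=0.75
# (remaining DeltaStreamer arguments such as --props/--target-base-path omitted)
```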
[GitHub] [hudi] stackfun commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage
stackfun commented on issue #2692: URL: https://github.com/apache/hudi/issues/2692#issuecomment-804261225 Yes, I am using the gcs connector. It seems to be configured automatically when submitting spark jobs through Dataproc on GCP. When using hudi-cli.sh on the master node of a Dataproc cluster, I export the following before running the script: `export CLIENT_JAR=/usr/local/share/google/dataproc/lib/gcs-connector.jar:/usr/local/share/google/dataproc/lib/gcs-connector-hadoop2-1.9.17.jar`
[GitHub] [hudi] vburenin commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage
vburenin commented on issue #2692: URL: https://github.com/apache/hudi/issues/2692#issuecomment-804242670 @stackfun What's your config setup to connect to GCS? In my case I use the gcs connector.
[GitHub] [hudi] vburenin edited a comment on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage
vburenin edited a comment on issue #2692: URL: https://github.com/apache/hudi/issues/2692#issuecomment-804220408 @nsivabalan Nope, but there are huge data losses with hudi 0.5.0 with MoR. I haven't tried a MoR table with 0.7.0, only CoW.
[GitHub] [hudi] gopi-t2s commented on issue #2406: [SUPPORT] HoodieMultiTableDeltastreamer - Bypassing SchemaProvider-Class requirement for ParquetDFS
gopi-t2s commented on issue #2406: URL: https://github.com/apache/hudi/issues/2406#issuecomment-804192510 Thanks @nsivabalan for confirming.
[GitHub] [hudi] nsivabalan commented on issue #2657: [SUPPORT] SparkSQL/Hive query fails if there are two or more record array fields in MOR table.
nsivabalan commented on issue #2657: URL: https://github.com/apache/hudi/issues/2657#issuecomment-804179974 @garyli1019: when you get time, I would appreciate it if you could follow up on this.
[GitHub] [hudi] nsivabalan closed issue #2664: [SUPPORT] Spark empty dataframe problem
nsivabalan closed issue #2664: URL: https://github.com/apache/hudi/issues/2664
[GitHub] [hudi] nsivabalan commented on issue #2675: [SUPPORT] Unable to query MOR table after schema evolution
nsivabalan commented on issue #2675: URL: https://github.com/apache/hudi/issues/2675#issuecomment-804178486 You can add null as the default value for your new field if that would work for you.
[GitHub] [hudi] nsivabalan commented on issue #2675: [SUPPORT] Unable to query MOR table after schema evolution
nsivabalan commented on issue #2675: URL: https://github.com/apache/hudi/issues/2675#issuecomment-804177670 Yeah, hudi just relies on Avro's schema compatibility in general. From the [specification](http://avro.apache.org/docs/current/spec.html#Schema+Resolution), it looks like adding a new field without a default will error out:
```
if the reader's record schema has a field with no default value, and writer's schema does not have a field with the same name, an error is signalled.
```
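To make the quoted rule concrete, here is a minimal (hypothetical) shape for an evolved field that stays readable: the new field is declared as a nullable union with an explicit `"default": null`, so readers resolving records written before the field existed can fill it in. The record and field names are made up for illustration.

```json
{
  "type": "record",
  "name": "example_record",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "new_field", "type": ["null", "string"], "default": null }
  ]
}
```

Note that Avro requires a union's default to match the union's first branch, so `"null"` must come first in the union for `"default": null` to be valid.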
[GitHub] [hudi] nsivabalan commented on issue #2664: [SUPPORT] Spark empty dataframe problem
nsivabalan commented on issue #2664: URL: https://github.com/apache/hudi/issues/2664#issuecomment-804178816 thanks!
[GitHub] [hudi] nsivabalan commented on issue #2680: [SUPPORT]Hive sync error by using run_sync_tool.sh
nsivabalan commented on issue #2680: URL: https://github.com/apache/hudi/issues/2680#issuecomment-804134521 @n3nash: Can you help here?
[GitHub] [hudi] nsivabalan commented on issue #2688: [SUPPORT] Sync to Hive using Metastore
nsivabalan commented on issue #2688: URL: https://github.com/apache/hudi/issues/2688#issuecomment-804132641 @n3nash: Can you help here, or loop in someone who has experience with the Hive metastore?
[GitHub] [hudi] nsivabalan commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage
nsivabalan commented on issue #2692: URL: https://github.com/apache/hudi/issues/2692#issuecomment-804132173 @vburenin: Do you know of any such issues?
[GitHub] [hudi] nsivabalan commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage
nsivabalan commented on issue #2692: URL: https://github.com/apache/hudi/issues/2692#issuecomment-804131933 Interesting. I will have to try it out locally to reproduce. Will keep you posted; thanks for reporting.
[GitHub] [hudi] liujinhui1994 closed pull request #2706: [RFC-20][HUDI-648]ERROR TABLE TEST CI
liujinhui1994 closed pull request #2706: URL: https://github.com/apache/hudi/pull/2706
[GitHub] [hudi] garyli1019 closed pull request #2706: [RFC-20][HUDI-648]ERROR TABLE TEST CI
garyli1019 closed pull request #2706: URL: https://github.com/apache/hudi/pull/2706
[GitHub] [hudi] nsivabalan commented on issue #2637: [SUPPORT] - Partial Update : update few columns of a table
nsivabalan commented on issue #2637: URL: https://github.com/apache/hudi/issues/2637#issuecomment-804036829 What was the issue or fix? Do you mind updating it here?
[GitHub] [hudi] nsivabalan commented on issue #2406: [SUPPORT] HoodieMultiTableDeltastreamer - Bypassing SchemaProvider-Class requirement for ParquetDFS
nsivabalan commented on issue #2406: URL: https://github.com/apache/hudi/issues/2406#issuecomment-804036242 Yes, as you can see from the commit, it was merged 2 to 3 weeks back. We have an upcoming release in a week or two, so you should have it in 0.8.0. If you want to verify the fix, you can pull in latest master and try it out. 0.7.0 does not have this fix.
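If you do want to verify against master before 0.8.0 lands, building the utilities bundle locally is a standard Maven build; a sketch (the module path is assumed from the repo layout at the time, so verify it locally):

```shell
# Sketch: build hudi from latest master and pick up the utilities bundle jar.
git clone https://github.com/apache/hudi.git
cd hudi
mvn clean package -DskipTests
# The bundle jar is expected under the packaging module's target/ directory.
ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_*.jar
```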
[GitHub] [hudi] gopi-t2s commented on issue #2406: [SUPPORT] HoodieMultiTableDeltastreamer - Bypassing SchemaProvider-Class requirement for ParquetDFS
gopi-t2s commented on issue #2406: URL: https://github.com/apache/hudi/issues/2406#issuecomment-804020555 Hi @nsivabalan, @SureshK-T2S and I are working together to set up the multi table delta streamer. I downloaded the latest version (0.7.0) of hudi-utilities-bundle.jar from the Maven repository (https://mvnrepository.com/artifact/org.apache.hudi/hudi-utilities-bundle_2.11/0.7.0) and tried to run the spark-submit multi table delta streamer command without providing the schema provider class (hoping this is no longer mandatory after fix #2577). But I am still receiving the same error mentioned above by Suresh.

**ERROR LOG:**

```
Exception in thread "main" java.lang.NullPointerException
	at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.populateSchemaProviderProps(HoodieMultiTableDeltaStreamer.java:150)
	at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.populateTableExecutionContextList(HoodieMultiTableDeltaStreamer.java:130)
	at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.<init>(HoodieMultiTableDeltaStreamer.java:80)
	at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.main(HoodieMultiTableDeltaStreamer.java:203)
```

**SPARK SUBMIT COMMAND**

```
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer \
  ~/hudi/hudi-utilities-bundle_2.11-0.7.0.jar \
  --table-type COPY_ON_WRITE \
  --props s3://path/s3_source.properties \
  --config-folder s3://folder-path \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --source-ordering-field updated_at \
  --base-path-prefix s3://object \
  --target-table dummy_table \
  --op UPSERT
```

Am I missing anything here, or is the above PR not merged into the 0.7.0 Maven jar? Could you share your thoughts here? Thank you.
[GitHub] [hudi] Sugamber commented on issue #2637: [SUPPORT] - Partial Update : update few columns of a table
Sugamber commented on issue #2637: URL: https://github.com/apache/hudi/issues/2637#issuecomment-804002453 I'm able to resolve the class-not-found exception.
[GitHub] [hudi] liujinhui1994 opened a new pull request #2706: [RFC-20][HUDI-648]WIP
liujinhui1994 opened a new pull request #2706: URL: https://github.com/apache/hudi/pull/2706

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

*(For example: This pull request adds quick-start document.)*

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] danny0405 closed pull request #2702: [HUDI-1710] Read optimized query type for Flink batch reader
danny0405 closed pull request #2702: URL: https://github.com/apache/hudi/pull/2702
[GitHub] [hudi] danny0405 commented on pull request #2702: [HUDI-1710] Read optimized query type for Flink batch reader
danny0405 commented on pull request #2702: URL: https://github.com/apache/hudi/pull/2702#issuecomment-803993688 Close and reopen to re-trigger the CI tests.
[GitHub] [hudi] codecov-io edited a comment on pull request #2702: [HUDI-1710] Read optimized query type for Flink batch reader
codecov-io edited a comment on pull request #2702: URL: https://github.com/apache/hudi/pull/2702#issuecomment-803982515

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2702?src=pr=h1) Report
> Merging [#2702](https://codecov.io/gh/apache/hudi/pull/2702?src=pr=desc) (701c84d) into [master](https://codecov.io/gh/apache/hudi/commit/ce3e8ec87083ef4cd4f33de39b6697f66ff3f277?el=desc) (ce3e8ec) will **decrease** coverage by `0.03%`.
> The diff coverage is `50.00%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2702/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2702?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master    #2702      +/-   ##
============================================
- Coverage     51.76%   51.73%   -0.04%
+ Complexity     3602     3601       -1
  Files           476      476
  Lines         22579    22592      +13
  Branches       2408     2409       +1
- Hits          11688    11687       -1
- Misses         9874     9886      +12
- Partials       1017     1019       +2
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `50.92% <ø> (-0.01%)` | `0.00 <ø> (ø)` | |
| hudiflink | `54.13% <50.00%> (-0.15%)` | `0.00 <0.00> (ø)` | |
| hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `70.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisync | `45.58% <ø> (-0.12%)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiutilities | `69.73% <ø> (-0.06%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2702?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [.../java/org/apache/hudi/table/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNvdXJjZS5qYXZh) | `61.44% <50.00%> (-3.53%)` | `28.00 <0.00> (ø)` | |
| [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `71.37% <0.00%> (-0.35%)` | `55.00% <0.00%> (-1.00%)` | |
| [...g/apache/hudi/common/config/LockConfiguration.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Mb2NrQ29uZmlndXJhdGlvbi5qYXZh) | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
| [...rg/apache/hudi/hive/HiveMetastoreLockProvider.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZU1ldGFzdG9yZUxvY2tQcm92aWRlci5qYXZh) | | | |
| [...ache/hudi/hive/HiveMetastoreBasedLockProvider.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZU1ldGFzdG9yZUJhc2VkTG9ja1Byb3ZpZGVyLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (?%)` | |
[GitHub] [hudi] codecov-io commented on pull request #2702: [HUDI-1710] Read optimized query type for Flink batch reader
codecov-io commented on pull request #2702: URL: https://github.com/apache/hudi/pull/2702#issuecomment-803982515

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2702?src=pr=h1) Report
> Merging [#2702](https://codecov.io/gh/apache/hudi/pull/2702?src=pr=desc) (701c84d) into [master](https://codecov.io/gh/apache/hudi/commit/ce3e8ec87083ef4cd4f33de39b6697f66ff3f277?el=desc) (ce3e8ec) will **decrease** coverage by `1.51%`.
> The diff coverage is `50.00%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2702/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2702?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master    #2702      +/-   ##
============================================
- Coverage     51.76%   50.24%   -1.52%
+ Complexity     3602     3216     -386
  Files           476      418      -58
  Lines         22579    19411    -3168
  Branches       2408     2050     -358
- Hits          11688     9754    -1934
+ Misses         9874     8831    -1043
+ Partials       1017      826     -191
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `50.92% <ø> (-0.01%)` | `0.00 <ø> (ø)` | |
| hudiflink | `54.13% <50.00%> (-0.15%)` | `0.00 <0.00> (ø)` | |
| hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.73% <ø> (-0.06%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2702?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [.../java/org/apache/hudi/table/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNvdXJjZS5qYXZh) | `61.44% <50.00%> (-3.53%)` | `28.00 <0.00> (ø)` | |
| [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `71.37% <0.00%> (-0.35%)` | `55.00% <0.00%> (-1.00%)` | |
| [...g/apache/hudi/common/config/LockConfiguration.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Mb2NrQ29uZmlndXJhdGlvbi5qYXZh) | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
| [.../hive/SlashEncodedHourPartitionValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvU2xhc2hFbmNvZGVkSG91clBhcnRpdGlvblZhbHVlRXh0cmFjdG9yLmphdmE=) | | | |
| [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh) | | | |
| [...va/org/apache/hudi/hive/util/ColumnNameXLator.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvdXRpbC9Db2x1bW5OYW1lWExhdG9yLmphdmE=) | | | |
| [...rg/apache/hudi/hive/HiveMetastoreLockProvider.java](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZU1ldGFzdG9yZUxvY2tQcm92aWRlci5qYXZh) | | | |
| [...g/apache/hudi/MergeOnReadIncrementalRelation.scala](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL01lcmdlT25SZWFkSW5jcmVtZW50YWxSZWxhdGlvbi5zY2FsYQ==) | | | |
| [...in/scala/org/apache/hudi/HoodieEmptyRelation.scala](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZUVtcHR5UmVsYXRpb24uc2NhbGE=) | | | |
| [...in/scala/org/apache/hudi/IncrementalRelation.scala](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0luY3JlbWVudGFsUmVsYXRpb24uc2NhbGE=) | | | |
| ... and [51 more](https://codecov.io/gh/apache/hudi/pull/2702/diff?src=pr=tree-more) | |
[GitHub] [hudi] cdmikechen opened a new issue #2705: [SUPPORT] Can not read data schema using Spark3.0.2 on k8s with hudi-utilities (build in 2.12 and spark3)
cdmikechen opened a new issue #2705: URL: https://github.com/apache/hudi/issues/2705

**Describe the problem you faced**

I use the spark operator on OpenShift 4.6 to receive Kafka data and insert the data into a hudi table. I use `hudi-utilities_2.12` (a Maven build for Scala 2.12 and Spark 3), and use Debezium to read the MySQL binlog. When Spark reads the Kafka data, it shows the error in the *Stacktrace* section. I don't know if this is a bug in hudi 0.7.0 with Spark 3, or if Spark 3 has a problem with the structure of the Avro schema. The same program runs fine on Hudi 0.6.0 with Spark on YARN.

The Debezium Avro schema is:

```json
[{
  "type": "record",
  "name": "hoodie_source",
  "namespace": "hoodie.source",
  "fields": [{
    "name": "before",
    "type": [{
      "type": "record",
      "name": "before",
      "namespace": "hoodie.source.hoodie_source",
      "fields": [
        { "name": "id", "type": "int" },
        { "name": "name", "type": ["string", "null"] },
        { "name": "type", "type": ["string", "null"] },
        { "name": "url", "type": ["string", "null"] },
        { "name": "user", "type": ["string", "null"] },
        { "name": "password", "type": ["string", "null"] },
        { "name": "create_time", "type": ["string", "null"] },
        { "name": "create_user", "type": ["string", "null"] },
        { "name": "update_time", "type": ["string", "null"] },
        { "name": "update_user", "type": ["string", "null"] },
        { "name": "del_flag", "type": ["int", "null"] }
      ]
    }, "null"]
  }, {
    "name": "after",
    "type": [{
      "type": "record",
      "name": "after",
      "namespace": "hoodie.source.hoodie_source",
      "fields": [
        { "name": "id", "type": "int" },
        { "name": "name", "type": ["string", "null"] },
        { "name": "type", "type": ["string", "null"] },
        { "name": "url", "type": ["string", "null"] },
        { "name": "user", "type": ["string", "null"] },
        { "name": "password", "type": ["string", "null"] },
        { "name": "create_time", "type": ["string", "null"] },
        { "name": "create_user", "type": ["string", "null"] },
        { "name": "update_time", "type": ["string", "null"] },
        { "name": "update_user", "type": ["string", "null"] },
        { "name": "del_flag", "type": ["int", "null"] }
      ]
    }, "null"]
  }, {
    "name": "source",
    "type": {
      "type": "record",
      "name": "source",
      "namespace": "hoodie.source.hoodie_source",
      "fields": [
        { "name": "version", "type": "string" },
        { "name": "connector", "type": "string" },
        { "name": "name", "type": "string" },
        {
```
[GitHub] [hudi] liujinhui1994 closed pull request #2704: [RFC-20] WIP
liujinhui1994 closed pull request #2704: URL: https://github.com/apache/hudi/pull/2704
[GitHub] [hudi] liujinhui1994 opened a new pull request #2704: [RFC-20] WIP
liujinhui1994 opened a new pull request #2704: URL: https://github.com/apache/hudi/pull/2704

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

*(For example: This pull request adds quick-start document.)*

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] n3nash merged pull request #2699: [HUDI-1709] Improving lock config names and adding hive metastore uri config
n3nash merged pull request #2699: URL: https://github.com/apache/hudi/pull/2699
[hudi] branch master updated: [HUDI-1709] Improving config names and adding hive metastore uri config (#2699)
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new d7b1878  [HUDI-1709] Improving config names and adding hive metastore uri config (#2699)
d7b1878 is described below

commit d7b18783bdd6edd6355ee68714982401d3321f86
Author: n3nash
AuthorDate: Mon Mar 22 01:22:06 2021 -0700

    [HUDI-1709] Improving config names and adding hive metastore uri config (#2699)
---
 .../transaction/lock/ZookeeperBasedLockProvider.java     |  3 ++-
 .../java/org/apache/hudi/config/HoodieLockConfig.java    | 15 +++
 .../org/apache/hudi/common/config/LockConfiguration.java | 11 +++
 .../hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java |  2 +-
 ...Provider.java => HiveMetastoreBasedLockProvider.java} | 16 +++-
 ...ider.java => TestHiveMetastoreBasedLockProvider.java} | 10 +-
 .../utilities/functional/TestHoodieDeltaStreamer.java    |  2 +-
 7 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/ZookeeperBasedLockProvider.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/ZookeeperBasedLockProvider.java
index 60336c5..8a80685 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/ZookeeperBasedLockProvider.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/ZookeeperBasedLockProvider.java
@@ -39,6 +39,7 @@ import java.util.concurrent.TimeUnit;
 import static org.apache.hudi.common.config.LockConfiguration.DEFAULT_ZK_CONNECTION_TIMEOUT_MS;
 import static org.apache.hudi.common.config.LockConfiguration.DEFAULT_ZK_SESSION_TIMEOUT_MS;
 import static org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_NUM_RETRIES_PROP;
+import static org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_RETRY_MAX_WAIT_TIME_IN_MILLIS_PROP;
 import static org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS_PROP;
 import static org.apache.hudi.common.config.LockConfiguration.ZK_BASE_PATH_PROP;
 import static org.apache.hudi.common.config.LockConfiguration.ZK_CONNECTION_TIMEOUT_MS_PROP;
@@ -65,7 +66,7 @@ public class ZookeeperBasedLockProvider implements LockProvider {

+public class HiveMetastoreBasedLockProvider implements LockProvider {
-  private static final Logger LOG = LogManager.getLogger(HiveMetastoreLockProvider.class);
+  private static final Logger LOG = LogManager.getLogger(HiveMetastoreBasedLockProvider.class);
   private final String databaseName;
   private final String tableName;
+  private final String hiveMetastoreUris;
   private IMetaStoreClient hiveClient;
   private volatile LockResponse lock = null;
   protected LockConfiguration lockConfiguration;
   ExecutorService executor = Executors.newSingleThreadExecutor();

-  public HiveMetastoreLockProvider(final LockConfiguration lockConfiguration, final Configuration conf) {
+  public HiveMetastoreBasedLockProvider(final LockConfiguration lockConfiguration, final Configuration conf) {
     this(lockConfiguration);
     try {
       HiveConf hiveConf = new HiveConf();
@@ -91,16 +93,17 @@ public class HiveMetastoreLockProvider implements LockProvider {
     }
   }

-  public HiveMetastoreLockProvider(final LockConfiguration lockConfiguration, final IMetaStoreClient metaStoreClient) {
+  public HiveMetastoreBasedLockProvider(final LockConfiguration lockConfiguration, final IMetaStoreClient metaStoreClient) {
     this(lockConfiguration);
     this.hiveClient = metaStoreClient;
   }

-  HiveMetastoreLockProvider(final LockConfiguration lockConfiguration) {
+  HiveMetastoreBasedLockProvider(final LockConfiguration lockConfiguration) {
     checkRequiredProps(lockConfiguration);
     this.lockConfiguration = lockConfiguration;
     this.databaseName = this.lockConfiguration.getConfig().getString(HIVE_DATABASE_NAME_PROP);
     this.tableName = this.lockConfiguration.getConfig().getString(HIVE_TABLE_NAME_PROP);
+    this.hiveMetastoreUris = this.lockConfiguration.getConfig().getOrDefault(HIVE_METASTORE_URI_PROP, "").toString();
   }

   @Override
@@ -206,6 +209,9 @@ public class HiveMetastoreLockProvider implements LockProvider {
   }

   private void setHiveLockConfs(HiveConf hiveConf) {
+    if (!StringUtils.isNullOrEmpty(this.hiveMetastoreUris)) {
+      hiveConf.setVar(HiveConf.ConfVars.METASTOREURIS, this.hiveMetastoreUris);
+    }
     hiveConf.set("hive.support.concurrency", "true");
     hiveConf.set("hive.lock.manager", "org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager");
     hiveConf.set("hive.lock.numretries", lockConfiguration.getConfig().getString(LOCK_ACQUIRE_NUM_RETRIES_PROP));
diff --git
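The core pattern in this commit is reading an optional metastore-URI setting with an empty-string default, and only overriding `HiveConf.ConfVars.METASTOREURIS` when the user actually set a value. A minimal, self-contained sketch of that pattern (not the Hudi API itself; the class, method names, and property key below are illustrative assumptions):

```java
import java.util.Properties;

// Sketch of the optional-config pattern from the diff: getOrDefault with an
// empty-string fallback, plus a null/empty guard before applying the override.
public class LockConfigSketch {
    // Assumed key name for illustration; the real constant lives in
    // org.apache.hudi.common.config.LockConfiguration.
    static final String HIVE_METASTORE_URI_PROP = "hoodie.writer.lock.hivemetastore.uris";

    // Mirrors: lockConfiguration.getConfig().getOrDefault(HIVE_METASTORE_URI_PROP, "").toString()
    static String resolveMetastoreUris(Properties config) {
        return config.getOrDefault(HIVE_METASTORE_URI_PROP, "").toString();
    }

    // Mirrors the !StringUtils.isNullOrEmpty(...) guard in setHiveLockConfs:
    // only override METASTOREURIS when a non-empty value was configured.
    static boolean shouldOverrideMetastoreUris(String uris) {
        return uris != null && !uris.isEmpty();
    }

    public static void main(String[] args) {
        Properties withUri = new Properties();
        withUri.setProperty(HIVE_METASTORE_URI_PROP, "thrift://metastore:9083");
        System.out.println(shouldOverrideMetastoreUris(resolveMetastoreUris(withUri)));       // true
        System.out.println(shouldOverrideMetastoreUris(resolveMetastoreUris(new Properties()))); // false
    }
}
```

The empty-string default keeps existing deployments unchanged: when the property is absent, the provider falls back to whatever `HiveConf` picks up from the environment.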
[hudi] branch asf-site updated: [HUDI-1679] Concurrency Control in Hudi (#2698)
nagarwal pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new e5caa41  [HUDI-1679] Concurrency Control in Hudi (#2698)
e5caa41 is described below

commit e5caa41de16cfe3213e638237e0dff46ddf1bf96
Author: n3nash
AuthorDate: Mon Mar 22 01:20:15 2021 -0700

    [HUDI-1679] Concurrency Control in Hudi (#2698)
---
 docs/_data/navigation.yml             |   2 +
 docs/_docs/2_4_configurations.md      |  63 ++
 docs/_docs/2_9_concurrency_control.md | 151 ++
 3 files changed, 216 insertions(+)

diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml
index 5803a43..114bed3 100644
--- a/docs/_data/navigation.yml
+++ b/docs/_data/navigation.yml
@@ -28,6 +28,8 @@ docs:
     url: /docs/use_cases.html
   - title: "Writing Data"
     url: /docs/writing_data.html
+  - title: "Concurrency Control"
+    url: /docs/concurrency_control.html
   - title: "Querying Data"
     url: /docs/querying_data.html
   - title: "Configuration"
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index ec35e64..e176550 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -824,3 +824,66 @@ Property: `hoodie.write.commit.callback.kafka.acks`
 # CALLBACK_KAFKA_RETRIES
 Property: `hoodie.write.commit.callback.kafka.retries`
 Times to retry. 3 by default
+
+### Locking configs
+Configs that control locking mechanisms if [WriteConcurrencyMode=optimistic_concurrency_control](#WriteConcurrencyMode) is enabled
+[withLockConfig](#withLockConfig) (HoodieLockConfig)
+
+withLockProvider(lockProvider = org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider) {#withLockProvider}
+Property: `hoodie.writer.lock.provider`
+Lock provider class name, user can provide their own implementation of LockProvider which should be subclass of org.apache.hudi.common.lock.LockProvider
+
+withZkQuorum(zkQuorum) {#withZkQuorum}
+Property: `hoodie.writer.lock.zookeeper.url`
+Set the list of comma separated servers to connect to
+
+withZkBasePath(zkBasePath) {#withZkBasePath}
+Property: `hoodie.writer.lock.zookeeper.base_path` [Required]
+The base path on Zookeeper under which to create a ZNode to acquire the lock. This should be common for all jobs writing to the same table
+
+withZkPort(zkPort) {#withZkPort}
+Property: `hoodie.writer.lock.zookeeper.port` [Required]
+The connection port to be used for Zookeeper
+
+withZkLockKey(zkLockKey) {#withZkLockKey}
+Property: `hoodie.writer.lock.zookeeper.lock_key` [Required]
+Key name under base_path at which to create a ZNode and acquire lock. Final path on zk will look like base_path/lock_key. We recommend setting this to the table name
+
+withZkConnectionTimeoutInMs(connectionTimeoutInMs = 15000) {#withZkConnectionTimeoutInMs}
+Property: `hoodie.writer.lock.zookeeper.connection_timeout_ms`
+How long to wait when connecting to ZooKeeper before considering the connection a failure
+
+withZkSessionTimeoutInMs(sessionTimeoutInMs = 6) {#withZkSessionTimeoutInMs}
+Property: `hoodie.writer.lock.zookeeper.session_timeout_ms`
+How long to wait after losing a connection to ZooKeeper before the session is expired
+
+withNumRetries(num_retries = 3) {#withNumRetries}
+Property: `hoodie.writer.lock.num_retries`
+Maximum number of times to retry by lock provider client
+
+withRetryWaitTimeInMillis(retryWaitTimeInMillis = 5000) {#withRetryWaitTimeInMillis}
+Property: `hoodie.writer.lock.wait_time_ms_between_retry`
+Initial amount of time to wait between retries by lock provider client
+
+withHiveDatabaseName(hiveDatabaseName) {#withHiveDatabaseName}
+Property: `hoodie.writer.lock.hivemetastore.database` [Required]
+The Hive database to acquire lock against
+
+withHiveTableName(hiveTableName) {#withHiveTableName}
+Property: `hoodie.writer.lock.hivemetastore.table` [Required]
+The Hive table under the hive database to acquire lock against
+
+withClientNumRetries(clientNumRetries = 0) {#withClientNumRetries}
+Property: `hoodie.writer.lock.client.num_retries`
+Maximum number of times to retry to acquire lock additionally from the hudi client
+
+withRetryWaitTimeInMillis(retryWaitTimeInMillis = 1) {#withRetryWaitTimeInMillis}
+Property: `hoodie.writer.lock.client.wait_time_ms_between_retry`
+Amount of time to wait between retries from the hudi client
+
+withConflictResolutionStrategy(lockProvider = org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy) {#withConflictResolutionStrategy}
+Property: `hoodie.writer.lock.conflict.resolution.strategy`
+Lock provider class name, this should be subclass of
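The ZooKeeper-based lock settings documented in this commit can be assembled as a plain key/value bundle. A minimal sketch using only the property keys listed above; the values are illustrative placeholders, not Hudi defaults:

```java
import java.util.Properties;

// Illustrative bundle of the ZookeeperBasedLockProvider properties from the
// docs above. Values such as the quorum, port, and lock key are placeholders
// that a real deployment would replace.
public class OptimisticLockProps {
    static Properties zookeeperLockProps() {
        Properties props = new Properties();
        props.setProperty("hoodie.writer.lock.provider",
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
        props.setProperty("hoodie.writer.lock.zookeeper.url", "zk1,zk2,zk3");      // comma separated servers
        props.setProperty("hoodie.writer.lock.zookeeper.port", "2181");            // [Required]
        props.setProperty("hoodie.writer.lock.zookeeper.base_path", "/hudi/locks");// [Required] shared across jobs
        props.setProperty("hoodie.writer.lock.zookeeper.lock_key", "my_table");    // [Required] zk path = base_path/lock_key
        props.setProperty("hoodie.writer.lock.num_retries", "3");
        props.setProperty("hoodie.writer.lock.wait_time_ms_between_retry", "5000");
        return props;
    }

    public static void main(String[] args) {
        // Docs recommend the table name as the lock key, so concurrent writers
        // to the same table contend on the same ZNode.
        System.out.println(zookeeperLockProps().getProperty("hoodie.writer.lock.zookeeper.lock_key"));
    }
}
```

Passing such a bundle to a writer only takes effect when `WriteConcurrencyMode=optimistic_concurrency_control` is enabled, per the docs above.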
[GitHub] [hudi] n3nash merged pull request #2698: [HUDI-1679] Concurrency Control in Hudi
n3nash merged pull request #2698: URL: https://github.com/apache/hudi/pull/2698 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] n3nash commented on pull request #2698: [HUDI-1679] Concurrency Control in Hudi
n3nash commented on pull request #2698: URL: https://github.com/apache/hudi/pull/2698#issuecomment-803862184 @nsivabalan Yes, I already built it locally to confirm rendering. Attaching pictures for future reference: https://user-images.githubusercontent.com/2722167/111960187-a5775000-8aac-11eb-8c65-91dea58f4f7f.png https://user-images.githubusercontent.com/2722167/111960199-a9a36d80-8aac-11eb-8265-5a3edcecf44a.png
[GitHub] [hudi] youngyangp closed issue #2700: [SUPPORT] Hive sync error by using DataFrameWriter
youngyangp closed issue #2700: URL: https://github.com/apache/hudi/issues/2700
[GitHub] [hudi] legendtkl opened a new pull request #2703: [DOCUMENT] update README doc for integ test
legendtkl opened a new pull request #2703: URL: https://github.com/apache/hudi/pull/2703