[GitHub] [hudi] bvaradar commented on pull request #2084: [HUDI-802] AWSDmsTransformer does not handle insert and delete of a row in a single batch correctly
bvaradar commented on pull request #2084: URL: https://github.com/apache/hudi/pull/2084#issuecomment-690863065

@nsivabalan : Can you please review this? Thanks, Balaji.V

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bvaradar opened a new pull request #2084: [HUDI-802] AWSDmsTransformer does not handle insert and delete of a row in a single batch correctly
bvaradar opened a new pull request #2084: URL: https://github.com/apache/hudi/pull/2084
[jira] [Reopened] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly
[ https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Balaji Varadarajan reopened HUDI-802:

AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly
Key: HUDI-802
URL: https://issues.apache.org/jira/browse/HUDI-802
Project: Apache Hudi
Issue Type: Bug
Components: DeltaStreamer
Reporter: Christopher Weaver
Assignee: Balaji Varadarajan
Priority: Blocker
Labels: pull-request-available
Fix For: 0.6.0

The provided AWSDmsAvroPayload class ([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java]) currently handles updates where the "Op" column is a "D", and successfully removes the row from the resulting table.

However, when an insert is quickly followed by a delete of the same row (e.g. DMS processes them together and puts both change records in the same parquet file), the row incorrectly appears in the resulting table. In this case the record is not yet in the table, so getInsertValue is called rather than combineAndGetUpdateValue. Since the logic that checks for a delete lives in combineAndGetUpdateValue, it is skipped and the delete is missed. Something like this could fix the issue: [https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].

-- This message was sent by Atlassian Jira (v8.3.4#803005)
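The failure mode described in the ticket can be sketched in isolation: the fix is to apply the same "Op == D" check in getInsertValue that combineAndGetUpdateValue already performs. Below is a minimal, self-contained Java sketch; it uses a plain Map in place of Hudi's Avro GenericRecord, so the class and method shapes are illustrative, not Hudi's actual signatures.

```java
import java.util.Map;
import java.util.Optional;

// Simplified stand-in for AWSDmsAvroPayload: a DMS change record is a map of
// column values, where the "Op" column carries I/U/D (insert/update/delete).
class DmsPayloadSketch {
    private final Map<String, Object> record;

    DmsPayloadSketch(Map<String, Object> record) {
        this.record = record;
    }

    private boolean isDelete() {
        return "D".equals(record.get("Op"));
    }

    // Called when the key already exists in the table: the delete check lives
    // here, so an insert in one batch followed by a delete in a later batch works.
    Optional<Map<String, Object>> combineAndGetUpdateValue(Map<String, Object> current) {
        return isDelete() ? Optional.empty() : Optional.of(record);
    }

    // Called when the key is NOT yet in the table. Without the isDelete() check
    // (the bug), an insert+delete arriving in the same batch resurfaces as an insert.
    Optional<Map<String, Object>> getInsertValue() {
        return isDelete() ? Optional.empty() : Optional.of(record);
    }
}
```

In Hudi itself the corresponding change is to make AWSDmsAvroPayload.getInsertValue return an empty Option when the "Op" field of the Avro record is "D", mirroring the existing combineAndGetUpdateValue logic.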
[jira] [Updated] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly
[ https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Balaji Varadarajan updated HUDI-802:
Fix Version/s: (was: 0.6.0) 0.6.1
[GitHub] [hudi] yanghua closed pull request #2058: [HUDI-1259] Cache some framework binaries to speed up the progress of building docker image in local env
yanghua closed pull request #2058: URL: https://github.com/apache/hudi/pull/2058
[GitHub] [hudi] yanghua commented on pull request #2058: [HUDI-1259] Cache some framework binaries to speed up the progress of building docker image in local env
yanghua commented on pull request #2058: URL: https://github.com/apache/hudi/pull/2058#issuecomment-690838645

> @yanghua : You can look at the docker compose file and https://github.com/apache/hudi/blob/master/docker/setup_demo.sh We mount hudi workspace inside docker to achieve it.

Got it. IIUC, you mean:

```
volumes:
  - ${HUDI_WS}:/var/hoodie/ws
```

> I guess, we can close this PR then ?

Yes. Actually, my colleagues do not know about this either; we rarely use Docker. IMO, it would be better to describe it in the documentation. WDYT?
[jira] [Commented] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly
[ https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193958#comment-17193958 ]

Balaji Varadarajan commented on HUDI-802: Thanks [~Weves] for the ticket. I will get your code changes landed.
[jira] [Assigned] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly
[ https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Balaji Varadarajan reassigned HUDI-802: Assignee: Balaji Varadarajan (was: sivabalan narayanan)
[GitHub] [hudi] hj2016 commented on a change in pull request #2078: [MINOR]Add clinbrain to powered by page
hj2016 commented on a change in pull request #2078: URL: https://github.com/apache/hudi/pull/2078#discussion_r486733426

## File path: docs/_docs/1_4_powered_by.md ##

@@ -28,6 +29,9 @@ offering real-time analysis on hudi dataset.
 Amazon Web Services is the World's leading cloud services provider. Apache Hudi is [pre-installed](https://aws.amazon.com/emr/features/hudi/) with the AWS Elastic Map Reduce offering, providing means for AWS users to perform record-level updates/deletes and manage storage efficiently.
+### Clinbrain
+[Clinbrain](https://www.clinbrain.com/) is the leading of big data platform on medical industry, we have built 200 medical big data centers by integrating Hudi Data Lake solution in numerous hospitals, hudi provides the abablility to upsert and deletes on hdfs, at the same time, it can make the fresh data-stream up-to-date effcienctlly in hadoop system with the hudi incremental view.

Review comment: I modified it.
[GitHub] [hudi] rafaelhbarros opened a new issue #2083: Kafka readStream performance slow [SUPPORT]
rafaelhbarros opened a new issue #2083: URL: https://github.com/apache/hudi/issues/2083

**Describe the problem you faced**

I have a Kafka topic that produces 1-2 million records per minute. I'm trying to write these records to S3 in the Hudi format, but I can't get the job to keep up with the input. I'm running on EMR with an m5.xlarge driver and 3x c5.xlarge core instances. The data is serialized in Avro and deserialized with Schema Registry (using abris).

Spark submit command:

```
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --master yarn \
  --name hudi-consumer \
  --deploy-mode cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.task.maxFailures=10 \
  --conf spark.memory.fraction=0.4 \
  --conf spark.rdd.compress=true \
  --conf spark.kryoserializer.buffer.max=512m \
  --conf spark.memory.storageFraction=0.1 \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.driver.maxResultSize=3g \
  --conf spark.yarn.max.executor.failures=10 \
  --conf spark.file.partitions=10 \
  --conf spark.sql.shuffle.partitions=80 \
  --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:+ExitOnOutOfMemoryError" \
  --conf spark.driver.extraJavaOptions="-XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" \
  --driver-memory 4G \
  --executor-memory 5G \
  --executor-cores 4 \
  --num-executors 6 \
  --class
```

Hudi confs:

```
hoodie.combine.before.upsert=false
hoodie.bulkinsert.shuffle.parallelism=10
hoodie.insert.shuffle.parallelism=10
hoodie.upsert.shuffle.parallelism=10
hoodie.delete.shuffle.parallelism=1
TABLE_TYPE_OPT_KEY()=COW_TABLE_TYPE_OPT_VAL()
```

**Environment Description**

* Hudi version : 0.5.2-incubating
* Spark version : 2.4.4 (scala 2.12, emr 6.0.0)
* Hive version : N/A
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
[GitHub] [hudi] satishkotha commented on pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action
satishkotha commented on pull request #2048: URL: https://github.com/apache/hudi/pull/2048#issuecomment-690711545

@vinothchandar As discussed, I added a boolean in WriteStatus and removed HoodieReplaceStat. See this [diff](https://github.com/apache/hudi/pull/2048/commits/94b275dbd20ec82ebe568b47bb28447d92ab996f). I committed it as a separate git sha because this still looks somewhat awkward IMO. Please take a look, and I can revert or reimplement it a different way.

Also, created https://issues.apache.org/jira/browse/HUDI-1276 for cleaning replaced file groups during clean. I also renamed 'replace' to 'replacecommit' everywhere as you suggested. Please let me know if you have additional comments/suggestions.
[jira] [Updated] (HUDI-1266) Add e2e integration tests for replace and insert-overwrite
[ https://issues.apache.org/jira/browse/HUDI-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

satish updated HUDI-1266:
Fix Version/s: 0.7.0

Add e2e integration tests for replace and insert-overwrite
Key: HUDI-1266
URL: https://issues.apache.org/jira/browse/HUDI-1266
Project: Apache Hudi
Issue Type: Sub-task
Reporter: satish
Assignee: satish
Priority: Major
Fix For: 0.7.0
[jira] [Updated] (HUDI-1264) incremental read support with replace
[ https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

satish updated HUDI-1264:
Fix Version/s: 0.7.0

incremental read support with replace
Key: HUDI-1264
URL: https://issues.apache.org/jira/browse/HUDI-1264
Project: Apache Hudi
Issue Type: Sub-task
Reporter: satish
Assignee: satish
Priority: Major
Fix For: 0.7.0

In the initial version, we could fail incremental reads if there is a REPLACE instant.
[jira] [Updated] (HUDI-1262) Documentation Update for Insert Overwrite
[ https://issues.apache.org/jira/browse/HUDI-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

satish updated HUDI-1262:
Fix Version/s: 0.7.0

Documentation Update for Insert Overwrite
Key: HUDI-1262
URL: https://issues.apache.org/jira/browse/HUDI-1262
Project: Apache Hudi
Issue Type: Sub-task
Reporter: satish
Assignee: satish
Priority: Major
Fix For: 0.7.0
[jira] [Updated] (HUDI-1261) CLI tools update to support REPLACE and insert overwrite
[ https://issues.apache.org/jira/browse/HUDI-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

satish updated HUDI-1261:
Fix Version/s: 0.7.0

CLI tools update to support REPLACE and insert overwrite
Key: HUDI-1261
URL: https://issues.apache.org/jira/browse/HUDI-1261
Project: Apache Hudi
Issue Type: Sub-task
Reporter: satish
Assignee: satish
Priority: Major
Fix For: 0.7.0

We introduced replace as part of https://github.com/apache/hudi/pull/2048; we need to change CLI tools to inspect REPLACE metadata files.
[jira] [Updated] (HUDI-1263) DeltaStreamer changes to support insert overwrite and replace
[ https://issues.apache.org/jira/browse/HUDI-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

satish updated HUDI-1263:
Fix Version/s: 0.7.0

DeltaStreamer changes to support insert overwrite and replace
Key: HUDI-1263
URL: https://issues.apache.org/jira/browse/HUDI-1263
Project: Apache Hudi
Issue Type: Sub-task
Reporter: satish
Assignee: satish
Priority: Major
Fix For: 0.7.0

Follow up from https://github.com/apache/hudi/pull/2048; we want to add DeltaStreamer support for replace.
[jira] [Updated] (HUDI-1260) Reader changes to support insert overwrite
[ https://issues.apache.org/jira/browse/HUDI-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

satish updated HUDI-1260:
Fix Version/s: 0.7.0

Reader changes to support insert overwrite
Key: HUDI-1260
URL: https://issues.apache.org/jira/browse/HUDI-1260
Project: Apache Hudi
Issue Type: Sub-task
Reporter: satish
Assignee: satish
Priority: Major
Fix For: 0.7.0

Same as HUDI-1072, but creating subtask for insert overwrite
[jira] [Updated] (HUDI-1276) delete replaced file groups during clean
[ https://issues.apache.org/jira/browse/HUDI-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

satish updated HUDI-1276:
Description: We clean replaced file groups during archival as part of PR #2048, but we may want to do this during the clean stage to prevent storage overhead. (was: Same as HUDI-1072, but creating subtask for insert overwrite)

delete replaced file groups during clean
Key: HUDI-1276
URL: https://issues.apache.org/jira/browse/HUDI-1276
Project: Apache Hudi
Issue Type: Sub-task
Reporter: satish
Assignee: satish
Priority: Major
[jira] [Updated] (HUDI-1276) delete replaced file groups during clean
[ https://issues.apache.org/jira/browse/HUDI-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

satish updated HUDI-1276:
Fix Version/s: 0.7.0
[jira] [Created] (HUDI-1276) delete replaced file groups during clean
satish created HUDI-1276:
Summary: delete replaced file groups during clean
Key: HUDI-1276
URL: https://issues.apache.org/jira/browse/HUDI-1276
Project: Apache Hudi
Issue Type: Sub-task
Reporter: satish
Assignee: satish

Same as HUDI-1072, but creating subtask for insert overwrite
[GitHub] [hudi] abhijeetkushe commented on issue #1737: [SUPPORT]spark streaming create small parquet files
abhijeetkushe commented on issue #1737: URL: https://github.com/apache/hudi/issues/1737#issuecomment-690685990

I am facing a similar problem. I am doing a POC for Hudi and am using the same data for both COW and MOR. I see compaction happening for both table types, as new versions of the same file are created, but the cleanup only happens for COW. This is the config which works for COW but not for MOR:

```
'hoodie.parquet.small.file.limit': '104857600',
'hoodie.compact.inline': True,
'hoodie.cleaner.commits.retained': 1,
```

What are the values required for the settings below?

```
'hoodie.logfile.max.size': '1048576',
'hoodie.logfile.to.parquet.compression.ratio': 0.35,
```
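For reference, the MOR-related knobs being discussed can be collected into a single writer-options map. This is only a sketch: the keys are real Hudi config names quoted in the thread plus the standard inline-compaction threshold, but the values are illustrative starting points, not recommendations (the right numbers depend on ingest volume and file sizes).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MorCompactionOptionsSketch {
    static Map<String, String> options() {
        Map<String, String> opts = new LinkedHashMap<>();
        // Compaction for MOR is driven by inline compaction plus a delta-commit
        // threshold; cleaning of older file versions is governed by the cleaner.
        opts.put("hoodie.compact.inline", "true");
        opts.put("hoodie.compact.inline.max.delta.commits", "5"); // compact after N delta commits
        opts.put("hoodie.cleaner.commits.retained", "1");         // file versions kept by the cleaner
        // Log-file sizing only affects MOR (COW tables have no log files):
        opts.put("hoodie.logfile.max.size", "1048576");           // bytes, illustrative
        opts.put("hoodie.logfile.to.parquet.compression.ratio", "0.35");
        return opts;
    }
}
```

These would typically be passed as `.option(key, value)` calls on the Spark DataFrame writer.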
[GitHub] [hudi] prashanthvg89 commented on issue #2065: [SUPPORT] Intermittent IllegalArgumentException while saving to Hudi dataset from Spark streaming job
prashanthvg89 commented on issue #2065: URL: https://github.com/apache/hudi/issues/2065#issuecomment-690675022

So far it's running well. Previously it used to fail after two days, and now it's been close to that with no errors. I'll update by the end of this week.
[GitHub] [hudi] satishkotha commented on a change in pull request #1929: [HUDI-1160] Support update partial fields for CoW table
satishkotha commented on a change in pull request #1929: URL: https://github.com/apache/hudi/pull/1929#discussion_r486525140

## File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java ##

@@ -94,6 +95,11 @@
 public static final String BULKINSERT_SORT_MODE = "hoodie.bulkinsert.sort.mode";
 public static final String DEFAULT_BULKINSERT_SORT_MODE = BulkInsertSortMode.GLOBAL_SORT.toString();
+ public static final String DELETE_MARKER_FIELD_PROP = "hoodie.write.delete.marker.field";

Review comment: Is this needed for this change? What is it used for?

## File path: hudi-client/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ##

@@ -117,7 +118,17 @@
 if (extraMetadata.isPresent()) { extraMetadata.get().forEach(metadata::addMetadata); }
-metadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY, config.getSchema());
+String schema = config.getSchema();
+if (config.updatePartialFields()) {
+  try {
+    TableSchemaResolver resolver = new TableSchemaResolver(table.getMetaClient());
+    schema = resolver.getTableAvroSchemaWithoutMetadataFields().toString();
+  } catch (Exception e) {
+    // ignore exception.
+    schema = config.getSchema();

Review comment: We are potentially reducing the schema here, so I think this can lead to issues. Can we throw an error? At the least, can you add a LOG here to make sure this gets noticed?

Review comment (on the TableSchemaResolver construction above): Do you need to create the resolver again? Does config.getLastSchema() work here?

## File path: hudi-client/src/main/java/org/apache/hudi/table/action/commit/MergeHelper.java ##

@@ -73,7 +74,11 @@
 } else {
 gReader = null;
 gWriter = null;
- readSchema = upsertHandle.getWriterSchemaWithMetafields();
+ if (table.getConfig().updatePartialFields() && !StringUtils.isNullOrEmpty(table.getConfig().getLastSchema())) {
+   readSchema = new Schema.Parser().parse(table.getConfig().getLastSchema());

Review comment: Similar comment as before: if we make config.getSchema() always track the full table schema, this can be simplified.

## File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java ##

@@ -90,9 +92,19 @@
- protected static Pair getWriterSchemaIncludingAndExcludingMetadataPair(HoodieWriteConfig config) {
+ protected static Pair getWriterSchemaIncludingAndExcludingMetadataPair(HoodieWriteConfig config, HoodieTable hoodieTable) {
   Schema originalSchema = new Schema.Parser().parse(config.getSchema());
   Schema hoodieSchema = HoodieAvroUtils.addMetadataFields(originalSchema);
+  boolean updatePartialFields = config.updatePartialFields();
+  if (updatePartialFields) {
+    try {
+      TableSchemaResolver resolver = new TableSchemaResolver(hoodieTable.getMetaClient());

Review comment: This is only applicable for MergeHandle, if I understand correctly. Do you think it's better to override this in MergeHandle?

## File path: hudi-client/src/main/java/org/apache/hudi/table/action/commit/MergeHelper.java ##

@@ -73,7 +74,11 @@
+ if (table.getConfig().updatePartialFields() && !StringUtils.isNullOrEmpty(table.getConfig().getLastSchema())) {
+   readSchema = new Schema.Parser().parse(table.getConfig().getLastSchema());
+ } else {
+   readSchema = upsertHandle.getWriterSchemaWithMetafields();

Review comment: We are also calling getWriterSchemaWithMetafields in other places in this class (example: line 163). Don't we need to read getLastSchema() there?

## File path: hudi-client/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java ##

@@ -237,6 +238,9 @@ protected void finalizeWrite(String instantTime, List stats, Ho * By default, return the writer schema in Write Config for storing in commit. */ protected String getSchemaToStoreInCommit() {
+if (config.updatePartialFields() && !StringUtils.isNullOrEmpty(config.getLas
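The feature under review ("update partial fields") boils down to merging an incoming record that carries only a subset of columns into the stored full record, reading against the full table schema rather than the narrower writer schema. A hedged sketch of that merge, with plain Maps and a field list standing in for Avro records and schema (this is not Hudi's actual MergeHandle code, just the idea the review is discussing):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartialUpdateSketch {
    // Merge a partial update into the existing record: fields present in the
    // update win; all other fields of the full schema are carried over from
    // the record already stored in the table.
    static Map<String, Object> merge(List<String> fullSchemaFields,
                                     Map<String, Object> existing,
                                     Map<String, Object> partialUpdate) {
        Map<String, Object> merged = new LinkedHashMap<>();
        for (String field : fullSchemaFields) {
            merged.put(field,
                partialUpdate.containsKey(field) ? partialUpdate.get(field)
                                                 : existing.get(field));
        }
        return merged;
    }
}
```

The reviewer's concern above maps directly onto this sketch: if the schema used for the merge is accidentally the partial writer schema instead of the full table schema, the carried-over fields are silently dropped.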
[GitHub] [hudi] bvaradar commented on pull request #1524: [HUDI-801] Adding a way to post process schema after it is fetched
bvaradar commented on pull request #1524: URL: https://github.com/apache/hudi/pull/1524#issuecomment-690554395

@pratyakshsharma : I will help getting this landed.
[GitHub] [hudi] bvaradar commented on pull request #2046: [HUDI-1230] Fix for preventing MOR datasource jobs from hanging via spark-submit
bvaradar commented on pull request #2046: URL: https://github.com/apache/hudi/pull/2046#issuecomment-690551558

@umehrot2 : Once you add the test, please ping me. Thanks, Balaji.V
[GitHub] [hudi] bvaradar commented on pull request #2058: [HUDI-1259] Cache some framework binaries to speed up the progress of building docker image in local env
bvaradar commented on pull request #2058: URL: https://github.com/apache/hudi/pull/2058#issuecomment-690546404

@yanghua : You can look at the docker compose file and https://github.com/apache/hudi/blob/master/docker/setup_demo.sh We mount hudi workspace inside docker to achieve it.

I guess, we can close this PR then ?
[GitHub] [hudi] wangxianghu edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu edited a comment on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-690335395

@vinothchandar @yanghua @leesf The CI is green now.
[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-690335395

@vinothchandar @yanghua @leesf The CI passed now.
[GitHub] [hudi] bradleyhurley commented on issue #2068: [SUPPORT]Deltastreamer Upsert Very Slow / Never Completes After Initial Data Load
bradleyhurley commented on issue #2068: URL: https://github.com/apache/hudi/issues/2068#issuecomment-690292104

Thanks @bvaradar - I think I have read most of the guides and documentation that I could find. Is there a formula that should drive the number of executors, cores per executor, driver memory, and executor memory?

With a properly sized configuration, do you have a ballpark of how long you would expect it to take to upsert 100M rows into a Hudi table with 100M existing rows, with 99%+ of the data being inserts vs. updates?
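On the sizing question above: a commonly cited Spark rule of thumb (generic, not Hudi-specific, and stated here only as an assumption) is to reserve one core and roughly 1 GB per node for the OS and YARN daemons, pick a fixed core count per executor, reserve one executor slot for the YARN application master, and leave ~10% of executor memory for off-heap overhead. A sketch of that arithmetic:

```java
class ExecutorSizingSketch {
    // Returns { totalExecutors, memoryPerExecutorGb } for a YARN cluster,
    // using the rule-of-thumb heuristic described above. All inputs are
    // per-node hardware figures; coresPerExecutor is the chosen executor size.
    static int[] size(int nodes, int vcoresPerNode, int memGbPerNode, int coresPerExecutor) {
        int usableCores = vcoresPerNode - 1;                 // leave 1 core per node for daemons
        int executorsPerNode = usableCores / coresPerExecutor;
        int totalExecutors = nodes * executorsPerNode - 1;   // reserve one slot for the YARN AM
        int memPerExecutorGb = (int) Math.floor(
            (memGbPerNode - 1) / (double) executorsPerNode * 0.9); // ~10% off-heap overhead
        return new int[] { totalExecutors, memPerExecutorGb };
    }
}
```

For example, three c5.xlarge-class cores (4 vcores, 8 GB each) with 3 cores per executor gives 2 executors of about 6 GB each, which is in the neighborhood of the configuration quoted earlier in this thread. Actual Hudi upsert throughput also depends heavily on index lookup and file sizing, so this heuristic only bounds the Spark side.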
[GitHub] [hudi] wangxianghu removed a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu removed a comment on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-689916910

> @wangxianghu the issue with the tests is that, now most of the tests are moved to hudi-spark-client. previously we had split tests into hudi-client and others. We need to edit `travis.yml` to adjust the splits again

@vinothchandar could you please help me edit travis.yml to adjust the splits? I am not familiar with that. Thanks :)
[GitHub] [hudi] pratyakshsharma commented on pull request #1524: [HUDI-801] Adding a way to post process schema after it is fetched
pratyakshsharma commented on pull request #1524: URL: https://github.com/apache/hudi/pull/1524#issuecomment-690099859

@afilipchik still working on this?
[hudi] branch asf-site updated: Travis CI build asf-site
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
new 1303b2d Travis CI build asf-site

1303b2d is described below:

commit 1303b2dfee4e8396c7e5314f55b08d34a6f7b21c
Author: CI
AuthorDate: Thu Sep 10 09:02:36 2020 +

    Travis CI build asf-site
---
 content/community.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/community.html b/content/community.html
index 0e4f06a..242681b 100644
--- a/content/community.html
+++ b/content/community.html
@@ -347,7 +347,7 @@
 (committer table row for Pratyaksh Sharma: avatar, https://github.com/pratyakshsharma, Committer)
-      pratyaksh13
+      pratyakshsharma
 (following row: avatar for xushiyan)
[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2078: [MINOR]Add clinbrain to powered by page
pratyakshsharma commented on a change in pull request #2078: URL: https://github.com/apache/hudi/pull/2078#discussion_r486176468

File path: docs/_docs/1_4_powered_by.md

@@ -28,6 +29,9 @@ offering real-time analysis on hudi dataset. Amazon Web Services is the World's leading cloud services provider. Apache Hudi is [pre-installed](https://aws.amazon.com/emr/features/hudi/) with the AWS Elastic Map Reduce offering, providing means for AWS users to perform record-level updates/deletes and manage storage efficiently.
+### Clinbrain
+[Clinbrain](https://www.clinbrain.com/) is the leading of big data platform on medical industry, we have built 200 medical big data centers by integrating Hudi Data Lake solution in numerous hospitals, hudi provides the abablility to upsert and deletes on hdfs, at the same time, it can make the fresh data-stream up-to-date effcienctlly in hadoop system with the hudi incremental view.

Review comments:
1. "is the leading of big data platform on medical industry" -> "is the leader of big data platform and usage in medical industry"
2. "industry, we have" -> "industry. We have"
3. "hospitals, hudi" -> "hospitals. Hudi"
4. "abablility" -> "ability"
5. "deletes" -> "delete"
6. "effcienctlly" -> "efficiently"
[hudi] branch asf-site updated: changed apache id for Pratyaksh
This is an automated email from the ASF dual-hosted git repository. pratyakshsharma pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 3a368b0  changed apache id for Pratyaksh
     new 18943b0  Merge pull request #2080 from pratyakshsharma/team-asf-site

3a368b0 is described below:

commit 3a368b096e54bce55edb5412e4ff49a078156810
Author: pratyakshsharma
AuthorDate: Thu Sep 10 01:12:34 2020 +0530

    changed apache id for Pratyaksh
---
 docs/_pages/community.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_pages/community.md b/docs/_pages/community.md
index 2f90b3c..30e3085 100644
--- a/docs/_pages/community.md
+++ b/docs/_pages/community.md
@@ -61,7 +61,7 @@ Committers are chosen by a majority vote of the Apache Hudi [PMC](https://www.ap
(committer table; avatar image cells omitted, rows for lamber-ken, Nishith Agarwal, Prasanna Rajaperumal, Raymond Xu, Shaofeng Li, and Sivabalan Narayanan unchanged)
-| (avatar) | [Pratyaksh Sharma](https://github.com/pratyakshsharma) | Committer | pratyaksh13 |
+| (avatar) | [Pratyaksh Sharma](https://github.com/pratyakshsharma) | Committer | pratyakshsharma |
[GitHub] [hudi] pratyakshsharma merged pull request #2080: [MINOR]: changed apache id for Pratyaksh
pratyakshsharma merged pull request #2080: URL: https://github.com/apache/hudi/pull/2080
[GitHub] [hudi] pratyakshsharma commented on pull request #1990: [HUDI-1199]: relocated jetty in hudi-utilities-bundle pom
pratyakshsharma commented on pull request #1990: URL: https://github.com/apache/hudi/pull/1990#issuecomment-690086477 @vinothchandar Do you have any concerns, or can I merge this now?
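For readers unfamiliar with the fix in PR #1990 ("relocated jetty in hudi-utilities-bundle pom"): relocating a dependency in a bundle jar is typically done with the maven-shade-plugin's `<relocation>` element, which rewrites package names inside the shaded jar so the bundled copy cannot clash with another Jetty version on the user's classpath. The snippet below is a hedged sketch only; the shaded prefix and exact pattern are assumptions, not the literal contents of the PR.

```xml
<!-- Hypothetical sketch of a maven-shade-plugin relocation for Jetty
     inside a bundle pom; the actual prefix used by PR #1990 may differ. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <!-- Rewrite Jetty classes into a Hudi-private namespace in the shaded jar -->
        <pattern>org.eclipse.jetty.</pattern>
        <shadedPattern>org.apache.hudi.org.eclipse.jetty.</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

The trailing dots in the patterns matter: they anchor the rewrite to the package prefix so unrelated classes whose names merely start with the same characters are not relocated.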