[GitHub] [hudi] bvaradar commented on pull request #2084: [HUDI-802] AWSDmsTransformer does not handle insert and delete of a row in a single batch correctly

2020-09-10 Thread GitBox


bvaradar commented on pull request #2084:
URL: https://github.com/apache/hudi/pull/2084#issuecomment-690863065


   @nsivabalan : Can you please review this?
   
   Thanks,
   Balaji.V



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar opened a new pull request #2084: [HUDI-802] AWSDmsTransformer does not handle insert and delete of a row in a single batch correctly

2020-09-10 Thread GitBox


bvaradar opened a new pull request #2084:
URL: https://github.com/apache/hudi/pull/2084


   







[jira] [Reopened] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

2020-09-10 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reopened HUDI-802:
-

> AWSDmsTransformer does not handle insert -> delete of a row in a single batch 
> correctly
> ---
>
> Key: HUDI-802
> URL: https://issues.apache.org/jira/browse/HUDI-802
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Christopher Weaver
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> The provided AWSDmsAvroPayload class 
> ([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java])
>  currently handles cases where the "Op" column is a "D" for updates, and 
> successfully removes the row from the resulting table. 
> However, when an insert is quickly followed by a delete on the row (e.g. DMS 
> processes them together and puts the update records together in the same 
> parquet file), the row incorrectly appears in the resulting table. In this 
> case, the record is not in the table and getInsertValue is called rather than 
> combineAndGetUpdateValue. Since the logic to check for a delete is in 
> combineAndGetUpdateValue, it is skipped and the delete is missed. Something 
> like this could fix this issue: 
> [https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].
>  
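The fix described above can be sketched in a simplified, dependency-free form. Plain Maps stand in for Avro GenericRecords here, and the class name is illustrative; this is not Hudi's actual AWSDmsAvroPayload, only a sketch of moving the "Op" check onto the insert path as well:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Simplified sketch of the proposed fix: honor the DMS "Op" column in
// getInsertValue too, not only in combineAndGetUpdateValue.
public class DmsDeleteAwarePayload {
    static final String OP_FIELD = "Op";

    private final Map<String, Object> record;

    public DmsDeleteAwarePayload(Map<String, Object> record) {
        this.record = record;
    }

    // Returning empty signals a delete, so an insert followed by a delete of
    // the same row within one batch never materializes in the table.
    public Optional<Map<String, Object>> getInsertValue() {
        if ("D".equals(record.get(OP_FIELD))) {
            return Optional.empty();
        }
        return Optional.of(record);
    }

    // The update path applies the same delete check to the incoming record.
    public Optional<Map<String, Object>> combineAndGetUpdateValue(Map<String, Object> currentValue) {
        return getInsertValue();
    }
}
```

With this shape, the delete logic lives in one place and is exercised regardless of whether the record is seen as an insert or an update.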



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

2020-09-10 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-802:

Fix Version/s: (was: 0.6.0)
   0.6.1

> AWSDmsTransformer does not handle insert -> delete of a row in a single batch 
> correctly
> ---
>
> Key: HUDI-802
> URL: https://issues.apache.org/jira/browse/HUDI-802
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Christopher Weaver
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>





[GitHub] [hudi] yanghua closed pull request #2058: [HUDI-1259] Cache some framework binaries to speed up the progress of building docker image in local env

2020-09-10 Thread GitBox


yanghua closed pull request #2058:
URL: https://github.com/apache/hudi/pull/2058


   







[GitHub] [hudi] yanghua commented on pull request #2058: [HUDI-1259] Cache some framework binaries to speed up the progress of building docker image in local env

2020-09-10 Thread GitBox


yanghua commented on pull request #2058:
URL: https://github.com/apache/hudi/pull/2058#issuecomment-690838645


   > @yanghua : You can look at the docker compose file and 
https://github.com/apache/hudi/blob/master/docker/setup_demo.sh We mount hudi 
workspace inside docker to achieve it.
   > 
   > I guess, we can close this PR then ?
   
   Got it. IIUC, you mean:
   
   ```
   volumes:
   - ${HUDI_WS}:/var/hoodie/ws
   ```
   
   > I guess, we can close this PR then ?
   
   Yes.
   
   Actually, my colleagues do not know about it either. We rarely use Docker. IMO, it 
would be better to describe this in the documentation. WDYT?
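For reference, the mount under discussion can be written out as a fuller docker-compose fragment. The service and image names below are placeholders; only the `${HUDI_WS}:/var/hoodie/ws` volume mapping reflects the Hudi demo setup being described:

```yaml
version: "3"
services:
  adhoc-worker:                 # placeholder service name
    image: example/hudi-demo    # placeholder image
    environment:
      - HUDI_WS=${HUDI_WS}
    volumes:
      # Mount the local Hudi workspace into the container so locally built
      # jars are visible inside Docker without rebuilding the image.
      - ${HUDI_WS}:/var/hoodie/ws
```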
   
   







[jira] [Commented] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

2020-09-10 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193958#comment-17193958
 ] 

Balaji Varadarajan commented on HUDI-802:
-

Thanks [~Weves] for the ticket. I will get your code changes landed. 

> AWSDmsTransformer does not handle insert -> delete of a row in a single batch 
> correctly
> ---
>
> Key: HUDI-802
> URL: https://issues.apache.org/jira/browse/HUDI-802
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Christopher Weaver
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>





[jira] [Assigned] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

2020-09-10 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-802:
---

Assignee: Balaji Varadarajan  (was: sivabalan narayanan)

> AWSDmsTransformer does not handle insert -> delete of a row in a single batch 
> correctly
> ---
>
> Key: HUDI-802
> URL: https://issues.apache.org/jira/browse/HUDI-802
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Christopher Weaver
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>





[GitHub] [hudi] hj2016 commented on a change in pull request #2078: [MINOR]Add clinbrain to powered by page

2020-09-10 Thread GitBox


hj2016 commented on a change in pull request #2078:
URL: https://github.com/apache/hudi/pull/2078#discussion_r486733426



##
File path: docs/_docs/1_4_powered_by.md
##
@@ -28,6 +29,9 @@ offering real-time analysis on hudi dataset.
 Amazon Web Services is the World's leading cloud services provider. Apache 
Hudi is [pre-installed](https://aws.amazon.com/emr/features/hudi/) with the AWS 
Elastic Map Reduce 
 offering, providing means for AWS users to perform record-level 
updates/deletes and manage storage efficiently.
 
+### Clinbrain
+[Clinbrain](https://www.clinbrain.com/) is a leading big data platform in the 
medical industry. We have built 200 medical big data centers by integrating the 
Hudi data lake solution in numerous hospitals. Hudi provides the ability to 
upsert and delete on HDFS, and its incremental view keeps fresh data streams 
up to date efficiently in the Hadoop ecosystem.

Review comment:
   I modified it









[GitHub] [hudi] rafaelhbarros opened a new issue #2083: Kafka readStream performance slow [SUPPORT]

2020-09-10 Thread GitBox


rafaelhbarros opened a new issue #2083:
URL: https://github.com/apache/hudi/issues/2083


   **Describe the problem you faced**
   
   I have a Kafka topic that produces 1-2 million records per minute. I'm 
trying to write these records to S3 in the Hudi format.
   I can't get it to keep up with the input. I'm running on EMR with an m5.xlarge 
driver and 3x c5.xlarge core instances. The data is serialized in Avro and 
deserialized with Schema Registry (using ABRiS).
   
   **Environment Description**
   
   ```spark-submit \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --master yarn \
   --name hudi-consumer \
   --deploy-mode cluster \
   --conf spark.yarn.submit.waitAppCompletion=false \
   --conf spark.scheduler.mode=FAIR \
   --conf spark.task.maxFailures=10 \
   --conf spark.memory.fraction=0.4 \
   --conf spark.rdd.compress=true \
   --conf spark.kryoserializer.buffer.max=512m \
   --conf spark.memory.storageFraction=0.1 \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.sql.hive.convertMetastoreParquet=false \
   --conf spark.driver.maxResultSize=3g \
   --conf spark.yarn.max.executor.failures=10 \
   --conf spark.file.partitions=10 \
   --conf spark.sql.shuffle.partitions=80 \
   --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails 
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
-XX:+CMSClassUnloadingEnabled -XX:+ExitOnOutOfMemoryError" \
   --conf spark.driver.extraJavaOptions="-XX:+PrintTenuringDistribution 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime 
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps 
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" \
   --driver-memory 4G \
   --executor-memory 5G \
   --executor-cores 4 \
   --num-executors 6 \
   --class  
   ```
   
   Hudi confs:
   
   ```
   hoodie.combine.before.upsert=false 
   hoodie.bulkinsert.shuffle.parallelism=10 
   hoodie.insert.shuffle.parallelism=10 
   hoodie.upsert.shuffle.parallelism=10 
   hoodie.delete.shuffle.parallelism=1
   TABLE_TYPE_OPT_KEY()=COW_TABLE_TYPE_OPT_VAL()
   ```
   
   * Hudi version :
   
   0.5.2-incubating
   
   * Spark version : 2.4.4 (scala 2.12, emr 6.0.0)
   
   * Hive version : N/A
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) :
   
   S3
   
   * Running on Docker? (yes/no) :
   
   No
   







[GitHub] [hudi] satishkotha commented on pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-10 Thread GitBox


satishkotha commented on pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#issuecomment-690711545


   @vinothchandar As discussed, I added a boolean in WriteStatus and removed 
HoodieReplaceStat. See this 
[diff](https://github.com/apache/hudi/pull/2048/commits/94b275dbd20ec82ebe568b47bb28447d92ab996f).
 I committed it as a separate git SHA because this still looks somewhat 
awkward IMO. Please take a look, and I can revert or reimplement it in a different 
way.
   
   Also, I created https://issues.apache.org/jira/browse/HUDI-1276 for cleaning 
up replaced files during the clean action.
   
   I also renamed 'replace' to 'replacecommit' everywhere as you suggested. 
   
   Please let me know if you have additional comments/suggestions







[jira] [Updated] (HUDI-1266) Add e2e integration tests for replace and insert-overwrite

2020-09-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1266:
-
Fix Version/s: 0.7.0

> Add e2e integration tests for replace and insert-overwrite
> --
>
> Key: HUDI-1266
> URL: https://issues.apache.org/jira/browse/HUDI-1266
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.7.0
>
>






[jira] [Updated] (HUDI-1264) incremental read support with replace

2020-09-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1264:
-
Fix Version/s: 0.7.0

> incremental read support with replace
> -
>
> Key: HUDI-1264
> URL: https://issues.apache.org/jira/browse/HUDI-1264
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.7.0
>
>
> initial version, we could fail incremental reads if there is a REPLACE 
> instant. 





[jira] [Updated] (HUDI-1262) Documentation Update for Insert Overwrite

2020-09-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1262:
-
Fix Version/s: 0.7.0

> Documentation Update for Insert Overwrite
> -
>
> Key: HUDI-1262
> URL: https://issues.apache.org/jira/browse/HUDI-1262
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.7.0
>
>






[jira] [Updated] (HUDI-1261) CLI tools update to support REPLACE and insert overwrite

2020-09-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1261:
-
Fix Version/s: 0.7.0

> CLI tools update to support REPLACE and insert overwrite
> 
>
> Key: HUDI-1261
> URL: https://issues.apache.org/jira/browse/HUDI-1261
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.7.0
>
>
> We introduced replace as part of https://github.com/apache/hudi/pull/2048; we 
> need to change the CLI tools to inspect REPLACE metadata files.





[jira] [Updated] (HUDI-1263) DeltaStreamer changes to support insert overwrite and replace

2020-09-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1263:
-
Fix Version/s: 0.7.0

> DeltaStreamer changes to support insert overwrite and replace
> -
>
> Key: HUDI-1263
> URL: https://issues.apache.org/jira/browse/HUDI-1263
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.7.0
>
>
> Follow up from https://github.com/apache/hudi/pull/2048, we want to add delta 
> streamer support for replace





[jira] [Updated] (HUDI-1260) Reader changes to support insert overwrite

2020-09-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1260:
-
Fix Version/s: 0.7.0

> Reader changes to support insert overwrite
> -
>
> Key: HUDI-1260
> URL: https://issues.apache.org/jira/browse/HUDI-1260
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.7.0
>
>
> Same as HUDI-1072, but creating subtask for insert overwrite





[jira] [Updated] (HUDI-1276) delete replaced file groups during clean

2020-09-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1276:
-
Description: We clean replaced file groups during archival as part of 
PR#2048. But we may want to do this during the clean stage to prevent storage 
overhead  (was: Same as HUDI-1072, but creating subtask for insert overwrite)

> delete replaced file groups during clean
> 
>
> Key: HUDI-1276
> URL: https://issues.apache.org/jira/browse/HUDI-1276
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
>
> We clean replaced file groups during archival as part of PR#2048. But we may 
> want to do this during the clean stage to prevent storage overhead





[jira] [Updated] (HUDI-1276) delete replaced file groups during clean

2020-09-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1276:
-
Fix Version/s: 0.7.0

> delete replaced file groups during clean
> 
>
> Key: HUDI-1276
> URL: https://issues.apache.org/jira/browse/HUDI-1276
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.7.0
>
>
> We clean replaced file groups during archival as part of PR#2048. But we may 
> want to do this during the clean stage to prevent storage overhead





[jira] [Created] (HUDI-1276) delete replaced file groups during clean

2020-09-10 Thread satish (Jira)
satish created HUDI-1276:


 Summary: delete replaced file groups during clean
 Key: HUDI-1276
 URL: https://issues.apache.org/jira/browse/HUDI-1276
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: satish
Assignee: satish


Same as HUDI-1072, but creating subtask for insert overwrite





[GitHub] [hudi] abhijeetkushe commented on issue #1737: [SUPPORT]spark streaming create small parquet files

2020-09-10 Thread GitBox


abhijeetkushe commented on issue #1737:
URL: https://github.com/apache/hudi/issues/1737#issuecomment-690685990


   I am facing a similar problem. I am doing a POC for Hudi and am using the 
same data for both COW and MOR. I see compaction happening for both table 
types, as new versions of the same file are created, but the cleanup only 
happens for COW.
   This is the config that works for COW but not for MOR:
'hoodie.parquet.small.file.limit': '104857600',
 'hoodie.compact.inline': True,
 'hoodie.cleaner.commits.retained': 1,
   What values are required for the settings below?
   'hoodie.logfile.max.size': '1048576',
 'hoodie.logfile.to.parquet.compression.ratio': 0.35,







[GitHub] [hudi] prashanthvg89 commented on issue #2065: [SUPPORT] Intermittent IllegalArgumentException while saving to Hudi dataset from Spark streaming job

2020-09-10 Thread GitBox


prashanthvg89 commented on issue #2065:
URL: https://github.com/apache/hudi/issues/2065#issuecomment-690675022


   So far it's running well. It previously used to fail after about two days, 
and now it's been running close to that long with no errors. I'll post an update 
by the end of this week.







[GitHub] [hudi] satishkotha commented on a change in pull request #1929: [HUDI-1160] Support update partial fields for CoW table

2020-09-10 Thread GitBox


satishkotha commented on a change in pull request #1929:
URL: https://github.com/apache/hudi/pull/1929#discussion_r486525140



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -94,6 +95,11 @@
   public static final String BULKINSERT_SORT_MODE = 
"hoodie.bulkinsert.sort.mode";
   public static final String DEFAULT_BULKINSERT_SORT_MODE = 
BulkInsertSortMode.GLOBAL_SORT
   .toString();
+  public static final String DELETE_MARKER_FIELD_PROP = 
"hoodie.write.delete.marker.field";

Review comment:
   Is this needed for this change? what is this used for? 

##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
##
@@ -117,7 +118,17 @@ public boolean commitStats(String instantTime, 
List stats, Opti
 if (extraMetadata.isPresent()) {
   extraMetadata.get().forEach(metadata::addMetadata);
 }
-metadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY, config.getSchema());
+String schema = config.getSchema();
+if (config.updatePartialFields()) {
+  try {
+TableSchemaResolver resolver = new 
TableSchemaResolver(table.getMetaClient());
+schema = resolver.getTableAvroSchemaWithoutMetadataFields().toString();
+  } catch (Exception e) {
+// ignore exception.
+schema = config.getSchema();

Review comment:
   We are potentially reducing the schema here, so I think this can lead to 
issues. Can we throw an error? At the least, can you add a LOG here to make sure 
this gets noticed?
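The suggestion above — surfacing the failure instead of silently swallowing it — can be sketched as follows. The `Resolver` interface and `schemaToCommit` helper are illustrative stand-ins for the PR's `TableSchemaResolver` call, not Hudi's actual code:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch: log loudly before falling back to the writer config
// schema, so a failed table-schema resolution does not go unnoticed.
public class SchemaFallback {
    private static final Logger LOG = Logger.getLogger(SchemaFallback.class.getName());

    interface Resolver {
        String resolveTableSchema() throws Exception;
    }

    static String schemaToCommit(Resolver resolver, String configSchema) {
        try {
            return resolver.resolveTableSchema();
        } catch (Exception e) {
            // Surface the problem instead of silently reducing the schema.
            LOG.log(Level.WARNING, "Schema resolution failed; falling back to writer config schema", e);
            return configSchema;
        }
    }
}
```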

##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
##
@@ -117,7 +118,17 @@ public boolean commitStats(String instantTime, 
List stats, Opti
 if (extraMetadata.isPresent()) {
   extraMetadata.get().forEach(metadata::addMetadata);
 }
-metadata.addMetadata(HoodieCommitMetadata.SCHEMA_KEY, config.getSchema());
+String schema = config.getSchema();
+if (config.updatePartialFields()) {
+  try {
+TableSchemaResolver resolver = new 
TableSchemaResolver(table.getMetaClient());

Review comment:
   Do you need to create the resolver again? Does config.getLastSchema() work 
here?

##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/commit/MergeHelper.java
##
@@ -73,7 +74,11 @@
 } else {
   gReader = null;
   gWriter = null;
-  readSchema = upsertHandle.getWriterSchemaWithMetafields();
+  if (table.getConfig().updatePartialFields() && 
!StringUtils.isNullOrEmpty(table.getConfig().getLastSchema())) {
+readSchema = new 
Schema.Parser().parse(table.getConfig().getLastSchema());

Review comment:
   Similar comment as before: if we make config.getSchema() always track the 
full table schema, this can be simplified.
   

##
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
##
@@ -90,9 +92,19 @@ protected HoodieWriteHandle(HoodieWriteConfig config, String 
instantTime, String
* @param config Write Config
* @return
*/
-  protected static Pair 
getWriterSchemaIncludingAndExcludingMetadataPair(HoodieWriteConfig config) {
+  protected static Pair 
getWriterSchemaIncludingAndExcludingMetadataPair(HoodieWriteConfig config, 
HoodieTable hoodieTable) {
 Schema originalSchema = new Schema.Parser().parse(config.getSchema());
 Schema hoodieSchema = HoodieAvroUtils.addMetadataFields(originalSchema);
+boolean updatePartialFields = config.updatePartialFields();
+if (updatePartialFields) {
+  try {
+TableSchemaResolver resolver = new 
TableSchemaResolver(hoodieTable.getMetaClient());

Review comment:
   This is only applicable to MergeHandle, if I understand correctly. Do 
you think it's better to override this in MergeHandle?

##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/commit/MergeHelper.java
##
@@ -73,7 +74,11 @@
 } else {
   gReader = null;
   gWriter = null;
-  readSchema = upsertHandle.getWriterSchemaWithMetafields();
+  if (table.getConfig().updatePartialFields() && 
!StringUtils.isNullOrEmpty(table.getConfig().getLastSchema())) {
+readSchema = new 
Schema.Parser().parse(table.getConfig().getLastSchema());
+  } else {
+readSchema = upsertHandle.getWriterSchemaWithMetafields();

Review comment:
   We are also calling getWriterSchemaWithMetafields in other places in 
this class (for example, line 163). Don't we need to read getLastSchema() there?

##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java
##
@@ -237,6 +238,9 @@ protected void finalizeWrite(String instantTime, 
List stats, Ho
* By default, return the writer schema in Write Config for storing in 
commit.
*/
   protected String getSchemaToStoreInCommit() {
+if (config.updatePartialFields() && 

[GitHub] [hudi] bvaradar commented on pull request #1524: [HUDI-801] Adding a way to post process schema after it is fetched

2020-09-10 Thread GitBox


bvaradar commented on pull request #1524:
URL: https://github.com/apache/hudi/pull/1524#issuecomment-690554395


   @pratyakshsharma : I will help get this landed.







[GitHub] [hudi] bvaradar commented on pull request #2046: [HUDI-1230] Fix for preventing MOR datasource jobs from hanging via spark-submit

2020-09-10 Thread GitBox


bvaradar commented on pull request #2046:
URL: https://github.com/apache/hudi/pull/2046#issuecomment-690551558


   @umehrot2 : Once you add the test, please ping me.
   
   Thanks,
   Balaji.V







[GitHub] [hudi] bvaradar commented on pull request #2058: [HUDI-1259] Cache some framework binaries to speed up the progress of building docker image in local env

2020-09-10 Thread GitBox


bvaradar commented on pull request #2058:
URL: https://github.com/apache/hudi/pull/2058#issuecomment-690546404


   @yanghua : You can look at the docker compose file and 
https://github.com/apache/hudi/blob/master/docker/setup_demo.sh We mount hudi 
workspace inside docker to achieve it.
   
   I guess, we can close this PR then ?







[GitHub] [hudi] wangxianghu edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-10 Thread GitBox


wangxianghu edited a comment on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-690335395


   @vinothchandar @yanghua @leesf  The ci is green now







[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-10 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-690335395


   @vinothchandar @yanghua @leesf  The ci passed now







[GitHub] [hudi] bradleyhurley commented on issue #2068: [SUPPORT]Deltastreamer Upsert Very Slow / Never Completes After Initial Data Load

2020-09-10 Thread GitBox


bradleyhurley commented on issue #2068:
URL: https://github.com/apache/hudi/issues/2068#issuecomment-690292104


   Thanks @bvaradar - I think I have read most of the guides and documentation 
that I could find. Is there a formula that should drive the number of 
executors, cores per executor, driver memory, and executor memory?
   
   With a properly sized configuration do you have a ballpark of how long you 
would expect it to take to upsert 100M rows into a Hudi table with 100M 
existing rows with 99%+ of the data being an insert vs update?







[GitHub] [hudi] wangxianghu removed a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-10 Thread GitBox


wangxianghu removed a comment on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-689916910


   > @wangxianghu the issue with the tests is that, now most of the tests are 
moved to hudi-spark-client. previously we had split tests into hudi-client and 
others. We need to edit `travis.yml` to adjust the splits again
   
   @vinothchandar could you please help me edit travis.yml to adjust the splits? 
I am not familiar with that.
   Thanks :)







[GitHub] [hudi] pratyakshsharma commented on pull request #1524: [HUDI-801] Adding a way to post process schema after it is fetched

2020-09-10 Thread GitBox


pratyakshsharma commented on pull request #1524:
URL: https://github.com/apache/hudi/pull/1524#issuecomment-690099859


   @afilipchik are you still working on this? 







[hudi] branch asf-site updated: Travis CI build asf-site

2020-09-10 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 1303b2d  Travis CI build asf-site
1303b2d is described below

commit 1303b2dfee4e8396c7e5314f55b08d34a6f7b21c
Author: CI 
AuthorDate: Thu Sep 10 09:02:36 2020 +

Travis CI build asf-site
---
 content/community.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/community.html b/content/community.html
index 0e4f06a..242681b 100644
--- a/content/community.html
+++ b/content/community.html
@@ -347,7 +347,7 @@ Committers are chosen by a majority vote of the Apache Hudi 
https://www
   https://avatars.githubusercontent.com/pratyakshsharma; 
style="max-width: 100px" alt="pratyakshsharma" align="middle" />
   https://github.com/pratyakshsharma;>Pratyaksh 
Sharma
   Committer
-  pratyaksh13
+  pratyakshsharma
 
 
   https://avatars.githubusercontent.com/xushiyan; 
style="max-width: 100px" alt="xushiyan" align="middle" />



[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2078: [MINOR]Add clinbrain to powered by page

2020-09-10 Thread GitBox


pratyakshsharma commented on a change in pull request #2078:
URL: https://github.com/apache/hudi/pull/2078#discussion_r486176468



##
File path: docs/_docs/1_4_powered_by.md
##
@@ -28,6 +29,9 @@ offering real-time analysis on hudi dataset.
 Amazon Web Services is the World's leading cloud services provider. Apache 
Hudi is [pre-installed](https://aws.amazon.com/emr/features/hudi/) with the AWS 
Elastic Map Reduce 
 offering, providing means for AWS users to perform record-level 
updates/deletes and manage storage efficiently.
 
+### Clinbrain
+[Clinbrain](https://www.clinbrain.com/) is the leading of big data platform on 
medical industry, we have built 200 medical big data centers by integrating 
Hudi Data Lake solution in numerous hospitals, hudi provides the abablility to 
upsert and deletes on hdfs, at the same time, it can make the fresh data-stream 
up-to-date effcienctlly in hadoop system with the hudi incremental view.

Review comment:
   1. is the leading of big data platform on medical industry -> is the 
leader of big data platform and usage in medical industry.
   2. industry, we have  -> industry. We have

##
File path: docs/_docs/1_4_powered_by.md
##
@@ -28,6 +29,9 @@ offering real-time analysis on hudi dataset.
 Amazon Web Services is the World's leading cloud services provider. Apache 
Hudi is [pre-installed](https://aws.amazon.com/emr/features/hudi/) with the AWS 
Elastic Map Reduce 
 offering, providing means for AWS users to perform record-level 
updates/deletes and manage storage efficiently.
 
+### Clinbrain
+[Clinbrain](https://www.clinbrain.com/) is the leading of big data platform on 
medical industry, we have built 200 medical big data centers by integrating 
Hudi Data Lake solution in numerous hospitals, hudi provides the abablility to 
upsert and deletes on hdfs, at the same time, it can make the fresh data-stream 
up-to-date effcienctlly in hadoop system with the hudi incremental view.

Review comment:
   3. hospitals, hudi -> hospitals. Hudi
   4. abablility -> ability
   5. deletes -> delete
   6. effcienctlly -> efficiently









[hudi] branch asf-site updated: changed apache id for Pratyaksh

2020-09-10 Thread pratyakshsharma
This is an automated email from the ASF dual-hosted git repository.

pratyakshsharma pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 3a368b0  changed apache id for Pratyaksh
 new 18943b0  Merge pull request #2080 from pratyakshsharma/team-asf-site
3a368b0 is described below

commit 3a368b096e54bce55edb5412e4ff49a078156810
Author: pratyakshsharma 
AuthorDate: Thu Sep 10 01:12:34 2020 +0530

changed apache id for Pratyaksh
---
 docs/_pages/community.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_pages/community.md b/docs/_pages/community.md
index 2f90b3c..30e3085 100644
--- a/docs/_pages/community.md
+++ b/docs/_pages/community.md
@@ -61,7 +61,7 @@ Committers are chosen by a majority vote of the Apache Hudi 
[PMC](https://www.ap
 | https://avatars.githubusercontent.com/lamber-ken; alt="lamber-ken" 
style="max-width: 100px;" align="middle" /> | 
[lamber-ken](https://github.com/lamber-ken)   | Committer | 
lamberken |
 | https://avatars.githubusercontent.com/n3nash; style="max-width: 
100px" alt="n3nash" align="middle" /> | [Nishith 
Agarwal](https://github.com/n3nash) | PMC, Committer | nagarwal 
|
 | https://avatars.githubusercontent.com/prasannarajaperumal; 
style="max-width: 100px" alt="prasannarajaperumal" align="middle" /> | 
[Prasanna Rajaperumal](https://github.com/prasannarajaperumal) | PMC, Committer 
| prasanna |
-| https://avatars.githubusercontent.com/pratyakshsharma; 
style="max-width: 100px" alt="pratyakshsharma" align="middle" /> | [Pratyaksh 
Sharma](https://github.com/pratyakshsharma)  | Committer
   | pratyaksh13|
+| https://avatars.githubusercontent.com/pratyakshsharma; 
style="max-width: 100px" alt="pratyakshsharma" align="middle" /> | [Pratyaksh 
Sharma](https://github.com/pratyakshsharma)  | Committer
   | pratyakshsharma|
 | https://avatars.githubusercontent.com/xushiyan; style="max-width: 
100px" alt="xushiyan" align="middle" /> | [Raymond 
Xu](https://github.com/xushiyan)  | Committer   | 
xushiyan|
 | https://avatars.githubusercontent.com/leesf; style="max-width: 
100px" alt="leesf" align="middle" /> | [Shaofeng Li](https://github.com/leesf)  
| PMC, Committer   | leesf|
 | https://avatars.githubusercontent.com/nsivabalan; 
style="max-width: 100px" alt="nsivabalan" align="middle" /> | [Sivabalan 
Narayanan](https://github.com/nsivabalan) | Committer | sivabalan  |



[GitHub] [hudi] pratyakshsharma merged pull request #2080: [MINOR]: changed apache id for Pratyaksh

2020-09-10 Thread GitBox


pratyakshsharma merged pull request #2080:
URL: https://github.com/apache/hudi/pull/2080


   







[GitHub] [hudi] pratyakshsharma commented on pull request #1990: [HUDI-1199]: relocated jetty in hudi-utilities-bundle pom

2020-09-10 Thread GitBox


pratyakshsharma commented on pull request #1990:
URL: https://github.com/apache/hudi/pull/1990#issuecomment-690086477


   @vinothchandar Do you have any concerns, or can I merge this now?







[GitHub] [hudi] bvaradar commented on issue #2075: [SUPPORT] hoodie.datasource.write.precombine.field not working as expected

2020-09-10 Thread GitBox


bvaradar commented on issue #2075:
URL: https://github.com/apache/hudi/issues/2075#issuecomment-690024060


   @rajgowtham24 : This is a known issue in 0.5.x and was fixed in the 0.6.0 version







[GitHub] [hudi] bvaradar commented on issue #2068: [SUPPORT]Deltastreamer Upsert Very Slow / Never Completes After Initial Data Load

2020-09-10 Thread GitBox


bvaradar commented on issue #2068:
URL: https://github.com/apache/hudi/issues/2068#issuecomment-690012667


   @bradleyhurley : The errors are due to shuffle fetch failures. Increasing 
executor memory and resources in general helps.







[jira] [Resolved] (HUDI-1255) Combine and get updateValue in multiFields

2020-09-10 Thread karl wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

karl wang resolved HUDI-1255.
-
Resolution: Fixed

> Combine and get updateValue in multiFields
> --
>
> Key: HUDI-1255
> URL: https://issues.apache.org/jira/browse/HUDI-1255
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: karl wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Update the current value only for the fields that you want to change.
> The default payload OverwriteWithLatestAvroPayload overwrites the whole record
> when comparing to orderingVal. This doesn't meet our need when we just want to
> change specified fields.
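The partial-update behaviour the issue asks for can be sketched with plain Python dicts. This is only an illustration of the merge idea, not Hudi's actual payload API; the function name, the `update_fields` parameter, and the sample records are all hypothetical.

```python
def combine_and_get_update_value(current, incoming, update_fields):
    """Merge only the listed fields from the incoming record into the
    currently stored record, leaving every other field untouched.

    Plain-dict illustration of the behaviour requested in HUDI-1255;
    this is not Hudi's payload interface.
    """
    merged = dict(current)
    for field in update_fields:
        # Skip null incoming values so they do not wipe out stored data.
        if field in incoming and incoming[field] is not None:
            merged[field] = incoming[field]
    return merged


stored = {"id": 1, "name": "alice", "email": "a@x.com", "score": 10}
update = {"id": 1, "name": None, "email": "alice@x.com", "score": 12}

# Only email and score are refreshed; name survives the null in the update.
result = combine_and_get_update_value(stored, update, ["email", "score"])
print(result)  # {'id': 1, 'name': 'alice', 'email': 'alice@x.com', 'score': 12}
```

By contrast, an overwrite-with-latest payload would replace the whole stored record, losing `name` when the incoming record carries a null for it.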



--
This message was sent by Atlassian Jira
(v8.3.4#803005)