[GitHub] [hudi] shengchiqu commented on issue #7229: [SUPPORT] flink connector sink Update the partition value, the old data is still there

2023-01-16 Thread GitBox


shengchiqu commented on issue #7229:
URL: https://github.com/apache/hudi/issues/7229#issuecomment-1384903763

   
   
![image](https://user-images.githubusercontent.com/118461079/212827386-65d85695-1050-457a-a938-abf5e80f34cb.png)
   
   This is c_custkey=1; only one record updated the partition column. Because 
changelog.enabled=true keeps all change records, Flink cannot delete the old 
data in the old partition, so a batch read of the Hudi table with Spark shows 
duplicate data (one row per historical partition value). How can this be 
solved?
   I want to build a data warehouse with Flink and Hudi.
   ![image](https://user-images.githubusercontent.com/118461079/212827501-45aa243b-7b9b-450a-b6b1-9ec9e5dd610e.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shengchiqu commented on issue #7229: [SUPPORT] flink connector sink Update the partition value, the old data is still there

2023-01-16 Thread GitBox


shengchiqu commented on issue #7229:
URL: https://github.com/apache/hudi/issues/7229#issuecomment-1384871989

   @yihua thanks. My problem is that if changelog.enabled is true, Flink 
incremental streaming reads are fine, but if Spark/Hive/Flink reads the Hudi 
table offline (batch read), there are duplicates: every update is present and 
nothing is de-duplicated. Are incremental stream reads and batch reads 
mutually exclusive?
   
   changelog.enabled=true => Flink incremental streaming shows the correct 
CDC; the batch read contains duplicates
   ```shell
   +-----------+------------+-----------+-------------+-----------------+-----------+--------------+--------------------------------+-------------------------+
   | C_CUSTKEY | C_NAME     | C_ADDRESS | C_NATIONKEY | C_PHONE         | C_ACCTBAL | C_MKTSEGMENT | C_COMMENT                      | ts                      |
   +-----------+------------+-----------+-------------+-----------------+-----------+--------------+--------------------------------+-------------------------+
   |         1 | Customer#1 | a         |           1 | 25-989-741-2988 |    711.56 | BUILDING     | to the even, regular platel... | 2023-01-17 13:43:25.380 |
   |         1 | Customer#1 | a         |          12 | 25-989-741-2988 |    711.56 | BUILDING     | to the even, regular platel... | 2023-01-17 13:43:31.383 |
   |         1 | Customer#1 | a         |           2 | 25-989-741-2988 |    711.56 | BUILDING     | to the even, regular platel... | 2023-01-17 13:43:28.381 |
   |         1 | Customer#1 | a         |           3 | 25-989-741-2988 |    711.56 | BUILDING     | to the even, regular platel... | 2023-01-17 13:43:31.383 |
   +-----------+------------+-----------+-------------+-----------------+-----------+--------------+--------------------------------+-------------------------+
   ```
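   One workaround for batch readers until the cross-partition delete is 
handled is to collapse the changelog to the latest row per record key. Below 
is a minimal pure-Python sketch of that logic (field names taken from the 
output above; in a real pipeline this would be a Spark window function or 
`GROUP BY` with max ts, which this snippet only approximates):

   ```python
   from operator import itemgetter

   def latest_per_key(rows, key="c_custkey", ts="ts"):
       """Keep only the most recent row for each record key.

       Mirrors the de-duplication a batch reader needs on top of a
       changelog-enabled Hudi table: group by key, keep the max-ts row.
       String comparison works here because ts has a fixed format.
       """
       latest = {}
       for row in rows:
           k = row[key]
           if k not in latest or row[ts] > latest[k][ts]:
               latest[k] = row
       return sorted(latest.values(), key=itemgetter(key))

   # Abridged rows from the duplicated batch read above:
   rows = [
       {"c_custkey": 1, "c_nationkey": 1, "ts": "2023-01-17 13:43:25.380"},
       {"c_custkey": 1, "c_nationkey": 2, "ts": "2023-01-17 13:43:28.381"},
       {"c_custkey": 1, "c_nationkey": 3, "ts": "2023-01-17 13:43:31.383"},
   ]
   print(latest_per_key(rows))  # only the 13:43:31.383 row survives
   ```

   In Spark SQL the same idea would be a `ROW_NUMBER()` window over the 
record key ordered by ts descending, keeping row number 1.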
   
   changelog.enabled=false => Flink incremental streaming errors; the batch 
read has no duplicates and the data is accurate
   ```shell
   Caused by: java.lang.IllegalStateException: Not expected to see delete records in this log-scan mode. Check Job Config
       at org.apache.hudi.common.table.log.HoodieUnMergedLogRecordScanner.processNextDeletedRecord(HoodieUnMergedLogRecordScanner.java:60)
       at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
       at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:473)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:343)
       ... 10 more
   ```
   





[GitHub] [hudi] shengchiqu commented on issue #7229: [SUPPORT] flink connector sink Update the partition value, the old data is still there

2022-11-17 Thread GitBox


shengchiqu commented on issue #7229:
URL: https://github.com/apache/hudi/issues/7229#issuecomment-1319602176

   I tried setting changelog.enabled=false and the problem was solved.
   Is this because changelog mode does not support the global index?
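
   The behaviour reported across this thread can be pictured as two views of 
the same write sequence: changelog mode retains every change record, while 
the merged view keeps only the latest state per record key. A rough 
illustration of the retention semantics as I understand them (not Hudi code; 
the partition values are made up for the example):

   ```python
   # Sequence of upserts to record key 1; the partition column changes each time.
   writes = [
       {"key": 1, "partition": "nation=1"},
       {"key": 1, "partition": "nation=2"},
       {"key": 1, "partition": "nation=3"},
   ]

   # changelog.enabled=true: every change record is kept, so a plain batch
   # scan sees one row per historical partition value.
   changelog_view = list(writes)

   # changelog.enabled=false: records are merged on the record key, so the
   # batch view holds only the latest partition value.
   merged = {}
   for w in writes:
       merged[w["key"]] = w
   merged_view = list(merged.values())

   print(len(changelog_view))  # 3 rows -> the duplicates reported above
   print(len(merged_view))     # 1 row  -> the de-duplicated batch result
   ```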

