kpurella opened a new issue #2062:
URL: https://github.com/apache/hudi/issues/2062


   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? yes
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org. not yet
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   Seeing duplicate records in the _ro and _rt tables after 2 incremental runs.
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior: 
   1. Ingest the first partition with partitionpath=year=2020/month=08/day=01
   2. Ingest the first partition once more with the same partitionpath
   3. Ingest the second partition with partitionpath=year=2020/month=08/day=02
   
   
   **Expected behavior**
   
   Hudi should merge the deltas so that each record key appears only once.
   
   Hudi version : 0.5.2-incubating
   
   Spark version : 2.4.5
   
   Hive version : 2.3.6
   
   Hadoop version : 2.8.5
   
   Storage (HDFS/S3/GCS..) : s3
   
   Running on Docker? (yes/no) : no
   
   EMR : 5.30.1
   
   
   Configuration
   hoodie.index.type=GLOBAL_BLOOM
   hoodie.compact.inline.max.delta.commits=10
   hoodie.datasource.write.table.type=MOR
   hoodie.datasource.write.operation=upsert
   
   Test Data:
   partition year=2020/month=08/day=01
   388128891|13511|1|N|2014-10-17
   328587935|13109|7|A|2015-02-02
   329770530|13113|1|N|2013-07-26
   388128892|13553|7|A|2014-10-17
   388128893|24886|1|N|2014-10-17
   388128894|24887|7|A|2014-10-17
   388128895|24888|1|N|2014-10-17
   388128896|24968|7|A|2014-10-17
   328587936|13110|1|N|2015-02-01
   328587937|13116|7|A|2015-02-01
   328587938|13122|1|A|2015-02-01
   328587939|13248|1|A|2015-02-01
   328587940|13118|3|A|2015-02-01
   388128896|25110|3|A|2020-08-01
   328587935|13119|2|A|2020-08-01
   328587941|13115|2|A|2020-08-01
   
   
   partition year=2020/month=08/day=02
   388128896|25110|6|N|2020-08-02
   328587935|13119|4|N|2020-08-02
   328587941|13115|7|N|2020-08-02
   328587938|13122|7|N|2015-02-02
   328587939|13248|5|A|2015-02-02
   128587939|33248|6|A|2015-02-02
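
The overlap between the two partitions can be checked without Spark. The sketch below is plain Python, not Hudi code. It assumes the record key is the pair (col1, col2), inferred from the `_hoodie_record_key` values shown in the query output in this report; the names `PARTITIONS`, `keys_by_partition`, and `cross_partition_keys` are hypothetical.

```python
from collections import defaultdict

# Test data copied from this report. The first two fields are col1 and col2;
# the names of the remaining fields are not stated, so they are ignored here.
PARTITIONS = {
    "year=2020/month=08/day=01": """
388128891|13511|1|N|2014-10-17
328587935|13109|7|A|2015-02-02
329770530|13113|1|N|2013-07-26
388128892|13553|7|A|2014-10-17
388128893|24886|1|N|2014-10-17
388128894|24887|7|A|2014-10-17
388128895|24888|1|N|2014-10-17
388128896|24968|7|A|2014-10-17
328587936|13110|1|N|2015-02-01
328587937|13116|7|A|2015-02-01
328587938|13122|1|A|2015-02-01
328587939|13248|1|A|2015-02-01
328587940|13118|3|A|2015-02-01
388128896|25110|3|A|2020-08-01
328587935|13119|2|A|2020-08-01
328587941|13115|2|A|2020-08-01
""",
    "year=2020/month=08/day=02": """
388128896|25110|6|N|2020-08-02
328587935|13119|4|N|2020-08-02
328587941|13115|7|N|2020-08-02
328587938|13122|7|N|2015-02-02
328587939|13248|5|A|2015-02-02
128587939|33248|6|A|2015-02-02
""",
}


def keys_by_partition(partitions):
    """Map each assumed record key (col1, col2) to the partitions containing it."""
    seen = defaultdict(set)
    for path, block in partitions.items():
        for line in block.strip().splitlines():
            col1, col2, *_ = line.split("|")
            seen[(col1, col2)].add(path)
    return seen


def cross_partition_keys(partitions):
    """Return the record keys that occur in more than one partition, sorted."""
    seen = keys_by_partition(partitions)
    return sorted(key for key, paths in seen.items() if len(paths) > 1)


if __name__ == "__main__":
    for key in cross_partition_keys(PARTITIONS):
        print(key)
```

On this data the check reports five keys that occur in both partitions, including (388128896, 25110). Under the assumption above, these are the keys that a global index would be expected to resolve to a single row each.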
   
    0: jdbc:hive2://xx.xxx.xx.xx:10000> select `_hoodie_commit_time`,`_hoodie_record_key`,col1,col2,moddate,partitionpath from test_ro where col1='388128896' and col2=25110;
   
   +----------------------+----------------------------+------------+--------+-------------------+----------------------------+
   | _hoodie_commit_time  | _hoodie_record_key         | col1       | col2   | moddate           | partitionpath              |
   +----------------------+----------------------------+------------+--------+-------------------+----------------------------+
   | 20200901003201       | col1:388128896,col2:25110  | 388128896  | 25110  | 1596240000000000  | year=2020/month=08/day=01  |
   | 20200901003731       | col1:388128896,col2:25110  | 388128896  | 25110  | 1596326400000000  | year=2020/month=08/day=02  |
   +----------------------+----------------------------+------------+--------+-------------------+----------------------------+
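
The two `moddate` values can be decoded to confirm which input rows they came from. This is a small sketch that assumes `moddate` is stored as microseconds since the Unix epoch in UTC (an inference from the magnitude of the values, not something stated in the report); `micros_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_utc(micros):
    """Convert microseconds since the Unix epoch to a UTC datetime (assumed unit)."""
    return EPOCH + timedelta(microseconds=micros)


if __name__ == "__main__":
    for value in (1596240000000000, 1596326400000000):
        print(value, micros_to_utc(value).isoformat())
```

Under that assumption the values decode to 2020-08-01 and 2020-08-02, which match the last field of the two test-data rows for this key, one in each partition.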
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   Thank you for looking into this. Please let me know if you need any more details.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

