d4r3topk opened a new issue, #8644: URL: https://github.com/apache/hudi/issues/8644
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org.
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

We have shifted to Hudi 0.13.0 on Spark 3.3, using it as an external library JAR for ingesting datasets with AWS Glue streaming jobs. We are noticing at least 10-13% data loss in every partition (partitioned by `region=x/year=xxxx/month=xx/day=xx/hour=xx`). The job that ingests this dataset and exhibits the data loss splits an incoming dynamic frame on a field and then creates/updates a different Hudi table per field value (at most 3 tables via 1 job). We have another job that does not do this splitting and ingests the incoming dynamic frame into a single table; that job has no data loss.

The jobs run with the following parameters:

```python
init_load_config = {"hoodie.datasource.write.operation": "bulk_insert"}

partition_data_config = {
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.write.partitionpath.field": "region,year,month,day,hour",
    "hoodie.datasource.hive_sync.partition_fields": "region,year,month,day,hour",
}

init_common_config = {
    "className": "org.apache.hudi",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.write.reconcile.schema": "true",
    "write.parquet.max.file.size": 268435456,  # 256 MB = 268435456 Bytes
    "hoodie.parquet.small.file.limit": 209715200,  # 200 MB = 209715200 Bytes
    "hoodie.datasource.write.hive_style_partitioning": "true",
}

metadata_config = {
    "hoodie.archive.async": "true",
    "hoodie.archive.merge.enable": "true",
    "hoodie.datasource.hive_sync.support_timestamp": "true",
    "clustering.plan.partition.filter.mode": "RECENT_DAYS",
    "clustering.plan.strategy.daybased.lookback.partitions": 2,
}

plug_config = {
    "hoodie.datasource.write.precombine.field": "ApproximateArrivalTimestamp",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.hive_sync.database": "db_name",
    "hoodie.table.name": "table_name",
    "hoodie.datasource.hive_sync.table": "table_name",
    "hoodie.copyonwrite.record.size.estimate": 2000,
    "hoodie.clustering.async": "false",
    "hoodie.metadata.enable": "false",
}
```

When we first noticed data loss, clustering and the metadata table were enabled. However, after reading about the regression of the metadata table and timeline server for streaming jobs (with COW), we disabled them and created a completely new job with a new destination and new table names to validate that there is no data loss, but we are still losing data.

More context:

* No error messages in the Spark logs.
* This job splits a data stream on a field into 3 Glue dynamic frames and then runs a for loop to ingest each of them. To keep the names distinct, we suffix the `hoodie.table.name` and `hoodie.datasource.hive_sync.table` properties with the field value (for example, `table_name_dataset1`, where `dataset1` is the field value). I suspect the 3 table updates are somehow dropping records.
* Ran `commits show` in the Hudi CLI and inspected a few commits; it shows 0 write errors for the ones I've checked so far. (If there is an automated way to check all commits for errors at scale, I would be happy to learn about it.)

**To Reproduce**

Steps to reproduce the behavior:

1. Write a Glue dynamic frame with 5 fields
2. Split the dynamic frame on any one field
3. Iterate through the resulting dynamic frames and ingest each into a different Hudi table

**Expected behavior**

There should be no data loss, and counts should match exactly. In case any data loss occurs, the logs should indicate the loss and its potential reason.
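To make the per-table loop concrete, here is a minimal sketch (hypothetical helper names, not the actual job code) of how the per-dataset options could be assembled: the shared config dicts are merged, and the two table-name properties are suffixed with the split-field value.

```python
# Hypothetical sketch: merge the shared Hudi config dicts and derive
# per-dataset table names by suffixing the split-field value.
def build_hudi_options(base_table_name: str, dataset: str, *configs: dict) -> dict:
    """Merge shared config dicts; later dicts override earlier ones."""
    options = {}
    for cfg in configs:
        options.update(cfg)
    # Each split gets its own table, e.g. table_name_dataset1
    options["hoodie.table.name"] = f"{base_table_name}_{dataset}"
    options["hoodie.datasource.hive_sync.table"] = f"{base_table_name}_{dataset}"
    return options

# Abbreviated versions of the configs above, for illustration
partition_data_config = {
    "hoodie.datasource.write.partitionpath.field": "region,year,month,day,hour",
}
plug_config = {
    "hoodie.table.name": "table_name",
    "hoodie.datasource.hive_sync.table": "table_name",
}

opts = build_hudi_options("table_name", "dataset1", partition_data_config, plug_config)
# The suffixed names override the shared base name
assert opts["hoodie.table.name"] == "table_name_dataset1"
```

Since all three tables are written inside one micro-batch, it is worth confirming that each iteration really gets a disjoint slice of the frame and that no table shares a base path with another.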
**Environment Description**

* Hudi version : 0.13.0 / 0.12.1
* Spark version : 3.3
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No. AWS Glue streaming job.

**Additional context**

Here are the `pk_count` values for multiple partitions, with different parameters and hudi/non-hudi:

| day | hour | hudi-metadata-and-clustering-enabled | hudi-no-metadata-no-clustering-enabled | non-hudi |
|-----|------|--------------------------------------|----------------------------------------|----------|
| 04 | 00 | 362 | 366 | 366 |
| 04 | 01 | 262 | 367 | 367 |
| 04 | 02 | 252 | 353 | 365 |
| 04 | 03 | 262 | 360 | 366 |
| 04 | 04 | 248 | 354 | 366 |
| 04 | 05 | 252 | 366 | 366 |
| 04 | 06 | 251 | 365 | 365 |
| 04 | 07 | 256 | 368 | 368 |
| 04 | 08 | 251 | 360 | 366 |
| 04 | 09 | 253 | 359 | 365 |
| 04 | 10 | 258 | 366 | 366 |
| 04 | 11 | 258 | 366 | 366 |
| 04 | 12 | 260 | 360 | 366 |
| 04 | 13 | 252 | 364 | 364 |
| 04 | 14 | 253 | 366 | 366 |
| 04 | 15 | 254 | 361 | 367 |
| 04 | 16 | 254 | 365 | 365 |
| 04 | 17 | 250 | 314 | 366 |
| 04 | 18 | 250 | 360 | 367 |
| 04 | 19 | 253 | 318 | 365 |
| 04 | 20 | 257 | 364 | 364 |
| 04 | 21 | 252 | 364 | 364 |
| 04 | 22 | 266 | 364 | 371 |
| 04 | 23 | 256 | 368 | 368 |

**Stacktrace**

No stacktrace as such.
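As a quick sanity check on the counts above, the loss relative to the non-hudi baseline can be computed directly. A small sketch over a sample of the rows (hours 00, 01, 02, and 17):

```python
# Percentage loss vs. the non-hudi baseline, from a sample of the
# counts reported above: hour -> (metadata+clustering, no-metadata, non-hudi)
counts = {
    "00": (362, 366, 366),
    "01": (262, 367, 367),
    "02": (252, 353, 365),
    "17": (250, 314, 366),
}

baseline = sum(c[2] for c in counts.values())       # non-hudi total
with_metadata = sum(c[0] for c in counts.values())  # metadata+clustering total
no_metadata = sum(c[1] for c in counts.values())    # no-metadata total

def loss_pct(n: int) -> float:
    return round(100 * (baseline - n) / baseline, 1)

print(loss_pct(with_metadata))  # 23.1 (% lost with metadata+clustering, this sample)
print(loss_pct(no_metadata))    # 4.4 (% lost without, this sample)
```

On this sample the metadata+clustering run loses far more than the no-metadata run, which is consistent with the two configurations behaving differently rather than a single shared cause.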
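Regarding the question above about auditing write errors across all commits at scale: instead of the CLI, the commit metadata files can be parsed directly. This is a hedged sketch, assuming Hudi's `HoodieCommitMetadata` JSON layout for COW `.commit` files (`partitionToWriteStats` mapping each partition to a list of write stats, each carrying a `totalWriteErrors` field) — verify the field names against one of your actual commit files before relying on it.

```python
# Sketch: sum totalWriteErrors across every commit file in a table's
# .hoodie folder (local or mounted path; for S3 use an s3fs-style mount
# or list/download the files first).
import glob
import json
import os

def total_write_errors(commit_metadata: dict) -> int:
    """Sum totalWriteErrors over every write stat in one commit's metadata."""
    total = 0
    for stats in commit_metadata.get("partitionToWriteStats", {}).values():
        for stat in stats:
            total += stat.get("totalWriteErrors", 0)
    return total

def audit_table(base_path: str) -> dict:
    """Map each *.commit file under <base_path>/.hoodie to its error count."""
    results = {}
    for path in sorted(glob.glob(os.path.join(base_path, ".hoodie", "*.commit"))):
        with open(path) as f:
            results[os.path.basename(path)] = total_write_errors(json.load(f))
    return results
```

Note that `totalWriteErrors` only counts records that failed during the write; it will not surface records that were deduplicated by the precombine field or never reached the writer, so a source-vs-sink count comparison is still needed alongside it.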