[GitHub] [hudi] machadoluiz commented on issue #8824: [SUPPORT] Performance and Data Integrity Issues with Hudi for Long-Term Data Retention

2023-06-05 Thread via GitHub


machadoluiz commented on issue #8824:
URL: https://github.com/apache/hudi/issues/8824#issuecomment-1577291016

   @ad1happy2go, given the current scenario, I'm curious whether there is an 
alternative approach we could consider that might help us avoid, or at least 
mitigate, this performance trade-off. Are there any specific strategies that 
have proven effective in similar contexts?
   
   Considering the gradual increase in runtime as the data grows, would you say 
that our current implementation is in line with recommended best practices? Or 
are there adjustments we could make to align better with Hudi's intended usage 
patterns?
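
   For instance, would tuning the timeline archival and cleaner settings along these lines be the right direction for our case? This is only a sketch to make the question concrete; the values are placeholders we have not validated, not settings we currently run with:

```python
# Sketch of archival/cleaner writer options we are considering; the numbers
# are placeholders for discussion, not values we have tested.
timeline_tuning = {
    "hoodie.keep.min.commits": "20",          # instants left in the active timeline after archival
    "hoodie.keep.max.commits": "30",          # archival triggers once the active timeline exceeds this
    "hoodie.cleaner.commits.retained": "10",  # older file versions the cleaner retains
    "hoodie.metadata.enable": "true",         # rely on the metadata table instead of full file listings
}
```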





[GitHub] [hudi] machadoluiz commented on issue #8824: [SUPPORT] Performance and Data Integrity Issues with Hudi for Long-Term Data Retention

2023-06-01 Thread via GitHub


machadoluiz commented on issue #8824:
URL: https://github.com/apache/hudi/issues/8824#issuecomment-1572195137

   @ad1happy2go, the runtime increase happens gradually. In one specific 
example, it reached 2 minutes and 30 seconds at around 300 commits (roughly 10 
months). This poses a challenge for us, given that it represents less than a 
year's worth of data. Is there any way we could improve this performance, or is 
this a trade-off we have to accept?
   
   Does Hudi perform these background operations using the actual data, or only 
the metadata?
   
   Does this mean that if the database grows, the cost/runtime of managing the 
metadata will increase proportionally? Or is it tied only to the file names, in 
which case the cost would stay roughly constant regardless of the size of the 
database?
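
   For context, this is roughly how we count the commits mentioned above (a minimal sketch that assumes the table sits on a locally accessible path; the real tables are listed through the corresponding filesystem client):

```python
# Minimal sketch: count completed instants in a Hudi table's active timeline.
# The base path is a placeholder; the suffixes cover the completed-instant
# files produced by our write operations (commits, delta commits, overwrites).
from pathlib import Path

def count_completed_instants(base_path: str) -> int:
    timeline = Path(base_path) / ".hoodie"
    suffixes = (".commit", ".deltacommit", ".replacecommit")
    return sum(1 for entry in timeline.iterdir() if entry.name.endswith(suffixes))

print(count_completed_instants("/data/hudi/example_table"))  # placeholder path
```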





[GitHub] [hudi] machadoluiz commented on issue #8824: [SUPPORT] Performance and Data Integrity Issues with Hudi for Long-Term Data Retention

2023-05-31 Thread via GitHub


machadoluiz commented on issue #8824:
URL: https://github.com/apache/hudi/issues/8824#issuecomment-1570666233

   Thank you for your attention to our issue, @nfarah86 and @ad1happy2go. Here 
are our responses to each of your questions:
   
   > following up from slack: 6 years of data in the active timeline is a lot 
of data.
   > 
   > 1. what kind of queries are you running? Do you need incremental queries 
across 6 years of data?
   > 2. Do you have a multi-writer situation where multiple writers are writing 
to the same table?
   > 3. Can you share the Hudi timeline in the .hoodie folder?
   > 4. Is the data mostly insert or upsert, or a mix of both?
   > 5. How are you partitioning the data?
   
   We acknowledge that 6 years is a large amount of data, but we need to keep 
the history of each run over time, as the data is used to make decisions that 
affect other companies. For this reason, we have to log the state of the data 
at the moment it was used for decision making, and for legal reasons we need to 
keep this history for auditing and future reference. We also have records in 
the database with retroactive dates, which may affect the results of our 
indicators.
   
   1. Usually, we query the latest version of the data and apply filters, among 
other operations. However, we need to keep the full history because it will 
occasionally be necessary to consult a specific period of time; that's why we 
chose Hudi.
   2. No. Each table is written by a single script, and the scripts do not run 
in parallel.
   3. For legal reasons, we cannot share the files of the actual tables, but we 
can share the files from the example described above, in which we simulate the 
problem. Here is the link to download them: [Google 
Drive](https://drive.google.com/drive/folders/1Iyu9AlwVHSqQLN8cR5diOF2pVZtl96ib)
   4. It varies, but the most common operations are "insert_overwrite_table", 
"insert", and "upsert". In the example above, we tested with 
"insert_overwrite_table" (see the sketch after this list).
   5. The partitioning also varies with table size. Usually we partition by the 
script execution date (LOAD_DATE) or by a specific column of the data itself; 
there are also some very small tables that are not partitioned.
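
   To make items 4 and 5 concrete, the sketch below shows roughly the shape of one of our daily loads; table, key, and column names are placeholders rather than our real schema, and `df` stands for the Spark DataFrame produced by the load script:

```python
# Simplified sketch of a daily load; names and the target path are placeholders.
hudi_options = {
    "hoodie.table.name": "example_table",
    "hoodie.datasource.write.operation": "insert_overwrite_table",  # other tables use "insert" or "upsert"
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "LOAD_DATE",  # or a column from the data itself
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://our-bucket/warehouse/example_table")  # placeholder path
)
```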
   
   > Also, please confirm you are using a COW table, as I don't see table.type 
in the configs. The default value is COW.
   
   Yes, we are on CoW, since we did not set the table type explicitly. We also 
tested MoR, and the performance issue remains.
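
   For reference, the only change in the MoR test relative to the write options sketched above was the table type (again a sketch, with the same placeholder names):

```python
# Only difference in the MoR test run; the default when unset is COPY_ON_WRITE.
hudi_options["hoodie.datasource.write.table.type"] = "MERGE_ON_READ"
```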

