machadoluiz commented on issue #8824:
URL: https://github.com/apache/hudi/issues/8824#issuecomment-1570666233
Thank you for looking into our issue, @nfarah86 and
@ad1happy2go. Here are our answers to each of your questions:
> following up from slack: 6 years of data in the active timeline is a lot
of data.
>
> 1. what kind of queries are you running? Do you need incremental queries
across 6 years of data?
> 2. Do you have a multi-writer situation where multiple writers are writing
to the same table?
> 3. Can you share the Hudi timeline in the .hoodie folder?
> 4. is the data mostly insert or upsert or a mixed of both?
> 5. How are you partitioning the data?

We acknowledge that 6 years is a large amount of data, but we need to keep
the history of each run over time, as the data are used to make decisions that
affect other companies. For that reason, we have to log the state of the data
at the moment it was used for decision making, and for legal reasons we must
store this history for auditing and future reference. We also have records in
the database with retroactive dates, which can change the results of our
indicators.
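For reference, these are the archival and cleaner knobs involved in keeping the active timeline small (a minimal sketch, not our production config; the key names are standard Hudi write options, the values here are purely illustrative):

```python
# Sketch of Hudi timeline-archival settings: completed commits beyond
# keep.max.commits move from the active timeline to the archived timeline.
# Values below are illustrative, not a recommendation.
archival_opts = {
    # archive once the active timeline exceeds 30 commits
    "hoodie.keep.max.commits": "30",
    # keep at least 20 commits active after archiving
    "hoodie.keep.min.commits": "20",
    # cleaner retention must stay below keep.min.commits
    "hoodie.cleaner.commits.retained": "10",
}
```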
1. Usually, we query the latest version of the data, apply filters, and
perform other operations. However, we need to store the data history because we
will eventually need to query a specific period of time. That is why we chose
Hudi.
2. No, all tables have a single script and are not run in parallel.
3. For legal reasons, we cannot share the files of the actual tables, but we
can share the files from the example described above, in which we simulate the
problem.
Here is the link to download the files: [Google
Drive](https://drive.google.com/drive/folders/1Iyu9AlwVHSqQLN8cR5diOF2pVZtl96ib)
4. It varies, but the most common operations are `insert_overwrite_table`,
`insert`, and `upsert`. In the example above, we tested with
`insert_overwrite_table`.
5. The partitioning also varies depending on the table size. Usually, we use
the script execution day (`LOAD_DATE`) or a specific column of the data itself
as the partition field. There are also very small tables that are not
partitioned.
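To make answers 4 and 5 concrete, here is a minimal sketch of the kind of writer options we described (key names are standard Hudi Spark datasource options; the table name and record key column are hypothetical placeholders):

```python
# Sketch of the writer options described above: insert_overwrite_table
# operation with LOAD_DATE-based partitioning. Table name and record key
# are hypothetical examples, not our real configuration.
hudi_write_opts = {
    "hoodie.table.name": "example_table",                       # placeholder
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    "hoodie.datasource.write.partitionpath.field": "LOAD_DATE",
    "hoodie.datasource.write.recordkey.field": "id",            # placeholder
}
# A very small table would simply omit the partitionpath field.
```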
> Also Please confirm you are using COW table as I don't see table.type in
configs. Default value is COW.

We are using CoW, since we didn't set `table.type` explicitly and the default
applies, but we also tested MoR and the performance issue remains.
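For the MoR test, the only change was the table type option (a sketch; `hoodie.datasource.write.table.type` is the standard Hudi option, and `COPY_ON_WRITE` is the default when it is unset, which matches what we observed):

```python
# Sketch: switching the table type for the MoR test run.
# When this option is unset, Hudi defaults to COPY_ON_WRITE.
mor_opts = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}
```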