Hi,

I have a Kafka topic using the Kafka S3 connector to dump data into S3 hourly
in Parquet format. These Parquet files are partitioned by ingestion time,
and each record contains deeply nested JSON fields. Each record is a
monolithic blob containing multiple events, each with its own event time.
This causes two issues: 1. queries by event time are slow; 2. the data is
hard to use because of the many levels of exploding required. I plan to use
the design below to solve these problems.

[image: Screen Shot 2020-07-19 at 9.44.06 PM.png]
In this design, I still use the S3 Parquet files dumped by the Kafka S3
connector as a backfill source for the Hudi pipeline. This is because the S3
connector pipeline is easier than the Hudi pipeline to set up and will be
working before the Hudi pipeline is ready. Also, the S3 connector pipeline
may be more reliable than the Hudi pipeline, given potential bugs in
DeltaStreamer. DeltaStreamer will decompose each monolithic Kafka record into
multiple event streams. Each event stream is written into its own Hudi
dataset, partitioned and sorted by its corresponding event time. These Hudi
datasets are synced with Hive, which is what users query, so they don't need
to care whether the underlying table format is plain Parquet or Hudi.
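To make the decomposition step concrete, here is a minimal Spark (Scala)
sketch of what I have in mind. In the real pipeline this logic would live
inside DeltaStreamer (e.g. a custom transformer) rather than a batch job, and
the field names (events, event_type, event_id, event_time), table names, and
S3 paths below are placeholders, not our actual schema:

    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("event-decompose-sketch").getOrCreate()

    // Hypothetical layout: each Kafka record carries an array of events under
    // "events"; each event has event_type, event_id, event_time plus a payload.
    val raw: DataFrame = spark.read.format("parquet")
      .load("s3://bucket/kafka-s3-connector/output/")

    val events = raw
      .select(explode(col("events")).as("event"))
      .select("event.*")
      .withColumn("event_date", to_date(col("event_time")))  // partition by event time

    // One Hudi dataset per event type (hudi 1 ... hudi n in the diagram).
    def writeEventStream(df: DataFrame, eventType: String): Unit = {
      df.filter(col("event_type") === eventType)
        .write.format("hudi")
        .option("hoodie.table.name", s"events_$eventType")
        .option("hoodie.datasource.write.recordkey.field", "event_id")
        .option("hoodie.datasource.write.precombine.field", "event_time")
        .option("hoodie.datasource.write.partitionpath.field", "event_date")
        // Hive sync so users query the Hive table, not the storage format directly.
        .option("hoodie.datasource.hive_sync.enable", "true")
        .option("hoodie.datasource.hive_sync.table", s"events_$eventType")
        .option("hoodie.datasource.hive_sync.partition_fields", "event_date")
        .mode(SaveMode.Append)
        .save(s"s3://bucket/hudi/events_$eventType")
    }

    Seq("click", "purchase").foreach(t => writeEventStream(events, t))
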
Hopefully, this design improves query performance, since the datasets are
partitioned and sorted by event time instead of Kafka ingestion time. The
user experience should also improve because users query the extracted events
directly.

Questions:
1. Do you see any issue with DeltaStreamer handling both streaming and
backfill at the same time? I know a Hudi dataset cannot be written by
multiple writer clients simultaneously, and I don't want DeltaStreamer to
stop handling the streaming data while the backfill runs. DeltaStreamer will
use dynamic allocation, so assuming the cluster has enough capacity, the
extra load from the backfill should not be an issue.

2. If I want to time travel to a previous point in time (e.g. 11:00:00 AM
PST on the first day of last month), how can I keep hudi 1 and hudi 2 (...
hudi n) in sync? AFAIK, Hudi time travel is done by commit instead of
timestamp. Should I do the following (a rough sketch follows below):
 a. list the commits of these Hudi datasets,
 b. find the commits that are close to each other and closest to the desired
timestamp,
 c. apply time travel to each Hudi dataset at its chosen commit.
Is there an easier and more accurate way? Will Hudi support time travel by
timestamp in the future, as Delta Lake does?
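
For concreteness, here is a rough Scala sketch of the a/b/c workflow I have
in mind, assuming copy-on-write tables (completed commits show up as
*.commit files in the .hoodie folder) and using the incremental-query
point-in-time trick (begin instant "000", end instant = chosen commit). The
bucket paths and the desired instant are made-up placeholders, and the option
names are the ones I see in recent Hudi docs:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("hudi-time-travel-sketch").getOrCreate()

    // Step a: list completed commit instants by scanning the .hoodie timeline folder.
    def completedCommits(basePath: String): Seq[String] = {
      val fs = FileSystem.get(new java.net.URI(basePath),
        spark.sparkContext.hadoopConfiguration)
      fs.listStatus(new Path(basePath, ".hoodie"))
        .map(_.getPath.getName)
        .filter(_.endsWith(".commit"))
        .map(_.stripSuffix(".commit"))
        .sorted
        .toSeq
    }

    // Step b: pick the latest commit at or before the desired instant (instants are
    // yyyyMMddHHmmss-style strings, so lexicographic comparison works).
    def commitAtOrBefore(basePath: String, desiredInstant: String): Option[String] =
      completedCommits(basePath).filter(_ <= desiredInstant).lastOption

    // Step c: read the dataset as of that commit via an incremental query that
    // spans the whole timeline up to the chosen instant.
    def snapshotAsOf(basePath: String, endInstant: String): DataFrame =
      spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "000")
        .option("hoodie.datasource.read.end.instanttime", endInstant)
        .load(basePath)

    val desired = "20200601190000"  // placeholder instant for the desired point in time
    val paths = Seq("s3://bucket/hudi/events_click", "s3://bucket/hudi/events_purchase")
    val frames = paths.flatMap(p => commitAtOrBefore(p, desired).map(c => snapshotAsOf(p, c)))
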

3. Any other concerns about this design?

Thanks for any hints.

Regards
Lian
