Hi Michal,

Thanks for your detailed reply, it was very helpful. The motivation for 
replacing Kafka Connect is mostly related to having to run backfills from 
time to time. We store all the raw data from Kafka Connect, extract the fields 
we're currently using, then drop the extracted data and keep the raw JSON. 
That's fine as long as backfilling is never needed, but when it becomes 
necessary, processing 90+ days of JSON at 12+ billion rows per day using Hive 
LLAP is excruciatingly slow. We therefore wanted to have the data in ORC format 
as early as possible instead of adding an intermediate job to transform the 
JSON to ORC in the current pipeline. Changing this part of the pipeline over 
should also result in an overall reduction in resources used - nothing crazy 
for this first topic that we're changing over, but if it goes well, we have a 
few Kafka Connect clusters that we would be interested in converting, and that 
would also free up a ton of CPU time in Hive.

Thanks,
Aaron


On Thu, 2023-12-07 at 13:32 +0100, Michal Klempa wrote:
Hi Aaron,
I do not know Gobblin, so no advice there.

You write that Kafka Connect currently dumps to files; as you probably already 
know, Kafka Connect can't do the aggregation.
To my knowledge, NiFi is also ETL with local transformation - there is no state 
maintenance on a global scale. You can write processors to do stateful 
transformation, but for this task that would be tedious in my opinion. I would 
put NiFi out of the game.

Now to the requirements. I assume you have:
- large volume (~1k events per second)
- small messages (JSONs < 1 kB)
- a need to have the data queryable by stakeholders in near real-time (seconds 
at most) after a window (aggregation) is triggered


Then it makes sense to think about doing the aggregation on the fly in a 
real-time framework, i.e. a non-stop running real-time ETL job.
If your use case does not satisfy those criteria, e.g. low volume or no 
real-time need (1 to 5 minutes of lateness is fine), I would strongly encourage 
you to avoid real-time stateful streaming, as it is complicated to set up, 
scale, maintain, run and, above all, to code bug-free. It is a non-stop running 
application: any memory leak and you are restarting on OOM every couple of 
hours. It is hard.

You may have:
- high volume + no real-time need (5 minutes of lateness is fine)

In that case, running a PySpark batch job every 5 minutes on an ad-hoc AWS 
spot-instance cluster is just fine, along the lines of the sketch below.
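
A rough sketch of that periodic batch job, assuming the spark-sql-kafka 
connector is on the classpath; the topic, schema, paths and 5-minute window are 
placeholders, and the bookkeeping of which offsets the previous run already 
processed is left out:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, LongType

spark = SparkSession.builder.appName("kafka-to-orc-batch").getOrCreate()

# Illustrative schema; replace with the fields you actually extract from the JSON.
schema = (StructType()
          .add("user_id", StringType())
          .add("event_type", StringType())
          .add("ts", LongType()))

raw = (spark.read.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "events")
       .option("startingOffsets", "earliest")   # in practice: offsets saved by the previous run
       .option("endingOffsets", "latest")
       .load())

parsed = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*")
          .withColumn("event_time", (F.col("ts") / 1000).cast("timestamp")))

# Summarize per 5-minute window, then append the result as ORC.
agg = parsed.groupBy(F.window("event_time", "5 minutes"), "event_type").count()
agg.write.mode("append").orc("hdfs:///warehouse/events_agg")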

You may have:
- low volume + no real-time need (5 minutes of lateness is fine)

In that case, just run a plain single-instance Python script that does the job: 
at 1k to 100k events you can consume from Kafka directly, pack the data into 
ORC, and dump it to S3/HDFS on a single CPU. Use any cron to run it every 5 
minutes. Done. Roughly like the sketch below.
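
A minimal sketch of that script, assuming confluent-kafka and pyarrow (>= 7 for 
Table.from_pylist); broker, topic and output path are placeholders, and it 
writes to a local path for simplicity:

import json, time
import pyarrow as pa
import pyarrow.orc as orc
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "orc-dumper",
    "enable.auto.commit": False,        # commit only after the ORC file is safely written
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

records, deadline = [], time.time() + 60   # drain for up to a minute per cron run
while time.time() < deadline:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    records.append(json.loads(msg.value()))

if records:
    table = pa.Table.from_pylist(records)   # flat JSON dicts -> Arrow table
    orc.write_table(table, f"/data/landing/events-{int(time.time())}.orc")
    consumer.commit()                        # offsets advance only after a successful write
consumer.close()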


In case your use case is:
- large volume + real-time

For this, Flink and Spark Structured Streaming are both a good fit, but there 
is also a thing called Kafka Streams, which I would suggest adding as a 
competitor. There is also Beam (Google Dataflow), if you are on GCP already. 
All of them do the same job; the Structured Streaming shape of it looks roughly 
like the sketch below.
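
For reference, a sketch of the streaming variant in PySpark Structured 
Streaming (roughly what you describe already having); the schema, watermark, 
trigger and paths are illustrative, not a tested job:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, LongType

spark = SparkSession.builder.appName("kafka-to-orc-streaming").getOrCreate()

schema = (StructType()
          .add("user_id", StringType())
          .add("event_type", StringType())
          .add("ts", LongType()))

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*")
          .withColumn("event_time", (F.col("ts") / 1000).cast("timestamp")))

agg = (events
       .withWatermark("event_time", "10 minutes")   # bound state; windows close once the watermark passes
       .groupBy(F.window("event_time", "5 minutes"), "event_type")
       .count())

query = (agg.writeStream
         .format("orc")
         .option("path", "hdfs:///warehouse/events_agg")
         .option("checkpointLocation", "hdfs:///checkpoints/events_agg")  # where the job resumes after a failure
         .outputMode("append")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()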


Flink vs. Spark Structured Streaming vs. Kafka Streams:
Deployment: Kafka Streams is just one fat jar; with Flink and Spark you need to 
maintain clusters. Both frameworks are working on being Kubernetes-native, but 
neither is easy to set up either.
Coding: everything is JVM; Spark has Python, and Flink added Python too. There 
seem to be some Python attempts at the Kafka Streams approach, but I have no 
experience with them.
Fault tolerance: I have real-world experience with Flink and Spark Structured 
Streaming; both can restart from checkpoints. Flink also has savepoints, which 
is a good feature for starting a new job after modifications (but also not easy 
to set up).
Automatically scalable: I think none of the open-source options has this 
feature out of the box (correct me if wrong). You may want to pay for the 
Ververica platform (from the Flink authors) or Databricks (from the Spark 
authors), and there must be something from Confluent or its competitors too. 
Google of course has its Dataflow (Beam API). All auto-scaling is painful, 
however: each rescale means a reshuffle of the data.
Exactly once: To my knowledge, only Flink nowadays offers end-to-end 
exactly-once, and I would not be sure whether that can be achieved with ORC on 
HDFS as the destination. Maybe an idempotent ORC writer can be used (see the 
sketch below), or some other form of "transaction" on the destination must 
exist.

All in all, if I were solving your problem, I would first attack the 
requirements list and ask whether it can't be done more simply. If not, Flink 
would be my choice, as I have had good experience with it and you can really 
hack anything inside. But be prepared: the requirements list is hard, and even 
if you get the pipeline up and running in 2 weeks, you will surely revisit the 
decision after some incidents in the next 6 months.
If you loosen the requirements a bit, it becomes easier and easier. Your 
current solution sounds very reasonable to me: you picked something that works 
out of the box (Kafka Connect) and did ELT, letting something that can 
aggregate out of the box (Hive) do the aggregation. Why exactly do you need to 
replace it?

Good luck, M.

On Fri, Dec 1, 2023 at 11:38 AM Aaron Grubb <aa...@kaden.ai> wrote:
Hi all,

Posting this here to avoid biases from the individual mailing lists on why the 
product they're using is the best. I'm analyzing tools to replace a section of 
our pipeline with something more efficient. Currently we're using Kafka Connect 
to take data from Kafka and put it into S3 (not HDFS because the connector is 
paid) in JSON format, then Hive reads the JSON from S3 and creates ORC files in 
HDFS after a group by. I would like to replace this with something that reads 
Kafka, applies aggregations and windowing in place, and writes to HDFS 
directly. I know that the impending Hive 4 release will support this, but Hive 
LLAP is *very* slow when processing JSON. So far I have a working PySpark 
application that accomplishes this replacement using structured streaming + 
windowing; however, the decision to evaluate Spark was based on there being a 
use for Spark in other areas, so I'm interested in getting opinions on other 
tools that may be better for this use case based on resource usage, ease of 
use, scalability, resilience, etc.

In terms of absolute must-haves:
- Read JSON from Kafka
- Ability to summarize data over a window
- Write ORC to HDFS
- Fault tolerant (can tolerate anywhere from a single machine to an entire 
cluster going offline unexpectedly while maintaining exactly-once
guarantees)

Nice-to-haves:
- Automatically scalable, both up and down (doesn't matter if standalone, YARN, 
Kubernetes, etc)
- Coding in Java not required

So far I've found that Gobblin, Flink and NiFi seem capable of accomplishing 
what I'm looking for, but neither I nor anyone at my company has
any experience with those products, so I was hoping to get some opinions on 
what the users here would choose and why. I'm also open to other
tools that I'm not yet aware of.

Thanks for your time,
Aaron


