Hi Aaron,
I do not know Gobblin, so no advice there.

You write that Kafka Connect currently dumps to files; as you probably
already know, Kafka Connect can't do the aggregation itself.
To my knowledge, NiFi is also ETL with local transformations only; there
is no state maintained on a global scale. You can write custom processors
to do stateful transformation, but for this task that would be tedious in
my opinion. I would take NiFi out of the game.

Now to the requirements. I assume you have:
- large volume (~1k events per second)
- small messages (JSON documents under ~1 KB each)
- need to have data available in near real-time (seconds at most) after a
window (aggregation) is triggered, so that data stakeholders can query it

Then it makes sense to do the aggregation on the fly in a real-time
framework, i.e. a non-stop running real-time ETL job.
If your use case does not satisfy those criteria, e.g. low volume, or no
real-time need (1 to 5 minutes of lateness is fine), I would strongly
encourage you to avoid real-time stateful streaming, as it is complicated
to set up, scale, maintain, run and, above all, code bug-free. It is a
non-stop running application: any memory leak and you are restarting on
OOM every couple of hours. It is hard.

You may have:
- high volume + no real-time (5 minutes of lateness is fine)
In that case, running a PySpark batch job every 5 minutes on an ad-hoc
cluster of AWS spot instances is just fine, something like the sketch below.
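For concreteness, a minimal sketch of such a batch job, assuming PySpark
with the spark-sql-kafka connector on the classpath; topic name, schema and
output path are my inventions, not from your setup:

    # Batch job run every 5 minutes by cron: read a slice of Kafka,
    # aggregate per window, append ORC.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col, window, count

    spark = SparkSession.builder.appName("kafka-batch-agg").getOrCreate()
    schema = "user_id STRING, event STRING, ts TIMESTAMP"  # hypothetical schema

    raw = (spark.read.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "events")
           .option("startingOffsets", "earliest")  # in practice, track offsets per run
           .option("endingOffsets", "latest")
           .load())

    parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("j"))
                 .select("j.*"))

    (parsed.groupBy(window(col("ts"), "5 minutes"), col("event"))
           .agg(count("*").alias("n"))
           .write.mode("append").orc("hdfs:///warehouse/events_agg"))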

You may have:
- low volume + no real-time (5 minutes of lateness is fine)
In that case, just run a plain single-instance Python script that does the
job: at 1k to 100k events you can consume from Kafka directly, pack the
batch into ORC, and dump it on S3/HDFS on a single CPU. Use any cron to run
it every 5 minutes. Done. Roughly like the sketch below.
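A sketch of that script, assuming kafka-python and pyarrow; topic, broker
and paths are placeholders:

    # Cron-driven single-process job: drain the topic, write one ORC file,
    # then commit offsets, so a crash before the write means re-reading,
    # not data loss.
    import json, time
    import pyarrow as pa
    from pyarrow import orc
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="broker:9092",
        group_id="orc-dumper",
        enable_auto_commit=False,      # commit manually, after the file is written
        consumer_timeout_ms=5000,      # stop iterating once the topic is drained
        value_deserializer=lambda b: json.loads(b),
    )

    rows = [msg.value for msg in consumer]
    if rows:
        table = pa.Table.from_pylist(rows)
        orc.write_table(table, f"/data/events_{int(time.time())}.orc")
        consumer.commit()              # at-least-once, not exactly-once
    consumer.close()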

In case your use case is:
- large volume + real-time
For this, Flink and Spark Structured Streaming are both a good fit, but
there is also Kafka Streams, which I would add as a competitor. There is
also Beam (Google Dataflow) if you are on GCP already. All of them do the
same job. Your PySpark prototype already has this shape; see the sketch
below.
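Since you already have a PySpark structured streaming application, here is
roughly that shape as a sketch (schema, names and paths are assumptions on
my side):

    # Non-stop streaming job: windowed aggregation over Kafka JSON, ORC file
    # sink on HDFS, checkpoint directory for restart-after-failure.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col, window, count

    spark = SparkSession.builder.appName("kafka-stream-agg").getOrCreate()
    schema = "user_id STRING, event STRING, ts TIMESTAMP"  # hypothetical schema

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("j"))
              .select("j.*")
              .withWatermark("ts", "10 minutes"))  # bounds state; later data is dropped

    agg = (events.groupBy(window(col("ts"), "1 minute"), col("event"))
                 .agg(count("*").alias("n")))

    (agg.writeStream.format("orc")
        .option("path", "hdfs:///warehouse/events_agg")
        .option("checkpointLocation", "hdfs:///checkpoints/events_agg")
        .outputMode("append")          # append mode requires the watermark above
        .start()
        .awaitTermination())

The checkpointLocation is what gives you the restart-from-checkpoint
behavior I mention below.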

Flink vs. Spark Structured Streaming vs. Kafka Streams:
Deployment: Kafka Streams is just one fat jar; with Flink and Spark you
need to maintain clusters. Both frameworks are working on being
Kubernetes-native, but that is not easy to set up either.
Coding: everything is JVM; Spark has Python, and Flink has added Python
too. There seem to be some Python attempts at the Kafka Streams approach,
but I have no experience with them.
Fault tolerance: I have real-world experience with Flink and Spark
Structured Streaming; both can restart from checkpoints. Flink also has
savepoints, which are a good feature for starting a new job after
modifications (but also not easy to set up).
Automatic scaling: I think none of the open-source options has this out of
the box (correct me if I'm wrong). You may want to pay for the Ververica
platform (the Flink authors' offering) or Databricks (the Spark authors'
offering); there must be something from Confluent or its competitors, too.
Google of course has Dataflow (the Beam API). All auto-scaling is painful,
however: each rescale means a reshuffle of the data.
Exactly once: to my knowledge, only Flink nowadays offers end-to-end
exactly-once, and I am not sure whether that can be achieved with ORC on
HDFS as the destination. Maybe an idempotent ORC writer can be used, or
some other form of "transaction" on the destination must exist. A sketch
of the idempotent-writer idea is below.
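To make the idempotent-writer idea concrete, a hedged Python sketch
(pyarrow assumed, everything else illustrative): name each output file
deterministically from the window it covers and publish it with an atomic
rename, so replaying a window after a failure overwrites the same file
instead of duplicating rows.

    # Idempotent ORC writer sketch: the destination path is a pure function
    # of the window, so retries after a crash are safe (overwrite, not append).
    import os
    import pyarrow as pa
    from pyarrow import orc

    def write_window(rows, window_start_epoch, out_dir):
        final = os.path.join(out_dir, "window=%d.orc" % window_start_epoch)
        tmp = final + ".tmp"
        orc.write_table(pa.Table.from_pylist(rows), tmp)
        os.replace(tmp, final)  # atomic on a local/POSIX filesystem
        return final

On HDFS the same trick would go through a rename in the HDFS client; the
point is only that retries land on the same path.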

All in all, if I were solving your problem, I would first attack the
requirements list and ask whether it can't be done more simply. If not,
Flink would be my choice, as I have had good experience with it and you can
really hack anything inside it. But prepare yourself: the requirements list
is hard, and even if you get the pipeline up and running in 2 weeks, you
will surely revisit the decision after some incidents in the next 6 months.
If you loosen the requirements a bit, it becomes easier and easier. Your
current solution sounds very reasonable to me: you picked something that
works out of the box (Kafka Connect) and did ELT, where something that can
aggregate out of the box (Hive) does the aggregation. Why exactly do you
need to replace it?

Good luck, M.

On Fri, Dec 1, 2023 at 11:38 AM Aaron Grubb <aa...@kaden.ai> wrote:

> Hi all,
>
> Posting this here to avoid biases from the individual mailing lists on why
> the product they're using is the best. I'm analyzing tools to
> replace a section of our pipeline with something more efficient. Currently
> we're using Kafka Connect to take data from Kafka and put it into
> S3 (not HDFS cause the connector is paid) in JSON format, then Hive reads
> JSON from S3 and creates ORC files in HDFS after a group by. I
> would like to replace this with something that reads Kafka, applies
> aggregations and windowing in-place and writes HDFS directly. I know that
> the impending Hive 4 release will support this but Hive LLAP is *very*
> slow when processing JSON. So far I have a working PySpark application
> that accomplishes this replacement using structured streaming + windowing,
> however the decision to evaluate Spark was based on there being a
> use for Spark in other areas, so I'm interested in getting opinions on
> other tools that may be better for this use case based on resource
> usage, ease of use, scalability, resilience, etc.
>
> In terms of absolute must-haves:
> - Read JSON from Kafka
> - Ability to summarize data over a window
> - Write ORC to HDFS
> - Fault tolerant (can tolerate anywhere from a single machine to an entire
> cluster going offline unexpectedly while maintaining exactly-once
> guarantees)
>
> Nice-to-haves:
> - Automatically scalable, both up and down (doesn't matter if standalone,
> YARN, Kubernetes, etc)
> - Coding in Java not required
>
> So far I've found that Gobblin, Flink and NiFi seem capable of
> accomplishing what I'm looking for, but neither I nor anyone at my
> company has
> any experience with those products, so I was hoping to get some opinions
> on what the users here would choose and why. I'm also open to other
> tools that I'm not yet aware of.
>
> Thanks for your time,
> Aaron
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>
