Hi all,

Posting this here rather than on the individual product mailing lists to avoid 
the inevitable bias that the product each list covers is the best one. I'm 
evaluating tools to replace a section of our pipeline with something more 
efficient. Currently we're using Kafka Connect to take data from Kafka and put 
it into S3 (not HDFS, because the HDFS connector is paid) in JSON format; Hive 
then reads the JSON from S3 and creates ORC files in HDFS after a group by.
I would like to replace this with something that reads from Kafka, applies 
aggregations and windowing in-place, and writes to HDFS directly. I know that 
the upcoming Hive 4 release will support this, but Hive LLAP is *very* slow 
when processing JSON. So far I have a working PySpark application that 
accomplishes this using structured streaming + windowing; however, Spark was 
only evaluated because there are uses for it in other areas, so I'm interested 
in opinions on other tools that may be a better fit for this use case in terms 
of resource usage, ease of use, scalability, resilience, etc.
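
For context, here's a minimal sketch of the shape of that PySpark job. The 
schema, broker address, topic name, and HDFS paths are placeholders, not our 
actual config:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, count, avg
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

# Needs the Kafka source on the classpath, e.g.
# --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark version>
spark = SparkSession.builder.appName("kafka-to-orc").getOrCreate()

# Placeholder event schema; our real one has more fields.
schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("key", StringType()),
    StructField("value", DoubleType()),
])

# Read JSON records from Kafka and parse them into columns.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
    .option("subscribe", "events")                      # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

# The watermark bounds state and lets append mode emit each window
# once it can no longer receive late data.
agg = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"), col("key"))
    .agg(count("*").alias("n"), avg("value").alias("avg_value")))

# The file sink plus a checkpoint location is what gives end-to-end
# exactly-once output to HDFS.
query = (agg.writeStream
    .format("orc")
    .option("path", "hdfs:///data/events_agg")                 # placeholder
    .option("checkpointLocation", "hdfs:///checkpoints/agg")   # placeholder
    .outputMode("append")
    .start())
query.awaitTermination()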

In terms of absolute must-haves:
- Read JSON from Kafka
- Ability to summarize data over a window
- Write ORC to HDFS
- Fault tolerant (can tolerate anywhere from a single machine to an entire 
cluster going offline unexpectedly while maintaining exactly-once
guarantees)

Nice-to-haves:
- Automatically scalable, both up and down (doesn't matter whether standalone, 
YARN, Kubernetes, etc.)
- Coding in Java not required

So far I've found that Gobblin, Flink, and NiFi all seem capable of 
accomplishing what I'm looking for, but neither I nor anyone else at my company 
has any experience with those products, so I was hoping to hear which of them 
the users here would choose and why. I'm also open to other tools that I'm not 
yet aware of.

Thanks for your time,
Aaron


