Hi all,

Posting this here to avoid the bias of the individual mailing lists, where each product's own list will naturally tell you it's the best. I'm evaluating tools to replace a section of our pipeline with something more efficient.

Currently we're using Kafka Connect to take data from Kafka and put it into S3 (not HDFS, because that connector is paid) in JSON format; Hive then reads the JSON from S3 and, after a group by, writes ORC files to HDFS. I would like to replace this with something that reads from Kafka, applies aggregations and windowing in place, and writes directly to HDFS. I know the upcoming Hive 4 release will support this, but Hive LLAP is *very* slow when processing JSON.

So far I have a working PySpark application that accomplishes this replacement using Structured Streaming plus windowing. However, the decision to evaluate Spark was based on Spark having uses in other areas, so I'm interested in opinions on other tools that may fit this use case better in terms of resource usage, ease of use, scalability, resilience, etc.
In terms of absolute must-haves:

- Read JSON from Kafka
- Ability to summarize data over a window
- Write ORC to HDFS
- Fault tolerant (can tolerate anywhere from a single machine to an entire cluster going offline unexpectedly while maintaining exactly-once guarantees)

Nice-to-haves:

- Automatically scalable, both up and down (doesn't matter if standalone, YARN, Kubernetes, etc.)
- Coding in Java not required

So far I've found that Gobblin, Flink, and NiFi all seem capable of accomplishing what I'm looking for, but no one at my company, myself included, has any experience with those products, so I was hoping to get some opinions on which of them the users here would choose, and why. I'm also open to other tools that I'm not yet aware of.

Thanks for your time,
Aaron
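For concreteness, here is an engine-agnostic sketch of the core transform I'm describing: parse JSON events, assign each to a tumbling window, and aggregate per key. The field names ("ts", "key", "value") and the 5-minute window size are hypothetical placeholders, not our actual schema; in Spark Structured Streaming this same shape becomes a `groupBy(window(...), ...)` over a Kafka source.

```python
# Minimal illustration of windowed aggregation over JSON messages.
# Field names and window size are assumptions for the example only.
import json
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling windows (assumed size)

def window_start(epoch_seconds: int) -> int:
    """Floor a timestamp to the start of its tumbling window."""
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)

def aggregate(raw_messages):
    """Sum `value` per (window, key) across a batch of JSON messages."""
    totals = defaultdict(float)
    for raw in raw_messages:
        event = json.loads(raw)
        totals[(window_start(event["ts"]), event["key"])] += event["value"]
    return dict(totals)

batch = [
    '{"ts": 100, "key": "a", "value": 1.0}',
    '{"ts": 250, "key": "a", "value": 2.0}',
    '{"ts": 400, "key": "b", "value": 5.0}',
]
print(aggregate(batch))  # {(0, 'a'): 3.0, (300, 'b'): 5.0}
```

Whatever tool we pick just needs to do this continuously, with exactly-once sinks to ORC on HDFS instead of an in-memory dict.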
