Hi, I'm considering using Apache Spark for the development of an application. It would replace a legacy program that reads CSV files and runs many (tens to hundreds of) aggregations on them. The aggregations are fairly simple: counts, sums, etc., with filtering conditions on some of the columns.
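For context, here is a minimal sketch of the kind of conditional aggregation I mean, written as a plain batch job in Scala. The column names ("status", "amount") and the input path are hypothetical, just for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-aggregations")
      .master("local[*]")
      .getOrCreate()

    // Read the CSV input (hypothetical path and columns).
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/input/*.csv")

    // Typical filtered aggregations: a count and a sum, each under a condition.
    // when(...) without otherwise(...) yields null for non-matching rows,
    // so count/sum skip them.
    df.agg(
      count(when(col("status") === "OK", 1)).as("ok_count"),
      sum(when(col("amount") > 100, col("amount"))).as("large_amount_sum")
    ).show()
  }
}
```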
I'd prefer Structured Streaming for its simplicity and low latency, and I'd also like to express the aggregations as full SQL queries (via createOrReplaceTempView). However, running multiple queries means Spark re-reads the input files for each one of them, which seems very inefficient for my use case. Does anyone have any suggestions? The only approach I've found so far involves foreachBatch and manually updating my aggregates, but I suspect there is a simpler solution for this use case.
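For reference, the foreachBatch workaround I had in mind looks roughly like the sketch below: persist each micro-batch so its underlying files are read only once, register it as a temp view, and run all the SQL queries against that cached view. The schema, column names, and path are again hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiQueryForeachBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multi-query-streaming")
      .master("local[*]")
      .getOrCreate()

    // File-based streaming sources require an explicit schema.
    val stream = spark.readStream
      .schema("status STRING, amount DOUBLE")
      .option("header", "true")
      .csv("/path/to/input/")

    val query = stream.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Cache the micro-batch so the input files behind it are scanned
        // only once, however many SQL queries run against it.
        batch.persist()
        batch.createOrReplaceTempView("events")

        val session = batch.sparkSession
        session.sql("SELECT count(*) FROM events WHERE status = 'OK'").show()
        session.sql("SELECT sum(amount) FROM events WHERE amount > 100").show()
        // ...tens/hundreds more queries here...

        batch.unpersist()
      }
      .start()

    query.awaitTermination()
  }
}
```

Note that this only produces per-batch results; keeping cumulative aggregates across batches would still require merging them manually, which is exactly the bookkeeping I was hoping to avoid.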