Hi, I'm considering using Apache Spark for the development of an application. It would replace a legacy program that reads CSV files and runs many (tens to hundreds of) aggregations on them. The aggregations are fairly simple: counts, sums, etc., with filtering conditions on some of the columns.
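For context, here is a minimal sketch of the kind of conditional aggregation I mean, written as a plain batch job in Scala. The column names ("status", "amount") and the input path are hypothetical, just for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-aggregations")
      .master("local[*]")
      .getOrCreate()

    // Read the CSV input (hypothetical path and columns).
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/input/*.csv")

    // Typical filtered aggregations: a count and a sum, each under a condition.
    // when(...) without otherwise(...) yields null for non-matching rows,
    // so count/sum skip them.
    df.agg(
      count(when(col("status") === "OK", 1)).as("ok_count"),
      sum(when(col("amount") > 100, col("amount"))).as("large_amount_sum")
    ).show()
  }
}
```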
I'd prefer Structured Streaming for its simplicity and low latency, and I'd also like to express the aggregations as full SQL queries (via createOrReplaceTempView). However, running multiple queries means Spark re-reads the input files for each one of them, which seems very inefficient for my use case. Does anyone have any suggestions? The only approach I've found so far involves foreachBatch and manually updating my aggregates, but I suspect there is a simpler solution for this use case.
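For reference, the foreachBatch workaround I had in mind looks roughly like the sketch below: persist each micro-batch so its underlying files are read only once, register it as a temp view, and run all the SQL queries against that cached view. The schema, column names, and path are again hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiQueryForeachBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multi-query-streaming")
      .master("local[*]")
      .getOrCreate()

    // File-based streaming sources require an explicit schema.
    val stream = spark.readStream
      .schema("status STRING, amount DOUBLE")
      .option("header", "true")
      .csv("/path/to/input/")

    val query = stream.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Cache the micro-batch so the input files behind it are scanned
        // only once, however many SQL queries run against it.
        batch.persist()
        batch.createOrReplaceTempView("events")

        val session = batch.sparkSession
        session.sql("SELECT count(*) FROM events WHERE status = 'OK'").show()
        session.sql("SELECT sum(amount) FROM events WHERE amount > 100").show()
        // ...tens/hundreds more queries here...

        batch.unpersist()
      }
      .start()

    query.awaitTermination()
  }
}
```

Note that this only produces per-batch results; keeping cumulative aggregates across batches would still require merging them manually, which is exactly the bookkeeping I was hoping to avoid.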