Re: Spark structured streaming - efficient way to do lots of aggregations on the same input files

2021-01-22 Thread Filip
Hi, I don't have any code for the forEachBatch approach; I mentioned it due to this response to my question on SO: https://stackoverflow.com/a/65803718/1017130. I have added some very simple code below that I think shows what I'm trying to do:

val schema = StructType( Array( StructFiel
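The foreachBatch approach referenced above could look roughly like the following sketch. This is an assumption of what the poster means, not their actual code: the schema, column names (`category`, `value`), and paths are all illustrative placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("multi-agg-sketch").getOrCreate()
import spark.implicits._

// Illustrative schema; the original message's schema is truncated.
val schema = StructType(Array(
  StructField("category", StringType),
  StructField("value", IntegerType)
))

val input = spark.readStream
  .schema(schema)
  .csv("/path/to/input")   // placeholder input directory

val query = input.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Cache the micro-batch so the many aggregations below
    // share one read of the input files instead of re-scanning.
    batch.persist()

    // Each aggregate is an ordinary batch computation with its own
    // filter, which is what "manually updating my aggregates" allows.
    val counts = batch.filter($"value" > 0).groupBy($"category").count()
    val sums   = batch.groupBy($"category").agg(sum($"value").as("total"))

    counts.write.mode("append").parquet("/path/to/out/counts")
    sums.write.mode("append").parquet("/path/to/out/sums")

    batch.unpersist()
  }
  .start()
```

The trade-off foreachBatch buys here is that the body runs as plain batch code per micro-batch, so any number of filtered aggregations can be computed against one cached DataFrame, at the cost of managing output and state yourself.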

Re: Spark structured streaming - efficient way to do lots of aggregations on the same input files

2021-01-22 Thread Jacek Laskowski
Hi Filip, Care to share the code behind "The only thing I found so far involves using forEachBatch and manually updating my aggregates."? I'm not completely sure I understand your use case and hope the code could shed more light on it. Thank you. Best regards (Pozdrawiam), Jacek Laskowski https://about.m

Spark structured streaming - efficient way to do lots of aggregations on the same input files

2021-01-21 Thread Filip
Hi, I'm considering using Apache Spark for the development of an application. It would replace a legacy program that reads CSV files and performs lots (tens/hundreds) of aggregations on them. The aggregations are fairly simple: counts, sums, etc., while applying some filtering conditions on some of
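For the batch side of a use case like this, many filtered counts and sums can be computed in a single pass over the CSV files with conditional aggregation, rather than one job per aggregate. A minimal sketch, assuming hypothetical columns `status` and `amount` (none of these names come from the original message):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("csv-agg-sketch").getOrCreate()
import spark.implicits._

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/legacy/csvs")   // placeholder path

// Tens or hundreds of such expressions can be passed to one agg() call;
// Spark evaluates them all in a single scan of the input.
val report = df.agg(
  count(when($"status" === "OK", 1)).as("ok_count"),
  count(when($"status" === "ERR", 1)).as("err_count"),
  sum(when($"amount" > 100, $"amount")).as("large_amount_total")
)

report.show()
```

Because each `when(...)` embeds its own filtering condition, this pattern scales to large numbers of simple filtered aggregates without re-reading the source files for each one.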