Hi Jacek:
The javadoc mentions that we can only consume data from the DataFrame inside
the addBatch method. So, if I would like to save the data to a new sink, then I
believe I will need to collect the data and then save it. This is the
reason I am asking about how to control the size of the data in each batch.
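The pattern described above can be sketched as follows. This is a hypothetical sink (the class name and the save step are illustrative, not from the thread); Sink here is Spark's internal org.apache.spark.sql.execution.streaming.Sink API:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Illustrative custom sink: addBatch collects the batch to the driver
// before saving it. collect() materializes the entire batch in the
// driver JVM, which is exactly why the per-batch size matters here.
class CollectingSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    val rows = data.collect() // whole batch now held in driver memory
    // save `rows` to the external sink here
  }
}
```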
Hi,
> If the data is very large then a collect may result in OOM.
That's a general risk in any part of Spark, incl. Spark Structured
Streaming. Why would you collect in addBatch? collect() brings the data to
the driver side, and like anything on the driver it is confined to a single
JVM (and usually not fault tolerant).
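A sketch of the executor-side alternative: Structured Streaming's foreach sink runs a ForeachWriter on the executors, one instance per partition, so the batch never has to be collected to the driver. The class name and connection handling below are illustrative, and the (partitionId, epochId) signature assumes Spark 2.4+:

```scala
import org.apache.spark.sql.{ForeachWriter, Row}

// Illustrative writer: each executor task gets its own instance,
// so writes happen in parallel and no single JVM holds the batch.
class RowWriter extends ForeachWriter[Row] {
  override def open(partitionId: Long, epochId: Long): Boolean = {
    // open a connection to the external store; return false to skip
    true
  }
  override def process(row: Row): Unit = {
    // write one row to the external store
  }
  override def close(errorOrNull: Throwable): Unit = {
    // clean up the connection
  }
}
// usage: df.writeStream.foreach(new RowWriter).start()
```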
Thanks Tathagata for your answer.
The reason I was asking about controlling the data size is that the javadoc
indicates you can use foreach or collect on the DataFrame. If the data is very
large then a collect may result in an OOM.
From your answer, it appears that the only way to control the size (in each
trigger) is at the source.
1. It is all the result data in that trigger. Note that addBatch takes a
DataFrame, which is a purely logical representation of data and has no
association with partitions, etc., which are physical representations.
2. If you want to limit the amount of data that is processed in a trigger,
then you should limit it at the source, e.g., with source-specific rate
limits such as maxFilesPerTrigger (file source) or maxOffsetsPerTrigger
(Kafka source).
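A sketch of limiting per-trigger volume at the source (the path, topic, and values below are illustrative; each option applies only to its own source type):

```scala
// File source: cap the number of new files picked up per trigger.
val fileStream = spark.readStream
  .format("json")
  .option("maxFilesPerTrigger", 10)
  .load("/path/to/input")

// Kafka source: cap the number of offsets (records) read per trigger.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "topic")
  .option("maxOffsetsPerTrigger", 100000)
  .load()
```

With these caps in place, the DataFrame handed to the sink in each trigger stays bounded, regardless of what the sink does with it.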