I'm curious if someone could provide a bit deeper insight into how the
MEMORY_AND_DISK_SER persistence level works.
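For concreteness, here's a minimal Scala sketch of how these two levels
are set (df stands in for whatever is being cached):

  import org.apache.spark.storage.StorageLevel

  // MEMORY_ONLY_SER: serialized in memory; partitions that don't fit
  // are dropped and recomputed from lineage the next time they're needed.
  df.persist(StorageLevel.MEMORY_ONLY_SER)

  // MEMORY_AND_DISK_SER would instead spill partitions that don't fit
  // in memory to local disk, so nothing has to be recomputed:
  // df.persist(StorageLevel.MEMORY_AND_DISK_SER)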
I've noticed that if my cluster has 2.2 TB of memory and I set the
persistence level to MEMORY_ONLY_SER, Spark will use about 2 TB and the
Storage tab shows a 97-99% fraction cached.
Also, it doesn't seem possible to report storage evictions at the moment?
That would be a really nice feature, to be able to set up alerts on such an
event.
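In the meantime, a possible workaround (a sketch, not a built-in
feature) is a custom SparkListener that watches block updates: an RDD
block whose updated storage level no longer includes memory has likely
been evicted or spilled. The class name and the println are
illustrative; real alerting wiring is left out:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

  class EvictionLogger extends SparkListener {
    override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
      val info = event.blockUpdatedInfo
      // A cached RDD block that loses its in-memory copy shows up here
      // with a storage level that no longer uses memory.
      if (info.blockId.isRDD && !info.storageLevel.useMemory) {
        println(s"Possible eviction: ${info.blockId} -> ${info.storageLevel}")
      }
    }
  }

Register it with spark.sparkContext.addSparkListener(new EvictionLogger)
before the cache is populated.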
On Wed, Jun 16, 2021 at 3:07 PM Zilvinas Saltys <zilvinas.sal...@verizonmedia.com> wrote:
Hi,
I'm running Spark 3.0.1 on AWS. Dynamic allocation is disabled. I'm caching
a large dataset 100% in memory. Before caching it I coalesce the dataset to
1792 partitions. There are 112 executors and 896 cores on the cluster.
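As a sketch of that setup (df stands in for the dataset; the storage
level is my assumption, since the message doesn't name one):

  import org.apache.spark.storage.StorageLevel

  // 1792 partitions = 2 per core on 896 cores, i.e. 16 per executor on
  // 112 executors, so the cached blocks map evenly onto the cluster.
  val cached = df.coalesce(1792).persist(StorageLevel.MEMORY_ONLY_SER)

  // Force materialization so the next stage reads entirely from the cache.
  cached.count()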
The next stage reads those 1792 partitions as input. The query
The challenge I have is this. There are two streams of data, where an
event might look like this in stream1: (time, hashkey, foo1) and in
stream2: (time, hashkey, foo2).
The result after joining should be (time, hashkey, foo1, foo2). The join
happens on hashkey and the time difference can be ~30
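If those are two DataFrames joined in batch, a sketch of the join might
look like the following; stream1/stream2 are placeholders, and since the
sentence above is cut off, the 30-minute bound is only an assumption for
illustration:

  import org.apache.spark.sql.functions.{col, expr}

  // Join on hashkey, keeping only pairs whose event times fall within
  // 30 minutes of each other (the unit is assumed, see above).
  val joined = stream1.alias("a")
    .join(stream2.alias("b"), expr(
      """a.hashkey = b.hashkey AND
         abs(unix_timestamp(a.time) - unix_timestamp(b.time)) <= 30 * 60"""))
    .select(col("a.time"), col("a.hashkey"), col("a.foo1"), col("b.foo2"))

If they're actually unbounded streams, Structured Streaming's
stream-stream join would need a watermark on both sides plus the same
kind of time-range condition.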