memory_and_disk persistence level algorithm

2021-08-31 Thread Zilvinas Saltys
I'm curious whether someone could provide a bit deeper insight into how the memory_and_disk_ser persistence level works. I've noticed that if my cluster has 2.2 TB of memory and I set the persistence level to memory_only_ser, Spark uses about 2 TB and the storage tab shows a 97-99% fraction cached
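The difference between the two levels comes down to what happens when a partition no longer fits in storage memory. The toy model below is an illustrative sketch in plain Python, not Spark's actual BlockManager code; the function name and the sizes are hypothetical.

```python
# Toy model: where does a cached partition land under each persistence level?
# MEMORY_AND_DISK_SER spills partitions that don't fit to disk;
# MEMORY_ONLY_SER simply drops them (they are recomputed from lineage later).

def cache_partitions(sizes_gb, memory_budget_gb, use_disk_fallback):
    """Assign each partition index to 'memory', 'disk', or 'dropped'."""
    placement = {}
    used = 0.0
    for i, size in enumerate(sizes_gb):
        if used + size <= memory_budget_gb:
            placement[i] = "memory"
            used += size
        elif use_disk_fallback:          # MEMORY_AND_DISK_SER behavior
            placement[i] = "disk"
        else:                            # MEMORY_ONLY_SER behavior
            placement[i] = "dropped"
    return placement

# Five 1 GB partitions, 3 GB of storage memory:
p = cache_partitions([1.0] * 5, 3.0, use_disk_fallback=True)
# partitions 0-2 land in memory, 3-4 spill to disk;
# with use_disk_fallback=False the last two would be dropped instead
```

This also suggests why the storage tab can sit at 97-99% with memory_only_ser: the partitions that didn't fit are silently dropped rather than spilled, so the cached fraction stays just under 100%.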

Re: Spark PROCESS_LOCAL vs RACK_LOCAL, stage not scheduling tasks

2021-06-16 Thread Zilvinas Saltys
It doesn't seem possible to report storage evictions at the moment? That would be a really nice feature: being able to set up alerts on such an event.
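One workaround for the missing eviction metric is to poll Spark's monitoring REST API (the `/api/v1/applications/<app-id>/storage/rdd` endpoint) from an external script and alert when the cached fraction of an RDD drops. The sketch below assumes the field names documented for that API (`numPartitions`, `numCachedPartitions`, `name`); verify the exact payload shape on your Spark version, and note the sample values are hypothetical.

```python
# External eviction alert sketch: given storage records from Spark's REST API,
# flag any cached RDD whose cached-partition fraction fell below a threshold.

def cached_fraction(rdd_info):
    """Fraction of partitions currently cached, from one RDD storage record."""
    total = rdd_info["numPartitions"]
    return rdd_info["numCachedPartitions"] / total if total else 0.0

def eviction_alerts(rdd_infos, threshold=0.99):
    """Return names of RDDs whose cached fraction is below the threshold."""
    return [r["name"] for r in rdd_infos if cached_fraction(r) < threshold]

# Abbreviated example payload (hypothetical values):
sample = [
    {"name": "rdd_42", "numPartitions": 1792, "numCachedPartitions": 1792},
    {"name": "rdd_7",  "numPartitions": 1792, "numCachedPartitions": 1750},
]
# eviction_alerts(sample) flags "rdd_7" (1750/1792 is about 0.977, below 0.99)
```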

Spark PROCESS_LOCAL vs RACK_LOCAL, stage not scheduling tasks

2021-06-16 Thread Zilvinas Saltys
Hi, I'm running Spark 3.0.1 on AWS. Dynamic allocation is disabled. I'm caching a large dataset 100% in memory. Before caching it I coalesce the dataset to 1792 partitions. There are 112 executors and 896 cores on the cluster. The next stage reads those 1792 partitions as input. The query

streaming joining multiple streams

2015-02-05 Thread Zilvinas Saltys
The challenge I have is this: there are two streams of data where an event might look like (time, hashkey, foo1) in stream1 and (time, hashkey, foo2) in stream2. The result after joining should be (time, hashkey, foo1, foo2). The join happens on hashkey and the time difference can be ~30
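The join described above can be sketched in plain Python as a buffered join on hashkey with a time tolerance. This is a minimal in-memory illustration, not a streaming implementation: the tolerance of 30 time units and the function name are assumptions, and a real streaming version would also need state expiry (e.g. watermarking) so the buffers don't grow without bound.

```python
# Join (time, key, value) events from two streams on key, keeping only
# pairs whose timestamps are within `tolerance` of each other.
from collections import defaultdict

def join_streams(stream1, stream2, tolerance=30):
    """Return (time, key, foo1, foo2) tuples for matching events."""
    by_key = defaultdict(list)           # buffer stream2 events per key
    for t, k, foo2 in stream2:
        by_key[k].append((t, foo2))
    out = []
    for t1, k, foo1 in stream1:
        for t2, foo2 in by_key[k]:
            if abs(t1 - t2) <= tolerance:
                out.append((t1, k, foo1, foo2))
    return out

s1 = [(100, "a", "x1"), (200, "b", "y1")]
s2 = [(110, "a", "x2"), (400, "b", "y2")]
# join_streams(s1, s2) yields only the "a" pair; the "b" events are 200 apart
```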