[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
Hi all, I've noticed that Spark's xxhash64 output doesn't match other tools' output because it uses seed=42 as a default. I've looked at a few libraries and they use 0 as a default seed: - python https://github.com/ifduyue/python-xxhash - java https://github.com/OpenHFT/Zero-Allocation-Hashing/ - java (slic

Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Igor Calabria
You might be affected by this issue: https://github.com/apache/iceberg/issues/8601 It was already patched but it isn't released yet. On Thu, Oct 5, 2023 at 7:47 PM Prashant Sharma wrote: > Hi Sanket, more details might help here. > > How does your spark configuration look like? > > What exactly

Re: Efficiently updating running sums only on new data

2022-10-13 Thread Igor Calabria
You can tag the last entry for each key using the same window you're using for your rolling sum. Something like this: "LEAD(1) OVER your_window IS NULL as last_record". Then, you just UNION ALL the last entry of each key (which you tagged) with the new data and run the same query over the windowed
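
A minimal sketch of the tagging idea described above, run against an in-memory SQLite database rather than Spark so that it is self-contained. The table names (totals, new_data), the column names (k, ts, amount, running_sum), and the choice to carry the previous running sum forward as the tagged row's amount are all assumptions made for illustration; the original message is truncated and does not spell out these details.

```python
import sqlite3

# Hypothetical schema: "totals" holds previously computed running sums,
# "new_data" holds rows that arrived since the last run.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE totals (k TEXT, ts INTEGER, amount INTEGER, running_sum INTEGER);
INSERT INTO totals VALUES ('a', 1, 10, 10), ('a', 2, 5, 15), ('b', 1, 7, 7);
CREATE TABLE new_data (k TEXT, ts INTEGER, amount INTEGER);
INSERT INTO new_data VALUES ('a', 3, 1), ('a', 4, 2), ('b', 2, 3);
""")

query = """
WITH tagged AS (
    -- LEAD(1) is NULL only on the last row of each key's window.
    SELECT k, ts, running_sum,
           LEAD(1) OVER (PARTITION BY k ORDER BY ts) IS NULL AS last_record
    FROM totals
),
combined AS (
    -- Assumption: the tagged row carries the old running sum forward.
    SELECT k, ts, running_sum AS amount, 1 AS carried
    FROM tagged WHERE last_record
    UNION ALL
    SELECT k, ts, amount, 0 AS carried FROM new_data
),
summed AS (
    -- Same window as before, now covering only carried + new rows.
    SELECT k, ts, amount, carried,
           SUM(amount) OVER (PARTITION BY k ORDER BY ts) AS running_sum
    FROM combined
)
SELECT k, ts, amount, running_sum
FROM summed
WHERE carried = 0
ORDER BY k, ts
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Only one old row per key is read back in, so the windowed query touches the new data plus a single carried row per key instead of the full history. The same SQL shape should translate to Spark SQL, though that has not been verified here.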

Re: Help with Shuffle Read performance

2022-09-30 Thread Igor Calabria
needs, given >the fact that you only have 128GB RAM. > > Hope this helps... > > On 9/29/22 2:12 PM, Igor Calabria wrote: > > Hi Everyone, > > I'm running spark 3.2 on kubernetes and have a job with a decently sized > shuffle of almost 4TB. The relevant cluster

Re: Help with Shuffle Read performance

2022-09-29 Thread Igor Calabria
instance storage, your 30x30 exchange can run into EBS IOPS limits. You can > investigate that by going to an instance, then to volume, and see > monitoring charts. > > Another thought is that you're essentially giving 4GB per core. That > sounds pretty low, in my experience. >

Help with Shuffle Read performance

2022-09-29 Thread Igor Calabria
Hi Everyone, I'm running Spark 3.2 on Kubernetes and have a job with a decently sized shuffle of almost 4TB. The relevant cluster config is as follows: - 30 executors, each with 16 physical cores, configured with 32 cores for Spark - 128 GB RAM - shuffle.partitions is 18k which gives me tasks of around 15