Hi all,

I was playing with a super sparse matrix FK, 2e7 by 1e6, with only one
non-zero value in each row, i.e., 2e7 non-zero values in total (a sparsity
of 1e-6).
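For anyone who wants to reproduce this, a matrix with the same shape and
sparsity can be generated with a small DML script like the one below (just
a sketch with made-up names; table() over sample() is one way to place
exactly one non-zero per row, not necessarily how the original FK was
produced):

"""
# Generate a 2e7 x 1e6 matrix with exactly one non-zero (value 1) per row.
N = 20000000;   # rows
M = 1000000;    # columns
# sample(M, N, TRUE): N random column positions in 1..M, with replacement;
# table() then sets cell (i, col_i) = 1 for each row i.
FK = table(seq(1, N), sample(M, N, TRUE), N, M);
write(FK, $FK, format="binary");
"""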

With 1GB of driver memory and 100GB of executor memory, I found that the
HOP "Spark chkpoint", which is used to pin the FK matrix in memory, is
really expensive, as it triggers lots of disk operations.
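If I read the runtime correctly, the chkpoint is compiled into a persist()
of FK's block RDD at storage level MEMORY_AND_DISK (which I believe is the
SystemML default for checkpoints), so it would go to disk as soon as the
deserialized blocks don't fit in memory — please correct me if I'm wrong.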

FK is stored in binary format in 24 blocks of ~45MB each, ~1GB in total.

For example, with this script:

"""
FK = read($FK)
print("Sum of FK = " + sum(FK))
"""

things worked fine, and it took ~8s.

But with this script:

"""
FK = read($FK)
if (1 == 1) {}
print("Sum of FK = " + sum(FK))
"""

things changed: it took ~92s, and I observed lots of disk spills in the
logs. Based on the stats from the Spark UI, it seems the materialized FK
requires >54GB of storage and thus spills to disk.
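A back-of-envelope calculation (assuming the default 1000x1000 block size)
seems consistent with that number: FK splits into 2e7/1000 = 2e4 row blocks
times 1e6/1000 = 1e3 column blocks, i.e., up to 2e7 matrix blocks, and
since each row has its single non-zero at a random column, the blocks end
up with ~1 non-zero each. At that point the per-block JVM overhead (block
headers plus sparse-row structures) of a few KB dominates, and a few KB
times ~2e7 blocks lands in the tens of GB.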

I was wondering: is this the expected behavior for a super sparse matrix?


Regards,
Mingyang
