Re: Sparse Matrix Storage Consumption Issue

2017-05-04 Thread Mingyang Wang
Out of curiosity, I increased the driver memory to 10GB, and then all
operations were executed in CP (the single-node control program). It took
37.166s, but JVM GC took 30.534s. I was wondering whether this is the
expected behavior.

Total elapsed time: 38.093 sec.
Total compilation time: 0.926 sec.
Total execution time: 37.166 sec.
Number of compiled Spark inst: 0.
Number of executed Spark inst: 0.
Cache hits (Mem, WB, FS, HDFS): 0/0/0/1.
Cache writes (WB, FS, HDFS): 0/0/0.
Cache times (ACQr/m, RLS, EXP): 30.400/0.000/0.001/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.000 sec.
Spark ctx create time (lazy): 0.000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time: 22.302 sec.
Total JVM GC count: 11.
Total JVM GC time: 30.534 sec.
Heavy hitter instructions (name, time, count):
-- 1) uak+ 37.166 sec 1
-- 2) == 0.001 sec 1
-- 3) + 0.000 sec 1
-- 4) print 0.000 sec 1
-- 5) rmvar 0.000 sec 5
-- 6) createvar 0.000 sec 1
-- 7) assignvar 0.000 sec 1
-- 8) cpvar 0.000 sec 1

Regards,
Mingyang

On Thu, May 4, 2017 at 9:48 PM Mingyang Wang  wrote:

> Hi Matthias,
>
> Thanks for the patch.
>
> I have re-run the experiment and observed that there was indeed no more
> memory pressure, but it still took ~90s for this simple script. I was
> wondering what the bottleneck is in this case.
>
>
> Total elapsed time: 94.800 sec.
> Total compilation time: 1.826 sec.
> Total execution time: 92.974 sec.
> Number of compiled Spark inst: 2.
> Number of executed Spark inst: 2.
> Cache hits (Mem, WB, FS, HDFS): 1/0/0/0.
> Cache writes (WB, FS, HDFS): 0/0/0.
> Cache times (ACQr/m, RLS, EXP): 0.000/0.000/0.000/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/0.
> HOP DAGs recompile time: 0.000 sec.
> Spark ctx create time (lazy): 0.860 sec.
> Spark trans counts (par,bc,col):0/0/0.
> Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
> Total JIT compile time: 3.498 sec.
> Total JVM GC count: 5.
> Total JVM GC time: 0.064 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) sp_uak+ 92.597 sec 1
> -- 2) sp_chkpoint 0.377 sec 1
> -- 3) == 0.001 sec 1
> -- 4) print 0.000 sec 1
> -- 5) + 0.000 sec 1
> -- 6) castdts 0.000 sec 1
> -- 7) createvar 0.000 sec 3
> -- 8) rmvar 0.000 sec 7
> -- 9) assignvar 0.000 sec 1
> -- 10) cpvar 0.000 sec 1
>
> Regards,
> Mingyang
>
> On Wed, May 3, 2017 at 8:54 AM Matthias Boehm 
> wrote:
>
>> To summarize, this was an issue with the selection of serialized
>> representations for large ultra-sparse matrices. Thanks again for sharing
>> your feedback with us.
>>
>> 1) In-memory representation: In CSR, every non-zero requires 12 bytes,
>> which is 240MB in your case. The overall memory consumption, however,
>> depends on the distribution of non-zeros: in CSR, each block with at
>> least one non-zero requires 4KB for row pointers. Assuming uniform
>> distribution (the worst case), this gives us 80GB, which is likely the
>> problem here. Every empty block would have an overhead of 44 bytes, but
>> under the worst-case assumption there are no empty blocks left. We do not
>> use COO for checkpoints because it would slow down subsequent operations.
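>>
>> (As a back-of-envelope check of these numbers, e.g. in a Scala shell; the
>> variable names below are purely illustrative:)
>>
>> """
>> // Rough in-memory CSR estimate, assuming SystemML's default 1000 x 1000
>> // blocking and uniformly distributed non-zeros (i.e., no empty blocks).
>> val rows = 2e7; val cols = 1e6; val nnz = 2e7
>> val blockSize = 1000
>> val numBlocks = (rows / blockSize) * (cols / blockSize)     // 2e7 blocks
>> val nnzBytes = nnz * 12                      // ~240 MB for the non-zeros
>> val rowPtrBytes = numBlocks * blockSize * 4  // ~4 KB row pointers/block
>> println(s"non-zeros: ${nnzBytes / 1e6} MB, row pointers: ${rowPtrBytes / 1e9} GB")
>> """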
>>
>> 2) Serialized/on-disk representation: For sparse datasets that are
>> expected to exceed aggregate memory, we used to use a serialized
>> representation (with storage level MEM_AND_DISK_SER) which uses sparse,
>> ultra-sparse, or empty representations. In this form, ultra-sparse
>> blocks require 9 + 16*nnz bytes and empty blocks require 9 bytes.
>> Therefore, with this representation selected, your dataset should
>> easily fit in aggregate memory. Also, note that chkpoint is only a
>> transformation that persists the RDD; the subsequent operation then
>> pulls the data into memory.
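>>
>> (Under the same worst-case assumption of roughly one non-zero per
>> 1000 x 1000 block, this serialized form works out to about 500 MB, e.g.:)
>>
>> """
>> // Rough serialized-size estimate; names are purely illustrative.
>> val numBlocks = 2e7                  // all blocks non-empty (worst case)
>> val nnzPerBlock = 1.0                // 2e7 non-zeros over 2e7 blocks
>> val serializedBytes = numBlocks * (9 + 16 * nnzPerBlock)    // ~500 MB
>> println(s"serialized size: ~${serializedBytes / 1e6} MB")
>> """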
>>
>> At a high level, this was a bug. We missed ultra-sparse representations
>> when introducing an improvement that stores sparse matrices (held in MCSR
>> format in memory) in CSR format on checkpoints, which eliminated the need
>> for a serialized storage level. I just delivered a fix. Now we store such
>> ultra-sparse matrices again in serialized form, which should
>> significantly reduce the memory pressure.
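>>
>> (For reference, a minimal generic Spark sketch of the difference between
>> a deserialized and a serialized storage level; this is an illustration,
>> not SystemML's actual checkpoint code, and the input path is made up:)
>>
>> """
>> import org.apache.spark.SparkContext
>> import org.apache.spark.storage.StorageLevel
>>
>> // Caches the same (hypothetical) input twice: once as deserialized Java
>> // objects on the heap, once as compact serialized bytes.
>> def cacheBoth(sc: SparkContext): Unit = {
>>   val deser = sc.textFile("hdfs:///tmp/FK")      // hypothetical input
>>   deser.persist(StorageLevel.MEMORY_AND_DISK)    // deserialized objects
>>
>>   val ser = sc.textFile("hdfs:///tmp/FK")
>>   ser.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized bytes
>> }
>> """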
>>
>> Regards,
>> Matthias
>>
>> On 5/3/2017 9:38 AM, Mingyang Wang wrote:
>> > Hi all,
>> >
>> > I was playing with a super sparse matrix FK, 2e7 by 1e6, with only one
>> > non-zero value on each row, that is, 2e7 non-zero values in total.
>> >
>> > With driver memory of 1GB and executor memory of 100GB, I found the HOP
>> > "Spark chkpoint", which is used to pin the FK matrix in memory, is really
>> > expensive, as it invokes lots of disk operations.
>> >
>> > FK is stored in binary format with 24 blocks, each block is ~45MB, and
>> > ~1GB in total.
>> >
>> > For example, with the script as
>> >
>> > """
>> > FK = read($FK)
>> > print("Sum of FK = " + sum(FK))
>> > """
>> >
>> > things worked fine, and it took ~8s.
>> >
>> > While with the script as
>> >
>> > """
>> > FK = read($FK)
>> > if (1 == 1) {}
>> > print("Sum o
