Well, SystemML decides the execution type of each operation based on its
worst-case memory estimate and the available driver memory budget, in order
to avoid unnecessary overheads for distributed operations. Maybe the size of
the matrix that is fed into cholesky is smaller than the intermediates used
for data generation? Btw, this is a characteristic we often see for other
scripts (e.g., t(X)%*%X in LinearRegDS potentially runs over large matrices,
whereas the subsequent solve only works over a features-by-features matrix) -
this was the reason why distributed operations for builtin functions like
solve and cholesky had low priority in the past.
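
For illustration, here is a minimal sketch of that pattern (the dimensions
and names are made up; with a tall-and-skinny X, only the small k-by-k solve
needs to fit into the driver):

  X = rand(rows=1000000, cols=100)  # large input -> likely distributed spark ops
  y = rand(rows=1000000, cols=1)
  A = t(X) %*% X                    # 100 x 100 output despite the large input
  b = t(X) %*% y
  beta = solve(A, b)                # tiny 100 x 100 system -> single-node CP

The worst-case estimates for X and the matrix multiplies may well exceed the
driver budget, while solve over A never does.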
If you're interested in understanding the execution plan of your scenarios
better, you can use '-explain' or '-explain hops' to see the memory budgets,
estimates, and selected execution types. Also, if you want to force all
operations to spark, you can do so via '-exec spark'. However, note that this
should be done only for debugging, because forcing all operations to spark
can be very counterproductive for performance.

Regards,
Matthias

On Sun, Apr 22, 2018 at 5:07 PM, Qifan Pu <qifan...@gmail.com> wrote:
> and everything before (e.g., I generate the matrix using another DML
> script) was indeed run by Spark and shows up on the UI.
>
> On Sun, Apr 22, 2018 at 5:05 PM, Qifan Pu <qifan...@gmail.com> wrote:
>>
>> Thanks Jeremy and Matthias. When I run the script, the cholesky or the inv
>> is executed completely on the driver, and nothing shows up on the Spark UI.
>> Is that the expected behavior?
>>
>> On Sun, Apr 22, 2018 at 3:34 PM, Jeremy Nilmeier <nilme...@us.ibm.com>
>> wrote:
>>>
>>> Yes, I also spoke with Sasha about this some time last year. Thanks for
>>> following up.
>>>
>>> Cheers, J
>>>
>>> Jerome Nilmeier, PhD
>>> Data Scientist and Engineer
>>> IBM Spark Technology Center
>>> http://www.spark.tc/
>>>
>>> ----- Original message -----
>>> From: Matthias Boehm <mboe...@gmail.com>
>>> To: dev@systemml.apache.org
>>> Cc: Qifan Pu <qifan...@gmail.com>, Jeremy Nilmeier <nilme...@us.ibm.com>
>>> Subject: Re: distributed cholesky on systemml
>>> Date: Sun, Apr 22, 2018 2:41 PM
>>>
>>> thanks for the context Jeremy - that helps. I also had an offline
>>> conversation with Sasha and he pointed me to a script that does exactly
>>> that (iterative invert_lower_triangular) combined with a parfor over
>>> independent blocks. We'll merge these scripts soon and I'll reach out
>>> individually as necessary. Thanks everybody for now.
>>>
>>> Regards,
>>> Matthias
>>>
>>> On Sun, Apr 22, 2018 at 12:40 PM, Jeremy Nilmeier <nilme...@us.ibm.com>
>>> wrote:
>>> > This may be a duplicate...it was bounced from the dev list.
>>> >
>>> > I think that a scalable triangular inverse will also have similar
>>> > properties, in that there is a sequential approach if it uses back
>>> > substitution.
>>> >
>>> > Most of these algorithms (LU, Cholesky, QR) are inherently
>>> > sequential, and the focus of the work is on minimizing interprocess
>>> > communication during the operations, which may explain why there was
>>> > only limited interest in pursuing this further.
>>> >
>>> > I had originally recommended that the recursive algorithms be rewritten
>>> > as iterative algorithms (and in fact provided an example of the LU in
>>> > iterative form), which would make the counting of operations more
>>> > transparent, as well as revealing possible parallelization points.
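>>> >
>>> > As a rough illustration of that recommendation (a hypothetical sketch,
>>> > not the exact example I shared; it assumes n is a multiple of the block
>>> > size nb and uses the builtin cholesky/inv on small blocks), an
>>> > iterative right-looking blocked Cholesky in DML could look like:
>>> >
>>> > choleskyIter = function(Matrix[Double] A, Integer nb)
>>> >   return (Matrix[Double] L)
>>> > {
>>> >   n = nrow(A);
>>> >   L = matrix(0, rows=n, cols=n);
>>> >   for (k in seq(1, n, nb)) {
>>> >     ke = k + nb - 1;
>>> >     Lkk = cholesky(A[k:ke, k:ke]);   # small diagonal block -> builtin
>>> >     L[k:ke, k:ke] = Lkk;
>>> >     if (ke < n) {
>>> >       # panel: L21 = A21 %*% inv(L11^T)
>>> >       P = A[(ke+1):n, k:ke] %*% inv(t(Lkk));
>>> >       L[(ke+1):n, k:ke] = P;
>>> >       # trailing update; the per-level parallelism becomes explicit here
>>> >       A[(ke+1):n, (ke+1):n] = A[(ke+1):n, (ke+1):n] - P %*% t(P);
>>> >     }
>>> >   }
>>> > }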
>>> >
>>> > Cheers, J
>>> > Jerome Nilmeier, PhD
>>> > Data Scientist and Engineer
>>> > IBM Spark Technology Center
>>> > http://www.spark.tc/
>>> >
>>> > ----- Original message -----
>>> > From: Matthias Boehm <mboe...@gmail.com>
>>> > To: dev@systemml.apache.org
>>> > Cc: Qifan Pu <qifan...@gmail.com>
>>> > Subject: Re: distributed cholesky on systemml
>>> > Date: Sun, Apr 22, 2018 1:21 AM
>>> >
>>> > sure no problem - thanks again for catching this issue that was hidden
>>> > for a while.
>>> >
>>> > Yes, the same depth-first characteristic applies to the Cholesky
>>> > function as well. In contrast to U_triangular_inv, however, there are
>>> > data dependencies between the blocks per level (at least in the
>>> > current algorithm formulation), which means we cannot use the approach
>>> > I described for U_triangular_inv:
>>> >
>>> > L11 = Cholesky(A11, nb)
>>> > A22 = ... U_triangular_inv(t(L11))
>>> > L22 = Cholesky(A22, nb)
>>> >
>>> > However, note that there are many fewer calls to Cholesky due to the
>>> > switch to the builtin cholesky according to the given min block size.
>>> > For example, in our new test for dimensions 1362 x 1362 and a min size
>>> > of 200, we call Cholesky 15 times but U_triangular_inv 2539 times.
>>> >
>>> > For a sufficiently large min block size this might be ok for Cholesky,
>>> > because each level also does a number of matrix multiplies that will
>>> > exploit the available parallelism of your cluster. In that regard, you
>>> > might want to experiment with different block sizes and driver memory
>>> > budgets. If I get a chance, I will also run a number of experiments
>>> > and see if we can rewrite these scripts.
>>> >
>>> > Regards,
>>> > Matthias
>>> >
>>> > On Sun, Apr 22, 2018 at 12:48 AM, Qifan Pu <qifan...@gmail.com> wrote:
>>> >> Matthias,
>>> >>
>>> >> Thanks so much for taking the time to fix this. Really appreciate it.
>>> >> Does the same reasoning apply to the cholesky script? The recursive
>>> >> approach also looks inherently sequential.
>>> >>
>>> >> Best,
>>> >> Qifan
>>> >>
>>> >> On Sat, Apr 21, 2018 at 11:39 PM, Matthias Boehm <mboe...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> just as a quick update: this issue has now been fixed in SystemML
>>> >>> master - it was essentially a missing guard for recursive functions
>>> >>> when checking for unary size-preserving functions during
>>> >>> inter-procedural analysis (IPA).
>>> >>>
>>> >>> However, while working with this recursive cholesky function I came
>>> >>> to the conclusion that it may need some rework. The current top-down,
>>> >>> depth-first approach is inherently sequential. This is partially
>>> >>> unnecessary because for the recursive function U_triangular_inv
>>> >>> (which is called many more times than cholesky), the blocks per level
>>> >>> are independent. Therefore, we should look into a bottom-up,
>>> >>> breadth-first approach to parallelize over the blocks in each level,
>>> >>> which could be done via parfor at script level.
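>>> >>>
>>> >>> For example (a rough sketch only; it assumes the matrix dimension n
>>> >>> is a multiple of the level's block size nb), the independent diagonal
>>> >>> blocks of one level could be inverted as:
>>> >>>
>>> >>> n = nrow(U);
>>> >>> nBlocks = as.integer(n / nb);
>>> >>> Uinv = matrix(0, rows=n, cols=n);
>>> >>> parfor (i in 1:nBlocks) {
>>> >>>   lo = (i-1) * nb + 1;
>>> >>>   hi = i * nb;
>>> >>>   # blocks within a level are independent -> safe to run in parallel
>>> >>>   Uinv[lo:hi, lo:hi] = inv(U[lo:hi, lo:hi]);
>>> >>> }
>>> >>>
>>> >>> with subsequent levels then combining these results bottom-up.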
>>> >>>
>>> >>> Regards,
>>> >>> Matthias
>>> >>>
>>> >>> On Sat, Apr 21, 2018 at 6:59 PM, Matthias Boehm <mboe...@gmail.com>
>>> >>> wrote:
>>> >>> > thanks for catching this - I just ran a toy example and this seems
>>> >>> > to be a rewrite issue (there are specific right-indexing rewrites
>>> >>> > that collapse U[1:k,1:k] and U[1:k,k+1:n] into a single access to
>>> >>> > U, which helps for large distributed matrices). As a workaround,
>>> >>> > you can set "sysml.optlevel" to 1 instead of the default 2 (level 1
>>> >>> > disables all rewrites), which worked fine for me. I'll fix this
>>> >>> > later today. Also, I'll fix the naming from "Choleskey" to
>>> >>> > "Cholesky". Thanks again.
>>> >>> >
>>> >>> > Regards,
>>> >>> > Matthias
>>> >>> >
>>> >>> > On Sat, Apr 21, 2018 at 6:28 PM, Qifan Pu <qifan...@gmail.com>
>>> >>> > wrote:
>>> >>> >> Hi Matthias,
>>> >>> >>
>>> >>> >> Thanks for the fast response and detailed information. This is
>>> >>> >> really helpful.
>>> >>> >>
>>> >>> >> I just tried to run it, and was tracing down an indexing bug that
>>> >>> >> can be reproduced by simply running the triangular inverse test
>>> >>> >> script [1]:
>>> >>> >>
>>> >>> >> Caused by: org.apache.sysml.runtime.DMLRuntimeException: Invalid
>>> >>> >> values for matrix indexing: [1667:3333,1:1666] must be within
>>> >>> >> matrix dimensions [1000,1000]
>>> >>> >>
>>> >>> >> Am I missing some configuration here?
>>> >>> >>
>>> >>> >> [1] https://github.com/apache/systemml/blob/master/scripts/staging/scalable_linalg/test/test_triangular_inv.dml
>>> >>> >>
>>> >>> >> Best,
>>> >>> >> Qifan
>>> >>> >>
>>> >>> >> On Sat, Apr 21, 2018 at 4:06 PM, Matthias Boehm
>>> >>> >> <mboe...@gmail.com> wrote:
>>> >>> >>>
>>> >>> >>> Hi Qifan,
>>> >>> >>>
>>> >>> >>> thanks for your feedback. You're right, the builtin functions
>>> >>> >>> cholesky, inverse, eigen, solve, svd, qr, and lu are currently
>>> >>> >>> only supported as single-node operations because they're still
>>> >>> >>> implemented via Apache Commons Math.
>>> >>> >>>
>>> >>> >>> However, there is an experimental script for distributed cholesky
>>> >>> >>> [1], which uses a recursive approach (with operations that allow
>>> >>> >>> for automatic distributed computation) for matrices larger than a
>>> >>> >>> user-defined block size. Once blocks become small enough, we
>>> >>> >>> again use the builtin cholesky. Graduating this script would
>>> >>> >>> require a broader set of experiments (and potential
>>> >>> >>> improvements), but it simply did not have the highest priority so
>>> >>> >>> far. You might want to give it a try though.
>>> >>> >>>
>>> >>> >>> Thanks again for your feedback - we'll consider a higher priority
>>> >>> >>> for these distributed operations when discussing the roadmap for
>>> >>> >>> the next releases.
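>>> >>> >>>
>>> >>> >>> In essence (a condensed, simplified sketch of the recursion in
>>> >>> >>> [1] below - the actual script also handles uneven splits and
>>> >>> >>> padding), the scheme looks like:
>>> >>> >>>
>>> >>> >>> Cholesky = function(Matrix[Double] A, Integer nb)
>>> >>> >>>   return (Matrix[Double] L)
>>> >>> >>> {
>>> >>> >>>   n = nrow(A);
>>> >>> >>>   if (n <= nb) {
>>> >>> >>>     L = cholesky(A);  # small enough -> single-node builtin
>>> >>> >>>   } else {
>>> >>> >>>     k = as.integer(floor(n / 2));
>>> >>> >>>     L11 = Cholesky(A[1:k, 1:k], nb);
>>> >>> >>>     L21 = A[(k+1):n, 1:k] %*% U_triangular_inv(t(L11));
>>> >>> >>>     A22 = A[(k+1):n, (k+1):n] - L21 %*% t(L21);
>>> >>> >>>     L22 = Cholesky(A22, nb);
>>> >>> >>>     L = rbind(cbind(L11, matrix(0, rows=k, cols=n-k)),
>>> >>> >>>               cbind(L21, L22));
>>> >>> >>>   }
>>> >>> >>> }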
>>> >>> >>>
>>> >>> >>> [1] https://github.com/apache/systemml/blob/master/scripts/staging/scalable_linalg/cholesky.dml
>>> >>> >>>
>>> >>> >>> Regards,
>>> >>> >>> Matthias
>>> >>> >>>
>>> >>> >>> On Sat, Apr 21, 2018 at 2:15 PM, Qifan Pu <qifan...@gmail.com>
>>> >>> >>> wrote:
>>> >>> >>> > Hi,
>>> >>> >>> >
>>> >>> >>> > I would love to do distributed cholesky on a large matrix with
>>> >>> >>> > SystemML. I found two related JIRAs (SYSTEMML-1213,
>>> >>> >>> > SYSTEMML-1163), but AFAIK this is currently not implemented? I
>>> >>> >>> > just wanted to check.
>>> >>> >>> >
>>> >>> >>> > Best,
>>> >>> >>> > Qifan