[jira] [Updated] (SYSTEMML-590) Assume Parent's Namespace for Nested UDF calls.
[ https://issues.apache.org/jira/browse/SYSTEMML-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dusenberry updated SYSTEMML-590: - Description: Currently, if a UDF body involves calling another UDF, the default global namespace is assumed, unless a namespace is explicitly indicated. This becomes a problem when a file contains UDFs and is then sourced from another script. Imagine a file {{funcs.dml}} as follows: {code} f = function(double x, int a) return (double ans) { x2 = g(x) ans = a * x2 } g = function(double x) return (double ans) { ans = x * x } {code} Then, let's try to call {{f}}: {code} script = """ source ("funcs.dml") as funcs ans = funcs::f(3, 1) print(ans) """ ml.reset() ml.executeScript(script) {code} This results in an error since {{f}} is in the {{funcs}} namespace, but the call to {{g}} assumes {{g}} is still in the default namespace. Clearly, the user intends to use the {{g}} that is located in the same file. Currently, we would need to adjust {{funcs.dml}} as follows to explicitly indicate that {{f}} and {{g}} are in the {{funcs}} namespace: {code} f = function(double x, int a) return (double ans) { x2 = funcs::g(x) ans = a * x2 } g = function(double x) return (double ans) { ans = x * x } {code} Instead, it would be better to first look for {{g}} in its parent's namespace. In this case, the "parent" would be the function {{f}}, and the namespace we have selected is {{funcs}}. Then, explicit namespace qualifiers would not be necessary. was: Currently, if a UDF body involves calling another UDF, the default global namespace is assumed, unless a namespace is explicitly indicated. This becomes a problem when a file contains UDFs, and is then sourced from another script. 
Imagine a file {{funcs.dml}} as follows: {code} f = function(double x, int a) return (double ans) { x2 = g(x) ans = a * x2 } g = function(double x) return (double ans) { ans = x * x } {code} Then, let's try to call {{f}}: {code} script = """ source ("funcs.dml") as funcs ans = funcs::f(3, 1) print(ans) """ ml.reset() ml.executeScript(script) {code} This results in an error since {{f}} is in the {{funcs}} namespace, but the call to {{g}} assumes {{g}} is still in the default namespace. Clearly, the user intends to use the {{g}} that is located in the same file. Currently, we would need to adjust {{funcs.dml}} as follows to explicitly indicate that {{f}} and {{g}} are in the {{funcs}} namespace: {code} f = function(double x, int a) return (double ans) { x2 = funcs::g(x) ans = a * x2 } g = function(double x) return (double ans) { ans = x * x } {code} Instead, it would be better to first look for {{g}} in its parent's namespace. In this case, the "parent" would be the function {{f}}, and the namespace we have selected is {{funcs}}. Then, explicit namespace qualifiers would not be necessary. > Assume Parent's Namespace for Nested UDF calls. > --- > > Key: SYSTEMML-590 > URL: https://issues.apache.org/jira/browse/SYSTEMML-590 > Project: SystemML > Issue Type: Sub-task >Reporter: Mike Dusenberry > > Currently, if a UDF body involves calling another UDF, the default global > namespace is assumed, unless a namespace is explicitly indicated. This > becomes a problem when a file contains UDFs, and is then sourced from another > script. 
> Imagine a file {{funcs.dml}} as follows: > {code} > f = function(double x, int a) return (double ans) { > x2 = g(x) > ans = a * x2 > } > g = function(double x) return (double ans) { > ans = x * x > } > {code} > Then, let's try to call {{f}}: > {code} > script = """ > source ("funcs.dml") as funcs > ans = funcs::f(3, 1) > print(ans) > """ > ml.reset() > ml.executeScript(script) > {code} > This results in an error since {{f}} is in the {{funcs}} namespace, but the > call to {{g}} assumes {{g}} is still in the default namespace. Clearly, the > user intends to use the {{g}} that is located in the same file. > Currently, we would need to adjust {{funcs.dml}} as follows to explicitly > indicate that {{f}} and {{g}} are in the {{funcs}} namespace: > {code} > f = function(double x, int a) return (double ans) { > x2 = funcs::g(x) > ans = a * x2 > } > g = function(double x) return (double ans) { > ans = x * x > } > {code} > Instead, it would be better to first look for {{g}} in its parent's > namespace. In this case, the "parent" would be the function {{f}}, and the > namespace we have selected is {{funcs}}. Then, explicit namespace qualifiers > would not be necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
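The proposed lookup rule, check the parent's namespace before falling back to the default one, can be sketched as follows. This is illustrative Python only, not SystemML internals; all names here ({{resolve}}, {{namespaces}}, {{DEFAULT_NS}}) are hypothetical.

```python
# Sketch of the proposed resolution rule: when a UDF body calls another
# UDF without an explicit namespace, look in the calling function's own
# namespace first, then fall back to the default namespace.
# All names here are hypothetical, for illustration only.

DEFAULT_NS = ".defaultNS"

def resolve(namespaces, call_name, parent_ns):
    """Return (namespace, function) for a function call."""
    if "::" in call_name:  # explicit namespace, e.g. funcs::g
        ns, name = call_name.split("::", 1)
        return ns, namespaces[ns][name]
    # proposed behavior: try the parent's namespace first
    if call_name in namespaces.get(parent_ns, {}):
        return parent_ns, namespaces[parent_ns][call_name]
    return DEFAULT_NS, namespaces[DEFAULT_NS][call_name]

# funcs.dml sourced as "funcs": f and g live in the "funcs" namespace
namespaces = {
    DEFAULT_NS: {},
    "funcs": {"g": lambda x: x * x},
}

def f(x, a):
    # inside f, the unqualified call g(x) now resolves within "funcs"
    _, g = resolve(namespaces, "g", parent_ns="funcs")
    return a * g(x)

namespaces["funcs"]["f"] = f

print(f(3, 1))  # the example call funcs::f(3, 1); prints 9
```

With this rule, the unqualified call to {{g}} inside {{f}} finds the {{g}} defined in the same sourced file, so no {{funcs::}} qualifier is needed in {{funcs.dml}}.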
[jira] [Created] (SYSTEMML-589) Add Default Parameter Values to UDFs
Mike Dusenberry created SYSTEMML-589: Summary: Add Default Parameter Values to UDFs Key: SYSTEMML-589 URL: https://issues.apache.org/jira/browse/SYSTEMML-589 Project: SystemML Issue Type: Sub-task Reporter: Mike Dusenberry This task aims to add default parameter values to UDFs for scalar and boolean types. There may already be runtime support, but the grammar does not seem to allow it. Example that currently works: {code} script = """ f = function(double x, int a) return (double ans) { ans = a * x } ans = f(3, 1) print(ans) """ ml.reset() ml.executeScript(script) {code} Example that would be nice: {code} script = """ f = function(double x, int a=1) return (double ans) { ans = a * x } ans = f(3) print(ans) """ ml.reset() ml.executeScript(script) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
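The requested behavior amounts to filling in missing arguments from declared defaults at call time. A minimal illustrative sketch follows; the signature encoding and the {{bind_args}} helper are hypothetical, not part of SystemML.

```python
# Sketch of binding call arguments against a parameter list that may
# carry defaults, mimicking the proposed f = function(double x, int a=1).
# The (name, default) encoding and bind_args helper are hypothetical.

def bind_args(params, args):
    """params: list of (name, default_or_None); args: positional values."""
    bound = {}
    for i, (name, default) in enumerate(params):
        if i < len(args):
            bound[name] = args[i]          # explicitly supplied
        elif default is not None:
            bound[name] = default          # fall back to the default
        else:
            raise TypeError("missing argument: " + name)
    return bound

# f = function(double x, int a=1) return (double ans) { ans = a * x }
params = [("x", None), ("a", 1)]

def f(*args):
    env = bind_args(params, args)
    return env["a"] * env["x"]

print(f(3))     # a defaults to 1, prints 3
print(f(3, 2))  # explicit a, prints 6
```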
[jira] [Created] (SYSTEMML-588) Improve UDFs
Mike Dusenberry created SYSTEMML-588: Summary: Improve UDFs Key: SYSTEMML-588 URL: https://issues.apache.org/jira/browse/SYSTEMML-588 Project: SystemML Issue Type: Epic Reporter: Mike Dusenberry This epic aims to improve the state of user-defined functions (UDFs) in DML & PyDML. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (SYSTEMML-587) Improvements Triggered By Deep Learning Work
Mike Dusenberry created SYSTEMML-587: Summary: Improvements Triggered By Deep Learning Work Key: SYSTEMML-587 URL: https://issues.apache.org/jira/browse/SYSTEMML-587 Project: SystemML Issue Type: Umbrella Reporter: Mike Dusenberry Priority: Minor This convenience umbrella tracks all improvements triggered by the work on deep learning (SYSTEMML-540), but not directly related to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SYSTEMML-580) Add Scala LogisticRegression API For Spark ML Pipeline
[ https://issues.apache.org/jira/browse/SYSTEMML-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197797#comment-15197797 ] Mike Dusenberry commented on SYSTEMML-580: -- [PR 70 | https://github.com/apache/incubator-systemml/pull/70] merged as [commit 7ce19c8097f3d24d07be87d9427890834f9a9bea | https://github.com/apache/incubator-systemml/commit/7ce19c8097f3d24d07be87d9427890834f9a9bea]. > Add Scala LogisticRegression API For Spark ML Pipeline > -- > > Key: SYSTEMML-580 > URL: https://issues.apache.org/jira/browse/SYSTEMML-580 > Project: SystemML > Issue Type: New Feature > Components: APIs >Reporter: Tommy Yu >Assignee: Tommy Yu > > I wrote a Scala ML pipeline wrapper for the LogisticRegression model as an > example for Scala users. > I propose a Scala version of the example because of several weaknesses in the > Java version. > It is not natural to extend a Scala class in Java code; we need to know the > compiled form of a function, like > @Override > public void > org$apache$spark$ml$param$shared$HasElasticNetParam$setter$elasticNetParam_$eq(DoubleParam > arg0) {} > I assume this is a setter function, but it does nothing here. > It is also hard to follow the ML parameter style; instead, parameters are > defined like below: > private IntParam icpt = new IntParam(this, "icpt", "Value of intercept"); > private DoubleParam reg = new DoubleParam(this, "reg", "Value of > regularization parameter"); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SYSTEMML-580) Add Scala LogisticRegression API For Spark ML Pipeline
[ https://issues.apache.org/jira/browse/SYSTEMML-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197785#comment-15197785 ] Mike Dusenberry commented on SYSTEMML-580: -- [PR 70 | https://github.com/apache/incubator-systemml/pull/70] submitted. > Add Scala LogisticRegression API For Spark ML Pipeline > -- > > Key: SYSTEMML-580 > URL: https://issues.apache.org/jira/browse/SYSTEMML-580 > Project: SystemML > Issue Type: New Feature > Components: APIs >Reporter: Tommy Yu >Assignee: Tommy Yu > > I wrote a Scala ML pipeline wrapper for the LogisticRegression model as an > example for Scala users. > I propose a Scala version of the example because of several weaknesses in the > Java version. > It is not natural to extend a Scala class in Java code; we need to know the > compiled form of a function, like > @Override > public void > org$apache$spark$ml$param$shared$HasElasticNetParam$setter$elasticNetParam_$eq(DoubleParam > arg0) {} > I assume this is a setter function, but it does nothing here. > It is also hard to follow the ML parameter style; instead, parameters are > defined like below: > private IntParam icpt = new IntParam(this, "icpt", "Value of intercept"); > private DoubleParam reg = new DoubleParam(this, "reg", "Value of > regularization parameter"); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SYSTEMML-580) Add Scala LogisticRegression API For Spark ML Pipeline
[ https://issues.apache.org/jira/browse/SYSTEMML-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dusenberry updated SYSTEMML-580: - Summary: Add Scala LogisticRegression API For Spark ML Pipeline (was: Add Scala LogisticRegression API For Spark Pipeline) > Add Scala LogisticRegression API For Spark ML Pipeline > -- > > Key: SYSTEMML-580 > URL: https://issues.apache.org/jira/browse/SYSTEMML-580 > Project: SystemML > Issue Type: New Feature > Components: APIs >Reporter: Tommy Yu >Assignee: Tommy Yu > > I wrote a Scala ML pipeline wrapper for the LogisticRegression model as an > example for Scala users. > I propose a Scala version of the example because of several weaknesses in the > Java version. > It is not natural to extend a Scala class in Java code; we need to know the > compiled form of a function, like > @Override > public void > org$apache$spark$ml$param$shared$HasElasticNetParam$setter$elasticNetParam_$eq(DoubleParam > arg0) {} > I assume this is a setter function, but it does nothing here. > It is also hard to follow the ML parameter style; instead, parameters are > defined like below: > private IntParam icpt = new IntParam(this, "icpt", "Value of intercept"); > private DoubleParam reg = new DoubleParam(this, "reg", "Value of > regularization parameter"); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (SYSTEMML-580) Add Scala LogisticRegression API For Spark Pipeline
Mike Dusenberry created SYSTEMML-580: Summary: Add Scala LogisticRegression API For Spark Pipeline Key: SYSTEMML-580 URL: https://issues.apache.org/jira/browse/SYSTEMML-580 Project: SystemML Issue Type: New Feature Reporter: Tommy Yu Assignee: Tommy Yu I wrote a Scala ML pipeline wrapper for the LogisticRegression model as an example for Scala users. I propose a Scala version of the example because of several weaknesses in the Java version. It is not natural to extend a Scala class in Java code; we need to know the compiled form of a function, like @Override public void org$apache$spark$ml$param$shared$HasElasticNetParam$setter$elasticNetParam_$eq(DoubleParam arg0) {} I assume this is a setter function, but it does nothing here. It is also hard to follow the ML parameter style; instead, parameters are defined like below: private IntParam icpt = new IntParam(this, "icpt", "Value of intercept"); private DoubleParam reg = new DoubleParam(this, "reg", "Value of regularization parameter"); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SYSTEMML-540) Deep Learning
[ https://issues.apache.org/jira/browse/SYSTEMML-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198220#comment-15198220 ] Mike Dusenberry commented on SYSTEMML-540: -- Update: I'm working on an experimental, layers-based framework directly in DML to contain layer abstractions with simple forward/backward APIs for affine, convolution (start with 2D), max-pooling, non-linearities (relu, sigmoid, softmax, etc.), dropout, loss functions, and other layers. As part of this experiment, I'm starting by implementing as much as possible in DML, and then will move to built-in functions as necessary. > Deep Learning > - > > Key: SYSTEMML-540 > URL: https://issues.apache.org/jira/browse/SYSTEMML-540 > Project: SystemML > Issue Type: Epic >Reporter: Mike Dusenberry >Assignee: Mike Dusenberry > > This epic covers the addition of deep learning to SystemML, including: > * Core DML layer abstractions for deep (convolutional, recurrent) neural > nets, with simple forward/backward API: affine, convolution (start with 2D), > max-pooling, non-linearities (relu, sigmoid, softmax), dropout, loss > functions. > * Modularized DML optimizers: (mini-batch, stochastic) gradient descent (w/ > momentum, etc.). > * Additional DML language support as necessary (tensors, built-in functions > such as convolution, function pointers, list structures, etc.). > * Integration with other deep learning frameworks (Caffe, Torch, Theano, > TensorFlow, etc.) via automatic DML code generation. > * etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
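The simple forward/backward layer API described above can be sketched as follows, using a ReLU layer in plain Python for illustration; the actual work targets DML, so this only mirrors the API shape, and the function names are hypothetical.

```python
# Illustrative sketch of a layer with a simple forward/backward API,
# using ReLU. forward returns (output, cache); backward consumes the
# upstream gradient and the cache. Names are hypothetical.

def relu_forward(x):
    """Forward pass: out = max(0, x), element-wise."""
    out = [max(0.0, v) for v in x]
    return out, x  # cache the input for the backward pass

def relu_backward(dout, cache):
    """Backward pass: route gradients only where the input was > 0."""
    x = cache
    return [d if v > 0 else 0.0 for d, v in zip(dout, x)]

x = [-1.0, 2.0, -3.0, 4.0]
out, cache = relu_forward(x)
dx = relu_backward([1.0, 1.0, 1.0, 1.0], cache)
print(out)  # [0.0, 2.0, 0.0, 4.0]
print(dx)   # [0.0, 1.0, 0.0, 1.0]
```

Each layer (affine, convolution, max-pooling, dropout, loss) would follow the same two-function contract, which is what makes the layers composable.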
[jira] [Created] (SYSTEMML-582) Determine If Multiple Builds Are Needed For Different Scala Versions.
Mike Dusenberry created SYSTEMML-582: Summary: Determine If Multiple Builds Are Needed For Different Scala Versions. Key: SYSTEMML-582 URL: https://issues.apache.org/jira/browse/SYSTEMML-582 Project: SystemML Issue Type: New Feature Reporter: Mike Dusenberry -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (SYSTEMML-580) Add Scala LogisticRegression API For Spark ML Pipeline
[ https://issues.apache.org/jira/browse/SYSTEMML-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dusenberry resolved SYSTEMML-580. -- Resolution: Fixed > Add Scala LogisticRegression API For Spark ML Pipeline > -- > > Key: SYSTEMML-580 > URL: https://issues.apache.org/jira/browse/SYSTEMML-580 > Project: SystemML > Issue Type: New Feature > Components: APIs >Reporter: Tommy Yu >Assignee: Tommy Yu > > I wrote a scala ml pipeline wrapper for LogisticRegression Model as a example > for scala user. > I propose a scala version example since some weakness for java version. > It's not naturally to extend scala class in java code. We need know function > style after compile, like > @Override > public void > org$apache$spark$ml$param$shared$HasElasticNetParam$setter$elasticNetParam_$eq(DoubleParam > arg0) {} > I assume it's set function, but do nothing here > Hard to follow ml parameter style, but define parameter like below > private IntParam icpt = new IntParam(this, "icpt", "Value of intercept"); > private DoubleParam reg = new DoubleParam(this, "reg", "Value of > regularization parameter"); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SYSTEMML-579) Packing our algorithm scripts into JAR
[ https://issues.apache.org/jira/browse/SYSTEMML-579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dusenberry updated SYSTEMML-579: - Description: Packing our algorithm scripts into the JAR without looking into the user's filesystem. We should look into the possibility of packing our algorithm scripts into the JAR during build time, perhaps as a Maven "resource" that would be available to the Java process without needing to look into the user's filesystem. This should help with the Scala API introduced in SYSTEMML-580. One issue I see with the current approach is that if a user wishes to attach the SystemML JAR to a cloud notebook (such as Databricks Cloud) in which it may not be possible to set an environment variable, the API will not function. was:Packing our algorithm to JAR without look into the user's filesystem. > Packing our algorithm scripts into JAR > -- > > Key: SYSTEMML-579 > URL: https://issues.apache.org/jira/browse/SYSTEMML-579 > Project: SystemML > Issue Type: Task > Components: Algorithms, APIs >Affects Versions: SystemML 0.9 >Reporter: Tommy Yu >Priority: Minor > > Packing our algorithm scripts into the JAR without looking into the user's filesystem. > We should look into the possibility of packing our algorithm scripts into the > JAR during build time, perhaps as a Maven "resource" that would be available > to the Java process without needing to look into the user's filesystem. This > should help with the Scala API introduced in SYSTEMML-580. One issue I see > with the current approach is that if a user wishes to attach the SystemML JAR to a > cloud notebook (such as Databricks Cloud) in which it may not be possible > to set an environment variable, the API will not function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
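Since a JAR is a zip archive, the "resource packed at build time, read at run time" idea can be sketched in Python with the standard {{zipfile}} module. This is only an illustrative analogy, not SystemML code, and the file names are hypothetical; the Java side would read the resource from the classpath instead.

```python
# Sketch: pack a script into a zip archive (a JAR is a zip) at "build
# time", then read it back directly from the archive at "run time",
# without a filesystem path or environment variable. Names hypothetical.
import io
import zipfile

# "Build time": pack the algorithm script into the archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("scripts/LinearReg.dml", "# algorithm script here\n")

# "Run time": read the script straight out of the archive.
with zipfile.ZipFile(buf) as jar:
    script = jar.read("scripts/LinearReg.dml").decode("utf-8")

print(script)
```

The design point is that the script's location is fixed relative to the archive, so attaching the JAR to a notebook environment is sufficient; no environment variable needs to point at a scripts directory.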
[jira] [Closed] (SYSTEMML-545) Document Scala build support in Eclipse
[ https://issues.apache.org/jira/browse/SYSTEMML-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dusenberry closed SYSTEMML-545. > Document Scala build support in Eclipse > --- > > Key: SYSTEMML-545 > URL: https://issues.apache.org/jira/browse/SYSTEMML-545 > Project: SystemML > Issue Type: Improvement > Components: Build >Reporter: Glenn Weidner >Assignee: Glenn Weidner > > In preparation for [SYSTEMML-543 Refactor MLContext in > Scala|https://issues.apache.org/jira/browse/SYSTEMML-543], the project build > needs to support Scala in Eclipse. Initial investigation and discussion can > be found in [PR70|https://github.com/apache/incubator-systemml/pull/70]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (SYSTEMML-581) Add Scala API Tests to Maven Test Suites
Mike Dusenberry created SYSTEMML-581: Summary: Add Scala API Tests to Maven Test Suites Key: SYSTEMML-581 URL: https://issues.apache.org/jira/browse/SYSTEMML-581 Project: SystemML Issue Type: New Feature Reporter: Mike Dusenberry Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SYSTEMML-577) Add High-Level "executeScript" API to Python MLContext
[ https://issues.apache.org/jira/browse/SYSTEMML-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dusenberry updated SYSTEMML-577: - Description: This adds the {{executeScript(...)}} function to the Python MLContext API, and in the process hides the need to use {{registerInput(...)}} and {{registerOutput(...)}} by allowing the user to pass in a dictionary of key:value inputs of any type, and an array of outputs to keep. Example: {code} pnmf = """ // script here """ outputs = ml.executeScript(pnmf, {"X": X_train, "maxiter": 100, "rank": 10}, ["W", "H", "negloglik"]) {code} was: This adds the {{executeScript(...)}} function to the Python MLContext API, and in the process hides the need to use {{registerInput(...)}} and {{registerOutput(...)}} by allowing the user to pass in a dictionary of key:value inputs of any type, and an array of outputs to keep. Example: {code} pnmf = """ // script here """ outputs = ml.executeScript(pnmf, {"X": X_train, "maxiter": 100, "rank": 10}, ["W", "H", "negloglik"]) {code} > Add High-Level "executeScript" API to Python MLContext > -- > > Key: SYSTEMML-577 > URL: https://issues.apache.org/jira/browse/SYSTEMML-577 > Project: SystemML > Issue Type: Improvement >Reporter: Mike Dusenberry >Assignee: Mike Dusenberry >Priority: Minor > > This adds the {{executeScript(...)}} function to the Python MLContext API, > and in the process hides the need to use {{registerInput(...)}} and > {{registerOutput(...)}} by allowing the user to pass in a dictionary of > key:value inputs of any type, and an array of outputs to keep. > Example: > {code} > pnmf = """ // script here """ > outputs = ml.executeScript(pnmf, {"X": X_train, "maxiter": 100, "rank": 10}, > ["W", "H", "negloglik"]) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
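A minimal sketch of how such an {{executeScript(...)}} wrapper can subsume {{registerInput(...)}} and {{registerOutput(...)}}: iterate the inputs dictionary and outputs list internally. The {{MLContextSketch}} class below is hypothetical and does not execute any script; it only mirrors the described API shape.

```python
# Illustrative sketch of wrapping registerInput / registerOutput behind
# a single executeScript(...) call. MLContextSketch is hypothetical and
# stands in for the real Python MLContext.

class MLContextSketch:
    def __init__(self):
        self.inputs, self.outputs = {}, []

    def registerInput(self, name, value):
        self.inputs[name] = value

    def registerOutput(self, name):
        self.outputs.append(name)

    def executeScript(self, script, inputs=None, outputs=None):
        # Hide the registration calls from the user.
        for name, value in (inputs or {}).items():
            self.registerInput(name, value)
        for name in (outputs or []):
            self.registerOutput(name)
        # A real implementation would now run `script`; here we simply
        # return a dict keyed by the requested output names.
        return {name: None for name in self.outputs}

ml = MLContextSketch()
outs = ml.executeScript("# script here", {"maxiter": 100, "rank": 10},
                        ["W", "H", "negloglik"])
print(sorted(outs))  # ['H', 'W', 'negloglik']
```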
[jira] [Commented] (SYSTEMML-543) Refactor MLContext in Scala
[ https://issues.apache.org/jira/browse/SYSTEMML-543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179337#comment-15179337 ] Mike Dusenberry commented on SYSTEMML-543: -- [~tommy_cug] Thanks for reaching out! I haven't started on this, so please feel free to work on it. However, I think that the redesign will rely on what [~deron] is working on with SYSTEMML-544, so please coordinate with him! :) > Refactor MLContext in Scala > --- > > Key: SYSTEMML-543 > URL: https://issues.apache.org/jira/browse/SYSTEMML-543 > Project: SystemML > Issue Type: Improvement >Reporter: Mike Dusenberry > > Our {{MLContext}} API relies on a myriad of optional parameters as > conveniences for end-users, which has led to our Java implementation growing > in size. Moving to Scala will allow us to use default parameters and > continue to expand the capabilities of the API in a clean way. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SYSTEMML-540) Deep Learning
[ https://issues.apache.org/jira/browse/SYSTEMML-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dusenberry updated SYSTEMML-540: - Description: This epic covers the addition of deep learning to SystemML, including: * Core DML layer abstractions for deep (convolutional) neural nets. * DML language support as necessary. * DML code generation (Caffe, Torch, Theano, TensorFlow, etc. integration) * etc. was: This epic covers the addition of deep learning to SystemML, including: * Core DML layer abstractions for deep (convolutional) neural nets. * DML language support as necessary. * DML code generation (Caffe, Theano, etc. integration) * etc. > Deep Learning > - > > Key: SYSTEMML-540 > URL: https://issues.apache.org/jira/browse/SYSTEMML-540 > Project: SystemML > Issue Type: Epic >Reporter: Mike Dusenberry > > This epic covers the addition of deep learning to SystemML, including: > * Core DML layer abstractions for deep (convolutional) neural nets. > * DML language support as necessary. > * DML code generation (Caffe, Torch, Theano, TensorFlow, etc. integration) > * etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SYSTEMML-512) DML Script With UDFs Results In Out Of Memory Error As Compared to Without UDFs
[ https://issues.apache.org/jira/browse/SYSTEMML-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152761#comment-15152761 ] Mike Dusenberry commented on SYSTEMML-512: -- [~mboehm7] Confirmed -- the OOM issue is indeed related to the young generation heap size. Setting -Xmn=100M with driver memory still set to 1G allows the script to run. Is there anything we can do internally to avoid this? For clarity to anyone else reading this, the long runtime issue is still present. > DML Script With UDFs Results In Out Of Memory Error As Compared to Without > UDFs > --- > > Key: SYSTEMML-512 > URL: https://issues.apache.org/jira/browse/SYSTEMML-512 > Project: SystemML > Issue Type: Bug >Reporter: Mike Dusenberry > Attachments: test1.scala, test2.scala > > > Currently, the following script for running a simple version of Poisson > non-negative matrix factorization (PNMF) runs in linear time as desired: > {code} > # data & args > X = read($X) > X = X+1 # change product IDs to be 1-based, rather than 0-based > V = table(X[,1], X[,2]) > V = V[1:$size,1:$size] > max_iteration = as.integer($maxiter) > rank = as.integer($rank) > # run PNMF > n = nrow(V) > m = ncol(V) > range = 0.01 > W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform") > H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform") > i=0 > while(i < max_iteration) { > H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W)) > W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H)) > i = i + 1; > } > # compute negative log-likelihood > negloglik_temp = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H))) > # write outputs > negloglik = matrix(negloglik_temp, rows=1, cols=1) > write(negloglik, $negloglikout) > write(W, $Wout) > write(H, $Hout) > {code} > However, a small refactoring of this same script to pull the core PNMF > algorithm and the negative log-likelihood computation out into separate UDFs > results in non-linear runtime and a Java out of memory heap error on the same > dataset. 
> {code} > pnmf = function(matrix[double] V, integer max_iteration, integer rank) return > (matrix[double] W, matrix[double] H) { > n = nrow(V) > m = ncol(V) > > range = 0.01 > W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform") > H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform") > > i=0 > while(i < max_iteration) { > H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W)) > W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H)) > i = i + 1; > } > } > negloglikfunc = function(matrix[double] V, matrix[double] W, matrix[double] > H) return (double negloglik) { > negloglik = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H))) > } > # data & args > X = read($X) > X = X+1 # change product IDs to be 1-based, rather than 0-based > V = table(X[,1], X[,2]) > V = V[1:$size,1:$size] > max_iteration = as.integer($maxiter) > rank = as.integer($rank) > # run PNMF and evaluate > [W, H] = pnmf(V, max_iteration, rank) > negloglik_temp = negloglikfunc(V, W, H) > # write outputs > negloglik = matrix(negloglik_temp, rows=1, cols=1) > write(negloglik, $negloglikout) > write(W, $Wout) > write(H, $Hout) > {code} > The expectation would be that such modularization at the DML level should be > allowed without any impact on performance. > Details: > - Data: Amazon product co-purchasing dataset from Stanford > [http://snap.stanford.edu/data/amazon0601.html | > http://snap.stanford.edu/data/amazon0601.html] > - Execution mode: Spark {{MLContext}}, but should be applicable to > command-line invocation as well. 
> - Error message: > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:415) > at > org.apache.sysml.runtime.matrix.data.MatrixBlock.sparseToDense(MatrixBlock.java:1212) > at > org.apache.sysml.runtime.matrix.data.MatrixBlock.examSparsity(MatrixBlock.java:1103) > at > org.apache.sysml.runtime.instructions.cp.MatrixMatrixArithmeticCPInstruction.processInstruction(MatrixMatrixArithmeticCPInstruction.java:60) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:309) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:227) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:169) > at > org.apache.sysml.runtime.controlprogram.WhileProgramBlock.execute(WhileProgramBlock.java:183) > at > org.apache.sysml.runtime.controlprogram.FunctionProgramBlock.execute(FunctionProgramBlock.java:115) > at >
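For reference, the multiplicative updates in the quoted PNMF script can be transliterated to plain Python on a tiny matrix. This is only an illustrative sketch of the math using hypothetical pure-Python helpers; it says nothing about the SystemML runtime path that triggers the OOM.

```python
# Sketch of the PNMF multiplicative updates from the quoted DML script,
# on tiny Python lists. Helper names (matmul, t, ew, ...) are
# hypothetical; this is illustrative only, not the SystemML runtime.
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def t(A):  # transpose, as in DML's t()
    return [list(col) for col in zip(*A)]

def ew(A, B, op):  # element-wise op on equal-shaped matrices
    return [[op(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def col_sums(A):
    return [sum(col) for col in zip(*A)]

def row_sums(A):
    return [sum(row) for row in A]

def pnmf(V, max_iteration, rank, rng):
    n, m = len(V), len(V[0])
    W = [[rng.uniform(0, 0.01) for _ in range(rank)] for _ in range(n)]
    H = [[rng.uniform(0, 0.01) for _ in range(m)] for _ in range(rank)]
    for _ in range(max_iteration):
        # H = (H * (t(W) %*% (V / (W %*% H)))) / t(colSums(W))
        Q = ew(V, matmul(W, H), lambda v, wh: v / wh)
        cs = col_sums(W)
        H = [[h * num / cs[i] for h, num in zip(H[i], numer_row)]
             for i, numer_row in enumerate(matmul(t(W), Q))]
        # W = (W * ((V / (W %*% H)) %*% t(H))) / t(rowSums(H))
        Q = ew(V, matmul(W, H), lambda v, wh: v / wh)
        rs = row_sums(H)
        W = [[w * num / rs[j]
              for j, (w, num) in enumerate(zip(W_row, numer_row))]
             for W_row, numer_row in zip(W, matmul(Q, t(H)))]
    return W, H

rng = random.Random(42)
V = [[1.0, 2.0], [3.0, 4.0]]
W, H = pnmf(V, max_iteration=50, rank=2, rng=rng)
print(len(W), len(W[0]), len(H), len(H[0]))  # shapes: 2 2 2 2
```

Whether the loop lives inline or inside a {{pnmf}} UDF, the computation is identical, which is why the expectation in the report is that the UDF refactoring should carry no performance penalty.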
[jira] [Comment Edited] (SYSTEMML-512) DML Script With UDFs Results In Out Of Memory Error As Compared to Without UDFs
[ https://issues.apache.org/jira/browse/SYSTEMML-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151623#comment-15151623 ] Mike Dusenberry edited comment on SYSTEMML-512 at 2/18/16 2:52 AM: --- [~mboehm7] I've added two Scala files with code that exhibits the issue. {{test1.scala}} works correctly, and {{test2.scala}} has the issue described above. The only difference is the PNMF script stored in {{val pnmf = ...}}. To replicate this, I used {{$SPARK_HOME/bin/spark-shell --master local[*] --driver-memory 1G --jars $SYSTEMML_HOME/target/SystemML.jar}}, and then {{:load test1.scala}} and {{:load test2.scala}} to run the scripts. You will need the Amazon data in the same directory. Also, smaller data sizes (2000) will allow {{test2.scala}} to run to completion, but it will run much slower than {{test1.scala}}. was (Author: mwdus...@us.ibm.com): [~mboehm7] I've added two Scala files with code that exhibits the issue. {{test1.scala}} works correctly, and {{test2.scala}} has the issue described above. The only difference is the PNMF script stored in {{val pnmf = ...}}. To replicate this, I used {{$SPARK_HOME/bin/spark-shell --master local[*] --driver-memory 1G --jars $SYSTEMML_HOME/target/SystemML.jar}}, and then {{:load test1.scala}} and {{:load test2.scala}} to run the scripts. You will need the Amazon data in the same directory. 
> DML Script With UDFs Results In Out Of Memory Error As Compared to Without > UDFs > --- > > Key: SYSTEMML-512 > URL: https://issues.apache.org/jira/browse/SYSTEMML-512 > Project: SystemML > Issue Type: Bug >Reporter: Mike Dusenberry > Attachments: test1.scala, test2.scala > > > Currently, the following script for running a simple version of Poisson > non-negative matrix factorization (PNMF) runs in linear time as desired: > {code} > # data & args > X = read($X) > X = X+1 # change product IDs to be 1-based, rather than 0-based > V = table(X[,1], X[,2]) > V = V[1:$size,1:$size] > max_iteration = as.integer($maxiter) > rank = as.integer($rank) > # run PNMF > n = nrow(V) > m = ncol(V) > range = 0.01 > W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform") > H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform") > i=0 > while(i < max_iteration) { > H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W)) > W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H)) > i = i + 1; > } > # compute negative log-likelihood > negloglik_temp = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H))) > # write outputs > negloglik = matrix(negloglik_temp, rows=1, cols=1) > write(negloglik, $negloglikout) > write(W, $Wout) > write(H, $Hout) > {code} > However, a small refactoring of this same script to pull the core PNMF > algorithm and the negative log-likelihood computation out into separate UDFs > results in non-linear runtime and a Java out of memory heap error on the same > dataset. 
> {code} > pnmf = function(matrix[double] V, integer max_iteration, integer rank) return > (matrix[double] W, matrix[double] H) { > n = nrow(V) > m = ncol(V) > > range = 0.01 > W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform") > H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform") > > i=0 > while(i < max_iteration) { > H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W)) > W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H)) > i = i + 1; > } > } > negloglikfunc = function(matrix[double] V, matrix[double] W, matrix[double] > H) return (double negloglik) { > negloglik = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H))) > } > # data & args > X = read($X) > X = X+1 # change product IDs to be 1-based, rather than 0-based > V = table(X[,1], X[,2]) > V = V[1:$size,1:$size] > max_iteration = as.integer($maxiter) > rank = as.integer($rank) > # run PNMF and evaluate > [W, H] = pnmf(V, max_iteration, rank) > negloglik_temp = negloglikfunc(V, W, H) > # write outputs > negloglik = matrix(negloglik_temp, rows=1, cols=1) > write(negloglik, $negloglikout) > write(W, $Wout) > write(H, $Hout) > {code} > The expectation would be that such modularization at the DML level should be > allowed without any impact on performance. > Details: > - Data: Amazon product co-purchasing dataset from Stanford > [http://snap.stanford.edu/data/amazon0601.html | > http://snap.stanford.edu/data/amazon0601.html] > - Execution mode: Spark {{MLContext}}, but should be applicable to > command-line invocation as well. > - Error message: > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:415) > at >
[jira] [Commented] (SYSTEMML-512) DML Script With UDFs Results In Out Of Memory Error As Compared to Without UDFs
[ https://issues.apache.org/jira/browse/SYSTEMML-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151623#comment-15151623 ] Mike Dusenberry commented on SYSTEMML-512: -- [~mboehm7] I've added two Scala files with code that exhibits the issue. {{test1.scala}} works correctly, and {{test2.scala}} has the issue described above. The only difference is the PNMF script stored in {{val pnmf = ...}}. To replicate this, I used {{$SPARK_HOME/bin/spark-shell --master local[*] --driver-memory 1G --jars $SYSTEMML_HOME/target/SystemML.jar}}, and then {{:load test1.scala}} and {{:load test2.scala}} to run the scripts. You will need the Amazon data in the same directory. > DML Script With UDFs Results In Out Of Memory Error As Compared to Without > UDFs > --- > > Key: SYSTEMML-512 > URL: https://issues.apache.org/jira/browse/SYSTEMML-512 > Project: SystemML > Issue Type: Bug >Reporter: Mike Dusenberry > Attachments: test1.scala, test2.scala > > > Currently, the following script for running a simple version of Poisson > non-negative matrix factorization (PNMF) runs in linear time as desired: > {code} > # data & args > X = read($X) > X = X+1 # change product IDs to be 1-based, rather than 0-based > V = table(X[,1], X[,2]) > V = V[1:$size,1:$size] > max_iteration = as.integer($maxiter) > rank = as.integer($rank) > # run PNMF > n = nrow(V) > m = ncol(V) > range = 0.01 > W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform") > H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform") > i=0 > while(i < max_iteration) { > H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W)) > W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H)) > i = i + 1; > } > # compute negative log-likelihood > negloglik_temp = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H))) > # write outputs > negloglik = matrix(negloglik_temp, rows=1, cols=1) > write(negloglik, $negloglikout) > write(W, $Wout) > write(H, $Hout) > {code} > However, a small refactoring of this same script to pull 
the core PNMF > algorithm and the negative log-likelihood computation out into separate UDFs > results in non-linear runtime and a Java out of memory heap error on the same > dataset. > {code} > pnmf = function(matrix[double] V, integer max_iteration, integer rank) return > (matrix[double] W, matrix[double] H) { > n = nrow(V) > m = ncol(V) > > range = 0.01 > W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform") > H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform") > > i=0 > while(i < max_iteration) { > H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W)) > W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H)) > i = i + 1; > } > } > negloglikfunc = function(matrix[double] V, matrix[double] W, matrix[double] > H) return (double negloglik) { > negloglik = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H))) > } > # data & args > X = read($X) > X = X+1 # change product IDs to be 1-based, rather than 0-based > V = table(X[,1], X[,2]) > V = V[1:$size,1:$size] > max_iteration = as.integer($maxiter) > rank = as.integer($rank) > # run PNMF and evaluate > [W, H] = pnmf(V, max_iteration, rank) > negloglik_temp = negloglikfunc(V, W, H) > # write outputs > negloglik = matrix(negloglik_temp, rows=1, cols=1) > write(negloglik, $negloglikout) > write(W, $Wout) > write(H, $Hout) > {code} > The expectation would be that such modularization at the DML level should be > allowed without any impact on performance. > Details: > - Data: Amazon product co-purchasing dataset from Stanford > [http://snap.stanford.edu/data/amazon0601.html | > http://snap.stanford.edu/data/amazon0601.html] > - Execution mode: Spark {{MLContext}}, but should be applicable to > command-line invocation as well. 
> - Error message: > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:415) > at > org.apache.sysml.runtime.matrix.data.MatrixBlock.sparseToDense(MatrixBlock.java:1212) > at > org.apache.sysml.runtime.matrix.data.MatrixBlock.examSparsity(MatrixBlock.java:1103) > at > org.apache.sysml.runtime.instructions.cp.MatrixMatrixArithmeticCPInstruction.processInstruction(MatrixMatrixArithmeticCPInstruction.java:60) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:309) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:227) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:169) > at > org.apache.sysml.runtime.controlprogram.WhileProgramBlock.execute(WhileProgramBlock.java:183) > at >
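The multiplicative-update rules in the DML scripts above are easier to sanity-check against a small reference implementation. The following is a hypothetical NumPy sketch of the same PNMF loop and negative log-likelihood (the names `pnmf` and `negloglik` are illustrative and not part of SystemML; DML's `t()`, `colSums`, and `rowSums` map onto `.T`, `sum(axis=0)`, and `sum(axis=1)`):

```python
import numpy as np

def pnmf(V, rank, max_iteration=10, seed=0):
    """Multiplicative-update PNMF, mirroring the DML update rules."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.uniform(0, 0.01, size=(n, rank))
    H = rng.uniform(0, 0.01, size=(rank, m))
    for _ in range(max_iteration):
        # H = (H * (t(W) %*% (V/(W%*%H)))) / t(colSums(W))
        H = H * (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]
        # W = (W * ((V/(W%*%H)) %*% t(H))) / t(rowSums(H))
        W = W * ((V / (W @ H)) @ H.T) / H.sum(axis=1)[None, :]
    return W, H

def negloglik(V, W, H):
    # -1 * (sum(V*log(W%*%H)) - colSums(W) %*% rowSums(H))
    return -(np.sum(V * np.log(W @ H)) - W.sum(axis=0) @ H.sum(axis=1))
```

Because the initial factors are strictly positive and the updates only multiply by nonnegative ratios, `W @ H` stays positive and the logarithm remains well-defined for positive `V`.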
[jira] [Updated] (SYSTEMML-516) Index Range Slicing Should Allow Implicit Upper Or Lower Bounds
[ https://issues.apache.org/jira/browse/SYSTEMML-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dusenberry updated SYSTEMML-516: - Description: DML allows for index slicing of matrices for specified ranges, as in {{X[1:4, 2:6]}}. However, this currently requires that *both* a lower *and* upper bound be specified. It would be useful to be able to specify *either* a lower *or* upper bound, with the missing bound implicitly added internally. This would allow for scenarios such as selecting all columns *except* the first one, as in {code} data = rand(rows=10, cols=20, min=0, max=1, pdf="uniform", sparsity=0.2) X = X[,2:] # select all rows, and all columns except the first one {code}. was: DML allows for index slicing of matrices for specified ranges, as in {{X[1:4, 2:6]}}. However, this currently requires that *both* a lower *and* upper bound be specified. It would be useful to be able to specify *either* a lower *or* upper bound, with the missing bound implicitly added internally. This would allow for scenarios such as selecting all columns *except* the first one, as in {code} data = rand(rows=10, cols=20, min=0, max=1, pdf="uniform", sparsity=0.2) X = X[,2:] # select all rows, and all columns except the first one {code}. > Index Range Slicing Should Allow Implicit Upper Or Lower Bounds > --- > > Key: SYSTEMML-516 > URL: https://issues.apache.org/jira/browse/SYSTEMML-516 > Project: SystemML > Issue Type: Improvement >Reporter: Mike Dusenberry > > DML allows for index slicing of matrices for specified ranges, as in {{X[1:4, > 2:6]}}. However, this currently requires that *both* a lower *and* upper > bound be specified. > It would be useful to be able to specify *either* a lower *or* upper bound, > with the missing bound implicitly added internally. 
This would allow for > scenarios such as selecting all columns *except* the first one, as in > {code} > data = rand(rows=10, cols=20, min=0, max=1, pdf="uniform", sparsity=0.2) > X = X[,2:] # select all rows, and all columns except the first one > {code}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SYSTEMML-516) Index Range Slicing Should Allow Implicit Upper Or Lower Bounds
[ https://issues.apache.org/jira/browse/SYSTEMML-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dusenberry updated SYSTEMML-516: - Description: DML allows for index slicing of matrices for specified ranges, as in {{X[1:4, 2:6]}}. However, this currently requires that *both* a lower *and* upper bound be specified for a given row or column range. It would be useful to be able to specify *either* a lower *or* upper bound, with the missing bound implicitly added internally. This would allow for scenarios such as selecting all columns *except* the first one, as in {code} data = rand(rows=10, cols=20, min=0, max=1, pdf="uniform", sparsity=0.2) X = X[1:4, 2:] # select rows 1 to 4, and columns 2 to ncol(X) {code}. This is the same functionality that [NumPy provides |http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html]. was: DML allows for index slicing of matrices for specified ranges, as in {{X[1:4, 2:6]}}. However, this currently requires that *both* a lower *and* upper bound be specified for a given row or column range. It would be useful to be able to specify *either* a lower *or* upper bound, with the missing bound implicitly added internally. This would allow for scenarios such as selecting all columns *except* the first one, as in {code} data = rand(rows=10, cols=20, min=0, max=1, pdf="uniform", sparsity=0.2) X = X[1:4, 2:] # select rows 1 to 4, and columns 2 to ncol(X) {code}. > Index Range Slicing Should Allow Implicit Upper Or Lower Bounds > --- > > Key: SYSTEMML-516 > URL: https://issues.apache.org/jira/browse/SYSTEMML-516 > Project: SystemML > Issue Type: Improvement >Reporter: Mike Dusenberry > > DML allows for index slicing of matrices for specified ranges, as in {{X[1:4, > 2:6]}}. However, this currently requires that *both* a lower *and* upper > bound be specified for a given row or column range. 
> It would be useful to be able to specify *either* a lower *or* upper bound, > with the missing bound implicitly added internally. This would allow for > scenarios such as selecting all columns *except* the first one, as in > {code} > data = rand(rows=10, cols=20, min=0, max=1, pdf="uniform", sparsity=0.2) > X = X[1:4, 2:] # select rows 1 to 4, and columns 2 to ncol(X) > {code}. > This is the same functionality that [NumPy provides > |http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html].
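For comparison, the NumPy behavior referenced above already lets either bound of a slice be omitted, defaulting to the array edge. A small illustrative sketch (keeping in mind that NumPy is 0-based while DML is 1-based, so the proposed DML {{X[1:4, 2:]}} roughly corresponds to NumPy's {{X[0:4, 1:]}}):

```python
import numpy as np

X = np.arange(20).reshape(4, 5)

# Omitting the upper bound selects through the last element; omitting the
# lower bound starts at the first element.
all_but_first_col = X[:, 1:]  # all rows, all columns except the first
first_two_rows = X[:2]        # rows before index 2, all columns
```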
[jira] [Created] (SYSTEMML-516) Index Range Slicing Should Allow Implicit Upper Or Lower Bounds
Mike Dusenberry created SYSTEMML-516: Summary: Index Range Slicing Should Allow Implicit Upper Or Lower Bounds Key: SYSTEMML-516 URL: https://issues.apache.org/jira/browse/SYSTEMML-516 Project: SystemML Issue Type: Improvement Reporter: Mike Dusenberry DML allows for index slicing of matrices for specified ranges, as in {{X[1:4, 2:6]}}. However, this currently requires that *both* a lower *and* upper bound be specified. It would be useful to be able to specify *either* a lower *or* upper bound, with the missing bound implicitly added internally. This would allow for scenarios such as selecting all columns *except* the first one, as in {code} data = rand(rows=10, cols=20, min=0, max=1, pdf="uniform", sparsity=0.2) X = X[,2:] # select all rows, and all columns except the first one {code}.
[jira] [Updated] (SYSTEMML-512) DML Script With UDFs Results In Out Of Memory Error As Compared to Without UDFs
[ https://issues.apache.org/jira/browse/SYSTEMML-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Dusenberry updated SYSTEMML-512: - Summary: DML Script With UDFs Results In Out Of Memory Error As Compared to Without UDFs (was: DML Script With UDFs Results In Out Of Memory Error) > DML Script With UDFs Results In Out Of Memory Error As Compared to Without > UDFs > --- > > Key: SYSTEMML-512 > URL: https://issues.apache.org/jira/browse/SYSTEMML-512 > Project: SystemML > Issue Type: Bug >Reporter: Mike Dusenberry > > Currently, the following script for running a simple version of Poisson > non-negative matrix factorization (PNMF) runs in linear time as desired: > {code} > # data & args > X = read($X) > X = X+1 # change product IDs to be 1-based, rather than 0-based > V = table(X[,1], X[,2]) > V = V[1:$size,1:$size] > max_iteration = as.integer($maxiter) > rank = as.integer($rank) > # run PNMF > n = nrow(V) > m = ncol(V) > range = 0.01 > W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform") > H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform") > i=0 > while(i < max_iteration) { > H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W)) > W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H)) > i = i + 1; > } > # compute negative log-likelihood > negloglik_temp = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H))) > # write outputs > negloglik = matrix(negloglik_temp, rows=1, cols=1) > write(negloglik, $negloglikout) > write(W, $Wout) > write(H, $Hout) > {code} > However, a small refactoring of this same script to pull the core PNMF > algorithm and the negative log-likelihood computation out into separate UDFs > results in non-linear runtime and a Java out of memory heap error on the same > dataset. 
> {code} > pnmf = function(matrix[double] V, integer max_iteration, integer rank) return > (matrix[double] W, matrix[double] H) { > n = nrow(V) > m = ncol(V) > > range = 0.01 > W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform") > H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform") > > i=0 > while(i < max_iteration) { > H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W)) > W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H)) > i = i + 1; > } > } > negloglikfunc = function(matrix[double] V, matrix[double] W, matrix[double] > H) return (double negloglik) { > negloglik = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H))) > } > # data & args > X = read($X) > X = X+1 # change product IDs to be 1-based, rather than 0-based > V = table(X[,1], X[,2]) > V = V[1:$size,1:$size] > max_iteration = as.integer($maxiter) > rank = as.integer($rank) > # run PNMF and evaluate > [W, H] = pnmf(V, max_iteration, rank) > negloglik_temp = negloglikfunc(V, W, H) > # write outputs > negloglik = matrix(negloglik_temp, rows=1, cols=1) > write(negloglik, $negloglikout) > write(W, $Wout) > write(H, $Hout) > {code} > The expectation would be that such modularization at the DML level should be > allowed without any impact on performance. > Details: > - Data: Amazon product co-purchasing dataset from Stanford > [http://snap.stanford.edu/data/amazon0601.html | > http://snap.stanford.edu/data/amazon0601.html] > - Execution mode: Spark {{MLContext}}, but should be applicable to > command-line invocation as well. 
> - Error message: > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:415) > at > org.apache.sysml.runtime.matrix.data.MatrixBlock.sparseToDense(MatrixBlock.java:1212) > at > org.apache.sysml.runtime.matrix.data.MatrixBlock.examSparsity(MatrixBlock.java:1103) > at > org.apache.sysml.runtime.instructions.cp.MatrixMatrixArithmeticCPInstruction.processInstruction(MatrixMatrixArithmeticCPInstruction.java:60) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:309) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:227) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:169) > at > org.apache.sysml.runtime.controlprogram.WhileProgramBlock.execute(WhileProgramBlock.java:183) > at > org.apache.sysml.runtime.controlprogram.FunctionProgramBlock.execute(FunctionProgramBlock.java:115) > at > org.apache.sysml.runtime.instructions.cp.FunctionCallCPInstruction.processInstruction(FunctionCallCPInstruction.java:177) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:309) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:227) >