Re: function default parameters

2017-04-21 Thread dusenberrymw
Yeah we should adopt the syntax that R and Python both use, in which default 
arguments are defined in the function definition.  

Primitive types such as ints and strings can be set directly in the function 
definition, and more complex types such as matrices can simply use a null 
value as the default, followed by a conditional assignment within the 
function body.

In R:
```
f <- function(x=3)
  x

f()  # 3
f(2)  # 2
```

```
f <- function(x=NULL) {
  if (is.null(x))
    x = matrix(4, 1, 10)
  x
}

f()  # matrix of 4's
f(matrix(2, 5, 12))  # matrix of 2's
```

Same thing in Python, except it uses `None` instead of `NULL`:
```
def f(x=3):
  return x

f()  # 3
f(2)  # 2
```

```
def f(x=None):
  if x is None:
    x = [1, 2, 3]
  return x

f()  # list [1,2,3]
f([4,5,6])  # list [4,5,6]
```
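
As an aside, the `x=None` idiom above is also what you want in Python for another reason: default values are evaluated once, at function definition time, so a mutable default such as `x=[1,2,3]` would be shared across calls. A minimal sketch of the pitfall (the function names here are made up for illustration):

```python
def append_bad(item, acc=[]):
    # The default list is created once, when the function is defined,
    # so every call without an explicit acc mutates the same list.
    acc.append(item)
    return acc

def append_good(item, acc=None):
    # None is a safe sentinel: a fresh list is created on each call.
    if acc is None:
        acc = []
    acc.append(item)
    return acc

a = append_bad(1)
b = append_bad(2)
print(a is b, a)                       # True [1, 2] -- shared state
print(append_good(1), append_good(2))  # [1] [2] -- independent lists
```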


--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.


> On Apr 21, 2017, at 5:40 PM, Deron Eriksson  wrote:
> 
> BTW, that is assuming our algorithms have been converted to functions.
> Deron
> 
> 
> On Fri, Apr 21, 2017 at 5:37 PM, Deron Eriksson 
> wrote:
> 
>> Thank you Matthias. I highly agree with your idea about having a default
>> specification similar to R WRT the function signatures for default values.
>> 
>> This becomes a significant issue for some of our algorithms, where they
>> might take in 10 arguments but default values should typically be used
>> for 6 or 7 of the arguments.
>> 
>> Deron
>> 
>> 
>> On Fri, Apr 21, 2017 at 5:25 PM, Matthias Boehm 
>> wrote:
>> 
>>> well, for arguments passed into dml scripts there is of course ifdef($b,
>>> 2)
>>> but for functions there is indeed no good support. At runtime level we
>>> still support default parameters for scalar arguments at the tail of the
>>> parameter list but I guess at one point the corresponding parser support
>>> was discontinued.
>>> 
>>> I personally would like a default specification similar to R in the
>>> function signature with the corresponding function calls that bind values
>>> to a subset of parameters.
>>> 
>>> Regards,
>>> Matthias
>>> 
>>> On Fri, Apr 21, 2017 at 4:18 PM, Deron Eriksson 
>>> wrote:
>>> 
>>>> Is there a way to set default parameter values using DML? I believe
>>>> both R and Python offer this capability.
>>>>
>>>> The only solution I could come up with using DML is to pass in a
>>>> variable that is NaN, cast it to a string, and use that string in an
>>>> if conditional statement.
>>>>
>>>> addone = function(double b) return (double a) {
>>>>   c = '' + b;
>>>>   if (c == 'NaN') {
>>>>     b = 2.0
>>>>   }
>>>>   a = b + 1;
>>>> }
>>>>
>>>> z = 0.0/0.0;
>>>> x = addone(z);
>>>> print(x);
>>>> y = addone(4.0);
>>>> print(y);
>>>>
>>>> Is there a cleaner way to accomplish this, or is DML lacking this R
>>>> feature?
>>>>
>>>> Deron
>>>>
>>>> --
>>>> Deron Eriksson
>>>> Spark Technology Center
>>>> http://www.spark.tc/
>>>>
>>> 
>> 
>> 
>> 
>> --
>> Deron Eriksson
>> Spark Technology Center
>> http://www.spark.tc/
>> 
>> 
> 
> 
> -- 
> Deron Eriksson
> Spark Technology Center
> http://www.spark.tc/


Re: function default parameters

2017-04-21 Thread Deron Eriksson
BTW, that is assuming our algorithms have been converted to functions.
Deron


On Fri, Apr 21, 2017 at 5:37 PM, Deron Eriksson 
wrote:

> Thank you Matthias. I highly agree with your idea about having a default
> specification similar to R WRT the function signatures for default values.
>
> This becomes a significant issue for some of our algorithms, where they
> might take in 10 arguments but default values should typically be used
> for 6 or 7 of the arguments.
>
> Deron
>
>
> On Fri, Apr 21, 2017 at 5:25 PM, Matthias Boehm 
> wrote:
>
>> well, for arguments passed into dml scripts there is of course ifdef($b,
>> 2)
>> but for functions there is indeed no good support. At runtime level we
>> still support default parameters for scalar arguments at the tail of the
>> parameter list but I guess at one point the corresponding parser support
>> was discontinued.
>>
>> I personally would like a default specification similar to R in the
>> function signature with the corresponding function calls that bind values
>> to a subset of parameters.
>>
>> Regards,
>> Matthias
>>
>> On Fri, Apr 21, 2017 at 4:18 PM, Deron Eriksson 
>> wrote:
>>
>> > Is there a way to set default parameter values using DML? I believe
>> both R
>> > and Python offer this capability.
>> >
>> > The only solution I could come up with using DML is to pass in a
>> variable
>> > that is NaN and cast this to a string and use this string in an if
>> > conditional statement.
>> >
>> > addone = function(double b) return (double a) {
>> >   c = '' + b;
>> >   if (c == 'NaN') {
>> >     b = 2.0
>> >   }
>> >   a = b + 1;
>> > }
>> >
>> > z=0.0/0.0;
>> > x = addone(z);
>> > print(x);
>> > y = addone(4.0);
>> > print(y);
>> >
>> > Is there a cleaner way to accomplish this, or is DML lacking this R
>> > feature?
>> >
>> > Deron
>> >
>> > --
>> > Deron Eriksson
>> > Spark Technology Center
>> > http://www.spark.tc/
>> >
>>
>
>
>
> --
> Deron Eriksson
> Spark Technology Center
> http://www.spark.tc/
>
>


-- 
Deron Eriksson
Spark Technology Center
http://www.spark.tc/


Re: function default parameters

2017-04-21 Thread Deron Eriksson
Thank you Matthias. I highly agree with your idea about having a default
specification similar to R WRT the function signatures for default values.

This becomes a significant issue for some of our algorithms, where they
might take in 10 arguments but default values should typically be used
for 6 or 7 of the arguments.

Deron


On Fri, Apr 21, 2017 at 5:25 PM, Matthias Boehm 
wrote:

> well, for arguments passed into dml scripts there is of course ifdef($b, 2)
> but for functions there is indeed no good support. At runtime level we
> still support default parameters for scalar arguments at the tail of the
> parameter list but I guess at one point the corresponding parser support
> was discontinued.
>
> I personally would like a default specification similar to R in the
> function signature with the corresponding function calls that bind values
> to a subset of parameters.
>
> Regards,
> Matthias
>
> On Fri, Apr 21, 2017 at 4:18 PM, Deron Eriksson 
> wrote:
>
> > Is there a way to set default parameter values using DML? I believe both
> R
> > and Python offer this capability.
> >
> > The only solution I could come up with using DML is to pass in a variable
> > that is NaN and cast this to a string and use this string in an if
> > conditional statement.
> >
> > addone = function(double b) return (double a) {
> >   c = '' + b;
> >   if (c == 'NaN') {
> >     b = 2.0
> >   }
> >   a = b + 1;
> > }
> >
> > z=0.0/0.0;
> > x = addone(z);
> > print(x);
> > y = addone(4.0);
> > print(y);
> >
> > Is there a cleaner way to accomplish this, or is DML lacking this R
> > feature?
> >
> > Deron
> >
> > --
> > Deron Eriksson
> > Spark Technology Center
> > http://www.spark.tc/
> >
>



-- 
Deron Eriksson
Spark Technology Center
http://www.spark.tc/


Re: function default parameters

2017-04-21 Thread Matthias Boehm
well, for arguments passed into dml scripts there is of course ifdef($b, 2)
but for functions there is indeed no good support. At runtime level we
still support default parameters for scalar arguments at the tail of the
parameter list but I guess at one point the corresponding parser support
was discontinued.

I personally would like a default specification similar to R in the
function signature with the corresponding function calls that bind values
to a subset of parameters.

Regards,
Matthias

On Fri, Apr 21, 2017 at 4:18 PM, Deron Eriksson 
wrote:

> Is there a way to set default parameter values using DML? I believe both R
> and Python offer this capability.
>
> The only solution I could come up with using DML is to pass in a variable
> that is NaN and cast this to a string and use this string in an if
> conditional statement.
>
> addone = function(double b) return (double a) {
>   c = '' + b;
>   if (c == 'NaN') {
>     b = 2.0
>   }
>   a = b + 1;
> }
>
> z=0.0/0.0;
> x = addone(z);
> print(x);
> y = addone(4.0);
> print(y);
>
> Is there a cleaner way to accomplish this, or is DML lacking this R
> feature?
>
> Deron
>
> --
> Deron Eriksson
> Spark Technology Center
> http://www.spark.tc/
>


function default parameters

2017-04-21 Thread Deron Eriksson
Is there a way to set default parameter values using DML? I believe both R
and Python offer this capability.

The only solution I could come up with using DML is to pass in a variable
that is NaN, cast it to a string, and use that string in an if conditional
statement.

addone = function(double b) return (double a) {
  c = '' + b;
  if (c == 'NaN') {
    b = 2.0
  }
  a = b + 1;
}

z=0.0/0.0;
x = addone(z);
print(x);
y = addone(4.0);
print(y);

Is there a cleaner way to accomplish this, or is DML lacking this R feature?

Deron

-- 
Deron Eriksson
Spark Technology Center
http://www.spark.tc/


Jenkins build is back to normal : SystemML-DailyTest #943

2017-04-21 Thread jenkins
See 



Re: Randomly Selecting rows from a dataframe

2017-04-21 Thread Matthias Boehm
you can take, for example, a 1% sample of rows via a permutation matrix
(specifically, a selection matrix) as follows:

I = (rand(rows=nrow(X), cols=1, min=0, max=1) <= 0.01);
P = removeEmpty(target=diag(I), margin="rows");
Xsample = P %*% X;

or via removeEmpty and selection vector

I = (rand(rows=nrow(X), cols=1, min=0, max=1) <= 0.01);
Xsample = removeEmpty(target=X, margin="rows", select=I);

Both should be compiled internally to very similar plans.
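
For readers more familiar with NumPy, here is a small sketch of the same two formulations (NumPy is used as a stand-in for DML's matrix operations; `rng` and `X` are made-up names), showing that both select identical rows:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 10))

# Selection vector: keep each row with probability ~1%, the analogue of
# I = (rand(rows=nrow(X), cols=1, min=0, max=1) <= 0.01) above.
I = rng.uniform(size=1000) <= 0.01

# Variant 1: build the selection matrix P = removeEmpty(diag(I), "rows")
# and multiply: Xsample = P %*% X.
P = np.diag(I.astype(float))
P = P[I, :]              # drop the all-zero rows of diag(I), as removeEmpty does
sample1 = P @ X

# Variant 2: direct row selection, the analogue of
# removeEmpty(target=X, margin="rows", select=I).
sample2 = X[I, :]

assert np.allclose(sample1, sample2)
```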

Regards,
Matthias

On Fri, Apr 21, 2017 at 1:42 PM, arijit chakraborty 
wrote:

> Hi,
>
>
> Suppose I've a dataframe of 10 variables (X1-X10) and have 1000 rows. Now
> I want to randomly select rows so that I've a subset of the dataset.
>
>
> Can anyone please help me to solve this problem?
>
>
> I tried the following code:
>
>
> randSample = sample(nrow(dataframe), 200);
>
>
> This gives me a column matrix with the positions of the randomly selected
> rows. But I could not figure out how to use this matrix to subset the
> original dataframe.
>
>
> Thank you!
>
>
> Arijit
>


Re: Table

2017-04-21 Thread arijit chakraborty
Thank you Matthias for your answer! I'm still not very clear about the concept 
of "table". I'll explore it further to get a better understanding.


Thank you!


From: Matthias Boehm 
Sent: Saturday, April 22, 2017 12:17:23 AM
To: dev@systemml.incubator.apache.org
Subject: Re: Table

The input vectors to table are interpreted as row indexes and column
indexes, respectively. Without weights, we add 1 for each index pair;
otherwise, we add the corresponding weight value to the output cell.

So in your example you have constant row indexes of 1 but a seq(1,10)
for column indexes and hence you get a 1x10 output matrix.


Regards,
Matthias

On 4/21/2017 4:00 AM, arijit chakraborty wrote:
> Hi,
>
>
> I was trying to understand what the "table" function does. In the documents, 
> it says:
>
>
> "Returns the contingency table of two vectors A and B. The resulting table F 
> consists of max(A) rows and max(B) columns."
>
>
> Suppose I've two matrices A and B of this form:
>
>
> A = matrix(1, 1, 10)
> B = matrix(seq(10,1, -1), 10, 1)
>
>
> I should have a matrix of form
>
> C = matrix(1, 10, 1).
>
>
> But I'm getting
>
> C  = matrix(1, 1, 10).
>
>
> Also, what is the difference between "Group form" and "Scalar form"?
>
>
> Thank you!
>
> Arijit
>


Re: GSoC : Getting started contributions

2017-04-21 Thread Krishna Kalyan
Thanks Nakul, Arvind and Matthias for your suggestions.
I am currently playing around with SystemML, will also take a look at
SYSTEMML-546, and plan to run some performance tests on my local system
this weekend.

Regards,
Krishna

On Fri, Apr 21, 2017 at 3:04 PM, Nakul Jindal  wrote:

> Hi Krishna,
>
> What Arvind is describing is in essence a large part of your GSoC proposal.
> You should work on this if and when your proposal gets approved. (we don't
> know whether it has been approved and even if we did, we couldn't say).
> In the meantime, I encourage you to play around with SystemML, go through
> the JIRA site, (look at SYSTEMML-546 as Matthias suggested) and ask
> questions on this mailing list that you may have.
>
> Thanks,
> Nakul
>
>
> On Fri, Apr 21, 2017 at 11:00 AM, Arvind Surve 
> wrote:
>
> > Hi Krishna,
> > There is an immediate need for the SystemML project to run performance
> > testing and analyze results efficiently. Though this is a small part of
> > your overall GSoC project, it's important to start with it and grow from
> > there. In the short term it will help the SystemML project expedite
> > release cycles, and in the long run you will get a head start on the
> > project.
> >
> > What we need in the short run to get performance testing for every
> > release cycle or even beyond:
> >
> > 1. How to set up the environment with configurable parameters quickly.
> >    (We have scripts that may need some tweaking or some additional
> >    configuration.)
> > 2. Run performance scripts with configuration options for data size
> >    (8GB, 80GB, 8000GB etc., or ALL) and different sets of algorithms
> >    (regression, classification, or all).
> > 3. Collect the time required to run each individual algorithm for a
> >    given size, and store it in CSV or any suitable file format for
> >    further processing.
> > 4. Compare the results obtained from step 3 to previous runs (previous
> >    release, RC, etc.).
> > 5. Generate a report indicating failure scenarios, outliers (time taken
> >    more than a tolerance level, say x%), and successful cases, each
> >    separated out so that reading those reports will be easy.
> >
> >  Arvind Surve | Spark Technology Center  | http://www.spark.tc/
> >
> >   From: Matthias Boehm 
> >  To: dev@systemml.incubator.apache.org
> >  Sent: Saturday, April 15, 2017 3:27 PM
> >  Subject: Re: GSoC : Getting started contributions
> >
> > A great issue to start with would be SYSTEMML-546, which aims to cleanup
> > and extend our existing application tests. This would get you in touch
> with
> > DML and PyDML algorithm scripts as well as the R scripts for comparisons.
> >
> > Regards,
> > Matthias
> >
> > On Sat, Apr 15, 2017 at 2:58 PM, Krishna Kalyan <
> krishnakaly...@gmail.com>
> > wrote:
> >
> > > Hello,
> > > I quite recently applied for GSoC. (Proposal below)  [ Automate
> > performance
> > > testing and reporting]
> > >
> > > https://docs.google.com/document/d/1DKWZTWvrvs73GYa1q3XEN5GFo8ALG
> > > jLH2DrIfRsJksA/edit#
> > >
> > > As part of my effort to understand the codebase, I would like to work
> on
> > > minor/medium issues. Could someone from the community please guide with
> > > JIRAs I could work on during my spare time.
> > > (I am comfortable with Python, R and bash).
> > >
> > > Regards,
> > > Krishna
> > >
> >
> >
> >
> >
>


Re: GSoC : Getting started contributions

2017-04-21 Thread Nakul Jindal
Hi Krishna,

What Arvind is describing is in essence a large part of your GSoC proposal.
You should work on this if and when your proposal gets approved. (we don't
know whether it has been approved and even if we did, we couldn't say).
In the meantime, I encourage you to play around with SystemML, go through
the JIRA site, (look at SYSTEMML-546 as Matthias suggested) and ask
questions on this mailing list that you may have.

Thanks,
Nakul


On Fri, Apr 21, 2017 at 11:00 AM, Arvind Surve 
wrote:

> Hi Krishna,
> There is an immediate need for the SystemML project to run performance
> testing and analyze results efficiently. Though this is a small part of
> your overall GSoC project, it's important to start with it and grow from
> there. In the short term it will help the SystemML project expedite
> release cycles, and in the long run you will get a head start on the
> project.
>
> What we need in the short run to get performance testing for every
> release cycle or even beyond:
>
> 1. How to set up the environment with configurable parameters quickly.
>    (We have scripts that may need some tweaking or some additional
>    configuration.)
> 2. Run performance scripts with configuration options for data size
>    (8GB, 80GB, 8000GB etc., or ALL) and different sets of algorithms
>    (regression, classification, or all).
> 3. Collect the time required to run each individual algorithm for a
>    given size, and store it in CSV or any suitable file format for
>    further processing.
> 4. Compare the results obtained from step 3 to previous runs (previous
>    release, RC, etc.).
> 5. Generate a report indicating failure scenarios, outliers (time taken
>    more than a tolerance level, say x%), and successful cases, each
>    separated out so that reading those reports will be easy.
>
>  Arvind Surve | Spark Technology Center  | http://www.spark.tc/
>
>   From: Matthias Boehm 
>  To: dev@systemml.incubator.apache.org
>  Sent: Saturday, April 15, 2017 3:27 PM
>  Subject: Re: GSoC : Getting started contributions
>
> A great issue to start with would be SYSTEMML-546, which aims to cleanup
> and extend our existing application tests. This would get you in touch with
> DML and PyDML algorithm scripts as well as the R scripts for comparisons.
>
> Regards,
> Matthias
>
> On Sat, Apr 15, 2017 at 2:58 PM, Krishna Kalyan 
> wrote:
>
> > Hello,
> > I quite recently applied for GSoC. (Proposal below)  [ Automate
> performance
> > testing and reporting]
> >
> > https://docs.google.com/document/d/1DKWZTWvrvs73GYa1q3XEN5GFo8ALG
> > jLH2DrIfRsJksA/edit#
> >
> > As part of my effort to understand the codebase, I would like to work on
> > minor/medium issues. Could someone from the community please guide with
> > JIRAs I could work on during my spare time.
> > (I am comfortable with Python, R and bash).
> >
> > Regards,
> > Krishna
> >
>
>
>
>


Re: Vector of Matrix

2017-04-21 Thread Matthias Boehm

no, right now, we don't support structs or complex objects.

Regards,
Matthias

On 4/21/2017 4:17 AM, arijit chakraborty wrote:

Hi,


In R (as well as in Python), we can store values as a list within a list. Say 
I've two matrices with different dimensions,

x <- matrix(1:10, ncol=2)
y <- matrix(1:5, ncol=1)


FinalList <- c(x, y)


Is it possible to do this in SystemML? I'm not looking for cbind or rbind.


Thank you!

Arijit



Re: Table

2017-04-21 Thread Matthias Boehm
The input vectors to table are interpreted as row indexes and column
indexes, respectively. Without weights, we add 1 for each index pair;
otherwise, we add the corresponding weight value to the output cell.


So in your example you have constant row indexes of 1 but a seq(1,10) 
for column indexes and hence you get a 1x10 output matrix.
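
To make that semantics concrete, here is a small pure-Python sketch of what `table(A, B)` computes for this example (an illustration only, not SystemML code):

```python
# A supplies row indexes (all 1), B supplies column indexes 1..10,
# mirroring A = matrix(1, 1, 10) and B = seq(1, 10) described above.
A = [1] * 10
B = list(range(1, 11))

rows, cols = max(A), max(B)            # output is max(A) x max(B) = 1 x 10
F = [[0] * cols for _ in range(rows)]
for i, j in zip(A, B):
    F[i - 1][j - 1] += 1               # without weights, each pair adds 1

print(F)  # [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]] -- a 1 x 10 matrix of ones
```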



Regards,
Matthias

On 4/21/2017 4:00 AM, arijit chakraborty wrote:

Hi,


I was trying to understand what the "table" function does. In the documents, it 
says:


"Returns the contingency table of two vectors A and B. The resulting table F 
consists of max(A) rows and max(B) columns."


Suppose I've two matrices A and B of this form:


A = matrix(1, 1, 10)
B = matrix(seq(10,1, -1), 10, 1)


I should have a matrix of form

C = matrix(1, 10, 1).


But I'm getting

C  = matrix(1, 1, 10).


Also, what is the difference between "Group form" and "Scalar form"?


Thank you!

Arijit



Re: Questions about the Compositions of Execution Time

2017-04-21 Thread Mingyang Wang
That's awesome!

Can I take it that, when utilizing these super-sparse permutation matrices, it
is usually better to store them as column vectors and then dynamically
expand them via table()? Currently, all such FK matrices are stored as
sparse matrices in binary format.

Also, as the pmm operator only supports selection, I want to confirm that if FK
would be used multiple times, say, in an iterative algorithm, it is still
better to use dynamic expansion in each iteration rather than materializing
it beforehand (which is simply to reduce read overhead), right?
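
For reference, the dynamic-expansion trick Matthias described earlier (storing FK as a column vector FK2 and reconstructing it via table) can be sketched in NumPy (an illustrative stand-in with made-up dimensions; the names `FK2` and `R` follow his message):

```python
import numpy as np

n, m = 6, 4
rng = np.random.default_rng(0)
FK2 = rng.integers(1, m + 1, size=n)   # 1-based column index per row

# Expansion analogue of FK = table(seq(1, n), FK2, n, m): a selection
# matrix with a single 1 per row, placed at column FK2[i].
FK = np.zeros((n, m))
FK[np.arange(n), FK2 - 1] = 1.0

# FK %*% R gathers rows of R: output row i equals row FK2[i] of R, which
# is why the product can be computed without materializing FK long-term.
R = rng.standard_normal((m, 3))
assert np.allclose(FK @ R, R[FK2 - 1, :])
```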

My cluster has been temporarily down and sorry I cannot compare these
scenarios right now.


Best,
Mingyang

On Fri, Apr 21, 2017 at 12:30 AM Matthias Boehm 
wrote:

> Hi Mingyang,
>
> just out of curiosity, I did a quick experiment with the discussed
> alternative formulation for scenario 1 with the following script
>
> R = read($1)
> S = read($2)
> FK = read($3)
>
> wS = Rand(rows=ncol(S), cols=1, min=0, max=1, pdf="uniform")
> wR = Rand(rows=ncol(R), cols=1, min=0, max=1, pdf="uniform")
> temp = S %*% wS + table(seq(1,nrow(FK)),FK,nrow(FK),1e6) %*% (R %*% wR)
> if(1==1){}
> print(sum(temp))
>
> and after two additional improvements (SYSTEMML-1550 and SYSTEMML-1551), I
> got the following - now reasonable - results:
>
> Total elapsed time: 7.928 sec.
> Total compilation time: 1.802 sec.
> Total execution time:   6.126 sec.
>
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 7/0/0/3.
> Cache writes (WB, FS, HDFS): 5/0/0.
> Cache times (ACQr/m, RLS, EXP): 2.621/0.001/0.511/0.000 sec.
>
> HOP DAGs recompiled (PRED, SB): 0/0.
> HOP DAGs recompile time: 0.000 sec.
> Total JIT compile time: 9.656 sec.
> Total JVM GC count: 3.
> Total JVM GC time:  1.601 sec.
>
> Heavy hitter instructions (name, time, count):
> -- 1)   ba+*        3.052 sec   3
> -- 2)   rexpand     2.790 sec   1
> -- 3)   uak+        0.169 sec   1
> -- 4)   +           0.095 sec   1
> -- 5)   rand        0.017 sec   2
> -- 6)   print       0.001 sec   1
> -- 7)   ==          0.001 sec   1
> -- 8)   createvar   0.000 sec   10
> -- 9)   rmvar       0.000 sec   11
> -- 10)  assignvar   0.000 sec   1
>
> There is still some potential because we should compile a permutation
> matrix multiply (pmm) instead of materializing this intermediate but this
> pmm operator currently only supports selection but no permutation matrices.
> Thanks again for catching this performance issue.
>
> Regards,
> Matthias
>
> On Thu, Apr 20, 2017 at 11:44 AM, Matthias Boehm 
> wrote:
>
>> 1) Understanding execution plans: Our local bufferpool reads matrices in
>> a lazy manner on the first singlenode, i.e., CP, operation that tries to
>> pin the matrix into memory. Similarly, distributed matrices are read into
>> aggregated memory on the first Spark instruction. Hence, you can
>> differentiate these different scenarios by following the data dependencies,
>> i.e., what kind of instructions use the particular matrix. Spark checkpoint
>> instructions are a good indicator too but there are special cases where
>> they will not exist.
>>
>> 2) Forcing computation: I typically use 'if(1==1){}' to create a
>> statement block cut (and thus a DAG cut) and subsequently simply a
>> 'print(sum(temp))' because we apply most algebraic rewrites only within the
>> scope of individual statement blocks.
>>
>> 3) Permutation matrices: If FK has a single entry of value 1 per row, you
>> could store it as a column vector with FK2 = rowIndexMax(FK) and
>> subsequently reconstruct it via FK = table(seq(1,nrow(FK2)), FK2,
>> nrow(FK2), N), for which we will compile a dedicated operator that does row
>> expansions. You don't necessarily need the last two arguments, which only
>> ensure padding and thus matching dimensions for the subsequent matrix
>> multiplication.
>>
>>
>> Regards,
>> Matthias
>>
>> On 4/20/2017 11:05 AM, Mingyang Wang wrote:
>>
>>> Hi Matthias,
>>>
>>> Thanks for your thorough explanations! And I have some other questions.
>>>
>>> 1. I am curious about the behaviors of the read operation within
>>> createvar.
>>> How can I differentiate whether the inputs are loaded in the driver
>>> memory
>>> or loaded in executors? Can I assume the inputs are loaded in executors
>>> if
>>> a Spark checkpoint instruction is invoked?
>>>
>>> 2. I am also curious how do you put a sum operation in a different DAG?
>>> Currently, I put a "print one entry" instruction within a for loop, is it
>>> sufficient to trigger the whole matrix multiplication without some
>>> shortcuts like a dot product between a row and a column? At least, from
>>> the HOP explain output, the whole matrix multiplication is scheduled.
>>>
>>> 3. About generating a "specific" sparse matrix in SystemML. Say, I need a
>>> sparse matrix of 200,000,000 x 10,000,000 and there is 

Vector of Matrix

2017-04-21 Thread arijit chakraborty
Hi,


In R (as well as in Python), we can store values as a list within a list. Say 
I've two matrices with different dimensions,

x <- matrix(1:10, ncol=2)
y <- matrix(1:5, ncol=1)


FinalList <- c(x, y)


Is it possible to do this in SystemML? I'm not looking for cbind or rbind.


Thank you!

Arijit


Table

2017-04-21 Thread arijit chakraborty
Hi,


I was trying to understand what the "table" function does. In the documents, it 
says:


"Returns the contingency table of two vectors A and B. The resulting table F 
consists of max(A) rows and max(B) columns."


Suppose I've two matrices A and B of this form:


A = matrix(1, 1, 10)
B = matrix(seq(10,1, -1), 10, 1)


I should have a matrix of form

C = matrix(1, 10, 1).


But I'm getting

C  = matrix(1, 1, 10).


Also, what is the difference between "Group form" and "Scalar form"?


Thank you!

Arijit


Re: Questions about the Compositions of Execution Time

2017-04-21 Thread Matthias Boehm
Hi Mingyang,

just out of curiosity, I did a quick experiment with the discussed
alternative formulation for scenario 1 with the following script

R = read($1)
S = read($2)
FK = read($3)
wS = Rand(rows=ncol(S), cols=1, min=0, max=1, pdf="uniform")
wR = Rand(rows=ncol(R), cols=1, min=0, max=1, pdf="uniform")
temp = S %*% wS + table(seq(1,nrow(FK)),FK,nrow(FK),1e6) %*% (R %*% wR)
if(1==1){}
print(sum(temp))

and after two additional improvements (SYSTEMML-1550 and SYSTEMML-1551), I
got the following - now reasonable - results:

Total elapsed time: 7.928 sec.
Total compilation time: 1.802 sec.
Total execution time:   6.126 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 7/0/0/3.
Cache writes (WB, FS, HDFS): 5/0/0.
Cache times (ACQr/m, RLS, EXP): 2.621/0.001/0.511/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.000 sec.
Total JIT compile time: 9.656 sec.
Total JVM GC count: 3.
Total JVM GC time:  1.601 sec.
Heavy hitter instructions (name, time, count):
-- 1)   ba+*        3.052 sec   3
-- 2)   rexpand     2.790 sec   1
-- 3)   uak+        0.169 sec   1
-- 4)   +           0.095 sec   1
-- 5)   rand        0.017 sec   2
-- 6)   print       0.001 sec   1
-- 7)   ==          0.001 sec   1
-- 8)   createvar   0.000 sec   10
-- 9)   rmvar       0.000 sec   11
-- 10)  assignvar   0.000 sec   1

There is still some potential because we should compile a permutation
matrix multiply (pmm) instead of materializing this intermediate but this
pmm operator currently only supports selection but no permutation matrices.
Thanks again for catching this performance issue.

Regards,
Matthias

On Thu, Apr 20, 2017 at 11:44 AM, Matthias Boehm 
wrote:

> 1) Understanding execution plans: Our local bufferpool reads matrices in a
> lazy manner on the first singlenode, i.e., CP, operation that tries to pin
> the matrix into memory. Similarly, distributed matrices are read into
> aggregated memory on the first Spark instruction. Hence, you can
> differentiate these different scenarios by following the data dependencies,
> i.e., what kind of instructions use the particular matrix. Spark checkpoint
> instructions are a good indicator too but there are special cases where
> they will not exist.
>
> 2) Forcing computation: I typically use 'if(1==1){}' to create a statement
> block cut (and thus a DAG cut) and subsequently simply a 'print(sum(temp))'
> because we apply most algebraic rewrites only within the scope of
> individual statement blocks.
>
> 3) Permutation matrices: If FK has a single entry of value 1 per row, you
> could store it as a column vector with FK2 = rowIndexMax(FK) and
> subsequently reconstruct it via FK = table(seq(1,nrow(FK2)), FK2,
> nrow(FK2), N), for which we will compile a dedicated operator that does row
> expansions. You don't necessarily need the last two arguments, which only
> ensure padding and thus matching dimensions for the subsequent matrix
> multiplication.
>
>
> Regards,
> Matthias
>
> On 4/20/2017 11:05 AM, Mingyang Wang wrote:
>
>> Hi Matthias,
>>
>> Thanks for your thorough explanations! And I have some other questions.
>>
>> 1. I am curious about the behaviors of the read operation within
>> createvar.
>> How can I differentiate whether the inputs are loaded in the driver memory
>> or loaded in executors? Can I assume the inputs are loaded in executors if
>> a Spark checkpoint instruction is invoked?
>>
>> 2. I am also curious how do you put a sum operation in a different DAG?
>> Currently, I put a "print one entry" instruction within a for loop, is it
>> sufficient to trigger the whole matrix multiplication without some
>> shortcuts like a dot product between a row and a column? At least, from
>> the HOP explain output, the whole matrix multiplication is scheduled.
>>
>> 3. About generating a "specific" sparse matrix in SystemML. Say, I need a
>> sparse matrix of 200,000,000 x 10,000,000 and there is exactly one
>> non-zero
>> value in each row (the position could be random). Is there any efficient
>> way to do it? Currently, I am generating such matrix externally in text
>> format, and it cannot be easily converted to binary format with a simple
>> read/write script (it took quite a long time and failed).
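
On question 3, a matrix with exactly one nonzero per row has a trivially constructible CSR layout, which sidesteps the text-to-binary conversion entirely; a small NumPy sketch with made-up toy dimensions (the real target of 200,000,000 x 10,000,000 would use the same three arrays, just larger):

```python
import numpy as np

n_rows, n_cols = 8, 20          # tiny stand-ins for 2e8 x 1e7
rng = np.random.default_rng(1)

# CSR arrays: with one nonzero per row, indptr is simply 0, 1, ..., n_rows.
indptr = np.arange(n_rows + 1)
indices = rng.integers(0, n_cols, size=n_rows)   # random column per row
data = np.ones(n_rows)

# Densify (feasible only at toy scale) to check: every row sums to 1.
dense = np.zeros((n_rows, n_cols))
dense[np.arange(n_rows), indices] = data
assert (dense.sum(axis=1) == 1).all()
```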
>>
>>
>> Regards,
>> Mingyang
>>
>> On Thu, Apr 20, 2017 at 2:08 AM Matthias Boehm 
>> wrote:
>>
>> Hi Mingyang,
>>>
>>> thanks for the questions - this is very valuable feedback. I was able to
>>> reproduce your performance issue on scenario 1 and I have a patch, which
>>> I'll push to master tomorrow after a more thorough testing. Below are the
>>> details and the answers to your questions:
>>>
>>> 1) Expected performance and bottlenecks: In general, for these single
>>> operation scripts, the read is indeed the expected bottleneck. However,
>>> excessive GC is usually an