Hi Niketan,

The code you provided works fine. The use of getMatrixCharacteristics
solves the basic execution problem.

However, question #3 is probably not yet resolved. Let me explain the use
case scenario I'm trying to build.

1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of
columns (say 4) which are to be used as features (double), and a column for
the dependent variable (double).
2. When I created the model, I created a data frame (DF2) from DF1 using
only the feature columns and passed that as X. The column with the dependent
value was passed as Y.
3. For calling GLM-predict I'm using another data frame (DF3) of the same
structure but with different Unique IDs (essentially different
records/rows). From that data frame I'm first creating another data frame
(DF4) containing only the columns representing the features. Then I'm sending
DF4 to GLM-predict.
4. The response I get from GLM-predict is the 'means'. Then I'm using the
inline predict script, which returns another data frame (DF5) with ID and
Predicted values.

The question is how do I correlate the ID I'm getting from DF5 with the
Unique ID of the data frame DF3 ?
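
One possible way to approach this, sketched with plain Scala collections rather than Spark or SystemML: number the rows of DF3 before the Unique ID is dropped, then join the predictions back on that row number. This assumes (not confirmed in this thread) that the ID column of DF5 is the 1-based position of each row in the order the rows of DF4 were passed to GLM-predict. The names Record, withRowNumbers and attachPredictions are hypothetical, and the values are made up.

```scala
// Hypothetical stand-in for one row of DF3: a Unique ID plus its features.
case class Record(uniqueId: String, features: Vector[Double])

// Attach a 1-based row number to every record before the Unique ID is dropped.
// ASSUMPTION: the "ID" column of DF5 uses the same 1-based row numbering.
def withRowNumbers(records: Seq[Record]): Seq[(Long, Record)] =
  records.zipWithIndex.map { case (rec, idx) => (idx + 1L, rec) }

// Join predictions (keyed by row number) back to the Unique IDs.
// Rows without a prediction are dropped.
def attachPredictions(
    numbered: Seq[(Long, Record)],
    predictions: Map[Long, Double]
): Seq[(String, Double)] =
  numbered.flatMap { case (rowNum, rec) =>
    predictions.get(rowNum).map(p => (rec.uniqueId, p))
  }

// Made-up data standing in for DF3.
val df3 = Seq(
  Record("cust-A", Vector(1.0, 2.0, 3.0, 4.0)),
  Record("cust-B", Vector(0.5, 0.1, 0.0, 2.0)),
  Record("cust-C", Vector(9.0, 8.0, 7.0, 6.0))
)
val numbered = withRowNumbers(df3)

// Made-up predictions standing in for DF5, deliberately out of order.
val df5 = Map(3L -> 0.9, 1L -> 0.2, 2L -> 0.4)
val joined = attachPredictions(numbered, df5)
```

On a Spark RDD, zipWithIndex could play the same role as the collection method used here (it is also 0-based), provided the feature matrix sent to GLM-predict is built from the same numbered rows so that the ordering cannot drift.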

Regards,
Sourav




On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <npan...@us.ibm.com> wrote:

> Hi Sourav,
>
> 1. In the GLM-predict.dml I could see 'means' is the output variable. In my
> understanding it is same as the probability matrix u have mentioned in your
> mail (to be used to compute the prediction). Am I right ?
> Yes, that's correct.
>
> 2. From GLM.dml I get the 'betas' as output using
> outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLM-predict.dml
> as B.
>
> Can you try this ?
> // Get output from GLM
> val beta = outputs.getBinaryBlockedRDD("beta_out")
> // This way you don't have to worry about dimensions.
> val betaMC = outputs.getMatrixCharacteristics("beta_out")
> // -----------------------------------------
> val Xin = DataFrame/RDD of values (or even text/csv file) you want to
> predict
> // -----------------------------------------
> // Execute GLM-predict
> ml.reset()
> // Please read https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> // family of distribution ?
> val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
> ml.registerInput("X", Xin)
> ml.registerInput("B_full", beta, betaMC)
> ml.registerOutput("means")
> val outputsPredict =
> ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> cmdLineParamsPredict)
> val prob = outputsPredict.getBinaryBlockedRDD("means")
> val probMC = outputsPredict.getMatrixCharacteristics("means")
> // -----------------------------------------
> // Get predicted label
> ml.reset()
> ml.registerInput("Prob",prob, probMC)
> ml.registerOutput("Prediction")
> val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> + "Prediction = rowIndexMax(Prob); "
> + "write(Prediction, \"tempOut\", \"csv\")")
> val pred = outputsLabels.getDF(sqlContext,
> "Prediction").withColumnRenamed("C1", "prediction")
> // -----------------------------------------
>
>
> 3. Say I get back prediction matrix as an output (from predictions =
> rowIndexMax(means);). Now can I add that as a column to my original
> data frame (the one from which I created the feature vector for the
> original model) ? My concern is whether adding back will ensure the right
> order so that the key for the feature vector and the predicted value remain
> same ? If not how to achieve the same ?
> In above example 'pred' is a DataFrame with column 'ID' which provides the
> row ID.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Sourav Mazumder <sourav.mazumde...@gmail.com>
> To: dev@systemml.incubator.apache.org, Niketan Pansare/Almaden/IBM@IBMUS
> Date: 12/08/2015 10:53 PM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Niketan,
>
> Thanks again for the detailed inputs.
>
> Some more follow up Qs -
>
> 1. In the GLM-predict.dml I could see 'means' is the output variable. In my
> understanding it is same as the probability matrix u have mentioned in your
> mail (to be used to compute the prediction). Am I right ?
>
> 2. From GLM.dml I get the 'betas' as output using
> outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLM-predict.dml
> as B. For registering B following statements are used
> val beta = outputs.getBinaryBlockedRDD("beta_out")
> ml.registerInput("B", beta, 1, 4) // I have four feature vectors so I get 4
> coefficients
>
> However, when I execute GLM-predict.dml I get following error.
>
> val outputs =
> ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> cmdLineParams)
>
> 15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
> 15/12/09 05:32:47 ERROR Expression: ERROR:
> /home/system-ml-0.9.0-SNAPSHOT/algori
> thms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete
> dimensio
> n information in read statement:  .mtd
> com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR:
> /home/syste
> m-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 --
> Miss
> ing or incomplete dimension information in read statement:  .mtd
>
> In line 117 we have following statement : X = read (fileX);
>
> 3. Say I get back prediction matrix as an output (from predictions =
> rowIndexMax(means);). Now can I add that as a column to my original
> data frame (the one from which I created the feature vector for the
> original model) ? My concern is whether adding back will ensure the right
> order so that the key for the feature vector and the predicted value remain
> same ? If not how to achieve the same ?
>
> Regards,
> Sourav
>
>
>
>
>
> On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <npan...@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > For some reason, I didn't get your email of Tue, 08 Dec 2015 12:56:38
> > -0800 (which I noticed in the archive:
> > https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208).
> >
> > >> Not sure how exactly I can modify the GLM-predict.dml to get some
> > prediction to start with.
> > There are two options here:
> > 1. Modify GLM-predict.dml as suggested by Shirish (better approach with
> > respect to the SystemML optimizer) or
> >
> > 2. Run a new script on the output of GLM-predict. Please see:
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > If you choose to go with option 2, you might also want to read the
> > documentation of the following two built-in functions:
> > a. rowIndexMax (see
> > http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions)
> > b. ppred
> >
> > >> Can you give me some idea how from here I can calculate the predicted
> > value of the label using some value of probability threshold ?
> > Very simple way to predict the label given probability matrix:
> > Prediction = rowIndexMax(Prob) # predicts the label with highest
> > probability. This assumes one-based labels.
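
To make the behaviour described above concrete, here is a minimal plain-Scala sketch of the same idea: for each row of a probability matrix, return the 1-based index of the column holding the largest value. It is only an illustration and not the SystemML implementation. The function below is a hypothetical local helper that happens to share the name rowIndexMax, the sample matrix is made up, and the tie handling (first maximum wins) is an assumption.

```scala
// For each row, return the 1-based column index of the largest value.
// ASSUMPTIONS: every row is non-empty, and ties go to the first maximum.
def rowIndexMax(prob: Seq[Seq[Double]]): Seq[Int] =
  prob.map(row => row.indexOf(row.max) + 1)

// Made-up probability matrix: one row per record, one column per label.
val prob = Seq(
  Seq(0.7, 0.3),
  Seq(0.2, 0.8),
  Seq(0.1, 0.9)
)
val prediction = rowIndexMax(prob)
```

Because the returned indices start at 1, the labels used during training would also need to be one-based for the result to be read directly as a label.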
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> >
> > From: Shirish Tatikonda <shirish.tatiko...@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/08/2015 12:49 PM
> > Subject: Re: Using GLM-predict
> > ------------------------------
> >
> >
> >
> > Hi Sourav,
> >
> > Yes, GLM-predict.dml gives out only the probabilities. You can put a
> > threshold on the resulting probabilities to get the actual class labels;
> > for example, prob > 0.5 is positive and prob <= 0.5 is negative.
> >
> > The exact value of threshold typically depends on the data and the
> > application. Different thresholds yield different classifiers with
> > different performance (precision, recall, etc.). You can find the best
> > threshold for the given data set by finding a value that gives the
> > desired classifier performance (for example, a threshold that gives
> > roughly equal precision and recall). Such an optimization is obviously
> > done during the training phase using a held out test set.
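
The thresholding idea above can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not part of GLM-predict.dml: the helper names (applyThreshold, precisionRecall, balancedThreshold), the candidate thresholds and the held-out data are all hypothetical, and the probabilities are taken to be those of the positive class.

```scala
// Turn positive-class probabilities into 0/1 labels: prob > threshold is positive.
def applyThreshold(probs: Seq[Double], threshold: Double): Seq[Int] =
  probs.map(p => if (p > threshold) 1 else 0)

// Precision and recall of predicted labels against true labels (1 = positive).
// Both are reported as 0.0 when their denominator is zero.
def precisionRecall(pred: Seq[Int], truth: Seq[Int]): (Double, Double) = {
  val pairs = pred.zip(truth)
  val tp = pairs.count { case (p, t) => p == 1 && t == 1 }
  val fp = pairs.count { case (p, t) => p == 1 && t == 0 }
  val fn = pairs.count { case (p, t) => p == 0 && t == 1 }
  val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
  val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
  (precision, recall)
}

// Pick the candidate threshold whose precision and recall are closest together.
def balancedThreshold(
    probs: Seq[Double],
    truth: Seq[Int],
    candidates: Seq[Double]
): Double =
  candidates.minBy { t =>
    val (p, r) = precisionRecall(applyThreshold(probs, t), truth)
    math.abs(p - r)
  }

// Made-up held-out probabilities and true labels.
val heldOutProbs = Seq(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
val heldOutTruth = Seq(1, 1, 0, 1, 0, 0)
val best = balancedThreshold(heldOutProbs, heldOutTruth, Seq(0.2, 0.5, 0.7))
```

Note that "closest precision and recall" is only one possible criterion; a threshold at which both are zero would also count as balanced, so in practice one would likely add a minimum-quality constraint or use a different target.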
> >
> > If you wish, you can also modify the DML script to perform this entire
> > process.
> >
> > Shirish
> >
> >
> > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > sourav.mazumde...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I have used GLM.dml to create a model using some sample data. It
> > > returns to me the matrix of Beta, B.
> > >
> > > Now I want to use this matrix of Beta on a new set of data points and
> > > generate predicted value of the dependent variable/observation.
> > >
> > > When I checked GLM-predict, I could see that one can pass feature
> > > vector for the new data set and also the matrix of beta.
> > >
> > > But I could not see any way to get the predicted value of the dependent
> > > variable/observation. The output parameter only supports matrix of
> > > predicted means/probabilities.
> > >
> > > Is there a way one can get the predicted value of the dependent
> > > variable/observation from GLM-predict ?
> > >
> > > Regards,
> > > Sourav
> > >
> >
> >
> >
>
>
>
