Thanks a lot Niketan.

It worked. Finally I could create something end to end.

A couple of suggestions:

1. If the ID-related part could be made transparent to end users/data
scientists (that is, handled internally by SystemML), it would be very
helpful for them.
2. API documentation for MLContext is needed soon, so that people can
understand the nuances of passing parameters to, and getting output
from, a DML script.
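
To make suggestion 1 a bit more concrete, here is a minimal plain-Scala sketch of the bookkeeping I would like to see hidden from the user. It does not use Spark or SystemML at all; IdMappingSketch, withRowIds and joinPredictions are hypothetical names of my own, and the sample keys and values are made up. It only illustrates the idea discussed in this thread: attach a one-based row ID, then join the predictions back by that ID rather than by position.

```scala
// Hypothetical helpers, not part of SystemML: plain-Scala illustration only.
object IdMappingSketch {
  // Attach a one-based row ID to each record, in iteration order
  // (the same numbering that zipWithIndex plus one would give).
  def withRowIds[A](rows: Seq[A]): Seq[(Long, A)] =
    rows.zipWithIndex.map { case (row, idx) => (idx + 1L, row) }

  // Join predictions back to the original keys by row ID, so the result
  // does not depend on the order in which the predictions arrive.
  def joinPredictions(keyed: Seq[(Long, String)],
                      predictions: Seq[(Long, Double)]): Map[String, Double] = {
    val predById = predictions.toMap
    keyed.flatMap { case (id, key) => predById.get(id).map(p => key -> p) }.toMap
  }

  def main(args: Array[String]): Unit = {
    val uniqueIds = Seq("house-a", "house-b", "house-c") // made-up string ids
    val keyed = withRowIds(uniqueIds)
    // Predictions deliberately arrive in a different order.
    val predictions = Seq((3L, 310.0), (1L, 120.5), (2L, 98.0))
    println(joinPredictions(keyed, predictions))
  }
}
```

If something like this happened inside registerInput and getDF, the user would only ever see the original unique id next to the predicted value.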

Regards,
Sourav



On Thu, Dec 10, 2015 at 10:47 AM, Niketan Pansare <npan...@us.ibm.com>
wrote:

> Hi Sourav,
>
> >> The first thing I noticed is that there are no .tar files for the
> >> distribution in the target folder (like
> >> system-ml-0.9.0-SNAPSHOT-distrib.tar.gz). These were created previously
> >> when I downloaded the previous version from GitHub.
> We added Maven profiles in the commit
> https://github.com/apache/incubator-systemml/commit/3cfb0fb0ada7e6556a74500b33b53508c0309751.
> Please see the email thread regarding this change:
> https://www.mail-archive.com/dev@systemml.incubator.apache.org/msg00059.html
>
> >> But with that I started getting a problem with the package name. I
> >> could finally run things after changing the package structure to
> >> org.apache.sysml. Please update the documentation accordingly.
> The package renaming was done in the commit
> https://github.com/apache/incubator-systemml/commit/276d9257c08e667bc70ce49024c6450deb473b43.
> This was discussed in the email thread
> https://www.mail-archive.com/dev%40systemml.incubator.apache.org/msg00049.html.
> The documentation was updated in the commit
> https://github.com/apache/incubator-systemml/commit/7cd7dc2be83ea73c700b2bebe50e4f37bd275974.
> If we have missed anything, please let us know.
>
> Please feel free to reply back to the above email threads with
> suggestions/criticism.
>
> >> However, when I tried running GLM-predict after adding a new column
> >> as ID, GLM-predict started failing.
> One possible reason for the error is that you added "ID" to the
> DataFrame but did not inform SystemML that ID was inserted. To do that,
> please replace ml.registerInput("X", predDfIn) with
> ml.registerInput("X", predDfIn, true).
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Sourav Mazumder <sourav.mazumde...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/10/2015 08:00 AM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
>
> Hi Niketan,
>
> Thanks for the explanation.
>
> While trying out the new build from GitHub I'm facing an issue.
>
> I downloaded the zip from GitHub and rebuilt the package using 'mvn clean
> package'.
>
> The first thing I noticed is that there are no .tar files for the
> distribution in the target folder (like
> system-ml-0.9.0-SNAPSHOT-distrib.tar.gz). These were created previously
> when I downloaded the previous version from GitHub. However, I tried
> system-ml-0.9.0-SNAPSHOT.jar. But with that I started getting a problem
> with the package name. I could finally run things after changing the
> package structure to org.apache.sysml. Please update the documentation
> accordingly.
>
> However, when I tried running GLM-predict after adding a new column as ID,
> GLM-predict started failing.
>
> Here is the code I'm executing -
>
> val beta = outputs.getBinaryBlockedRDD("beta_out")
> val betaMC = outputs.getMatrixCharacteristics("beta_out")
>
> val Xin = sqlContext.sql("select Res_Area, Bldg_Area, Lot_Area, Bldg_Age from modeldf")
>
> val predDfIn = RDDConverterUtils.addIDToDataFrame(Xin, sqlContext, "ID")
>
> val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ")
> ml.registerInput("X", predDfIn)
> ml.registerInput("B_full", beta, betaMC)
> ml.registerOutput("means")
>
> val outputsPredict = ml.execute(
>   "/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
>   cmdLineParamsPredict)
>
> The error is -
>
> org.apache.sysml.runtime.DMLRuntimeException:
> org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in
> program block generated from statement block between lines 122 and 123 --
> Error evaluating instruction:
>
> CP°rangeReIndex°B_full·MATRIX·DOUBLE°1·SCALAR·INT·true°5·SCALAR·INT·true°1·SCALAR·INT·true°1·SCALAR·INT·true°_mVar10563·MATRIX·DOUBLE
>   at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:153)
>   at org.apache.sysml.api.MLContext.executeUsingSimplifiedCompilationChain(MLContext.java:1337)
>   at org.apache.sysml.api.MLContext.compileAndExecuteScript(MLContext.java:1203)
>   at org.apache.sysml.api.MLContext.compileAndExecuteScript(MLContext.java:1149)
>   at org.apache.sysml.api.MLContext.execute(MLContext.java:631)
>   at org.apache.sysml.api.MLContext.execute(MLContext.java:666)
>   at org.apache.sysml.api.MLContext.execute(MLContext.java:679)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:50)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:52)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:54)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:56)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:58)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:60)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:62)
>   at $iwC$$iwC$$iwC.<init>(<console>:64)
>   at $iwC$$iwC.<init>(<console>:66)
>   at $iwC.<init>(<console>:68)
>   at <init>(<console>:70)
>   at .<init>(<console>:74)
>   at .<clinit>(<console>)
>   at .<init>(<console>:7)
>   at .<clinit>(<console>)
>   at $print(<console>)
>
> Regards,
> Sourav
>
> On Wed, Dec 9, 2015 at 9:56 PM, Niketan Pansare <npan...@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > There are two possible options here:
> > 1. If "unique_id" is a one-based integer column: in this case, please
> > rename the "unique_id" column to ID and use the
> > registerInput("X", DF1, true) method.
> >
> > 2. If "unique_id" is anything else (for example, a String), then there is
> > no trivial way for SystemML to correlate a string-based unique id with
> > the row index (which is required to interpret a DataFrame as a matrix).
> > This means you have to explicitly add the column ID to DF1:
> > val dataset = RDDConverterUtilsExt.addIDToDataFrame(DF1, sqlContext, "ID")
> >
> > When you get DF5 from GLM-predict.dml, you can use the following two
> > lines of code, which guarantee the correct mapping:
> > // Note: DF5 already contains a column ID which specifies the row index.
> > val DF5 = outNew.getDF(sqlContext, "outPred").withColumnRenamed("C1", "prediction")
> > val output = dataset.join(DF5, dataset.col("ID").equalTo(DF5.col("ID")))
> >
> > Note: once a DataFrame is passed to SystemML via registerInput, SystemML
> > first converts the DataFrame into binary block format (i.e.
> > JavaPairRDD<MatrixIndexes, MatrixBlock>) and executes GLM-predict.dml
> > using the binary block. After execution, the output is present in MLOutput
> > (https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/MLOutput.java#L89)
> > in binary block format. If the user chooses to, he/she may call
> > getDF(...), which does the binary block to DataFrame conversion.
> >
> > For DataFrame to binary block conversion, see
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L277
> > ... the ordering is specified by zipWithIndex (which is also used by
> > RDDConverterUtilsExt.addIDToDataFrame).
> > For binary block to DataFrame conversion, see
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L364
> > ... the ordering is specified by the internal binary block format, and
> > hence we append an extra column ID to specify this ordering.
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> > From: Sourav Mazumder <sourav.mazumde...@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/09/2015 06:20 PM
> > Subject: Re: Using GLM-predict
> > ------------------------------
> >
> >
> >
> > Hi Niketan,
> >
> > Thanks again for such a detailed explanation. I see your last point and
> > am in agreement with it. I also got your point on the use of "means" for
> > Gaussian vs. other distributions.
> >
> > However, I'm still not convinced about the approach you mentioned for
> > correlating the unique id. I've already tried code similar to what you
> > sent, where I used the VectorAssembler utility of Spark MLlib.
> >
> > Let me try to explain the problem in more detail -
> >
> > 1. Say my original data frame DF1 is distributed across 3 slave nodes in
> > a Spark cluster. Each has, say, 20 rows, for a total of 60 rows. DF1 also
> > has a unique identifier column, say unique_id.
> > 2. Now I use your code to create the feature vector from DF1 and pass it
> > to GLM-predict. GLM-predict in turn returns another data frame (say DF5)
> > of "means" (in this case, say, the prediction). However, the rows of DF5
> > may be distributed across 4 slave nodes, each having, say, 15 rows, for a
> > total of 60 rows.
> > 3. Now if I just add this new data frame (DF5) as two additional columns
> > to DF1, where is the guarantee that for a specific unique_id of DF1 I'm
> > getting the right mean/predicted value corresponding to that unique_id?
> >
> > Regards,
> > Sourav
> >
> >
> >
> > On Wed, Dec 9, 2015 at 4:14 PM, Niketan Pansare <npan...@us.ibm.com>
> > wrote:
> >
> > > Hi Sourav,
> > >
> > > Please see below comments:
> > >
> > > >> I was basically hoping for some sort of API where one can pass the
> > > >> original data frame and specify from it the columns to be used as
> > > >> features and the column to be used as the label. This model can work
> > > >> well for both creating the model and getting the prediction.
> > > Please use the most recent jar from git. To extract X and Y from your
> > > DataFrame without IDs, use the following code:
> > > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > > import org.apache.sysml.runtime.matrix.MatrixCharacteristics
> > > val features = Array("lat", "height", "precipitation", "pressure")
> > > // SystemML will set the dimensions for you if they are unknown
> > > val Xmc = new MatrixCharacteristics()
> > > val Ymc = new MatrixCharacteristics()
> > > val X = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Xmc, features)
> > > val Y = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Ymc, Array("temperature"))
> > >
> > > If you want to add a specific ordering to your DataFrame rows (let's say
> > > for prediction ... in most cases it is not required), use the following
> > > method:
> > > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > > val dfWithID = RDDConverterUtilsExt.addIDToDataFrame(df, sqlContext, "ID")
> > >
> > > >> 1. Yes, dependent variables are nothing but labels.
> > > >> 2. The values of the dependent variable are not 1 to
> > > >> totalNumOfClasses. The values can be any double. For example, say in
> > > >> a weather data set you have fields like lat, long, height (from sea
> > > >> level), precipitation, pressure, temperature. Now you can create a
> > > >> model where temperature is the dependent variable and the others are
> > > >> features (the hypothesis is that temperature is some function of
> > > >> pressure, precipitation, height, latitude and longitude).
> > > Sorry, in this case please ignore my earlier suggestion of
> > > "Prediction = rowIndexMax(Prob)", as it applies only to classification.
> > > In your case, the returned values are the "means" of the distribution
> > > family that was used (see
> > > http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models).
> > > If a Gaussian distribution was used (dfam=1, vpow=0.0), the problem was
> > > linear, and you expected a pointy-hat distribution (i.e. positive
> > > kurtosis), then you can simply return the mean as the predicted label.
> > > This is because for a Gaussian distribution the mean is also the mode.
> > > In other cases, that might not necessarily be true.
> > >
> > > You may ask why we are making it so complicated, and why not just return
> > > the predicted labels instead of probabilities?
> > > Well, the problem of labelling is not as simple as it appears, and it
> > > highly depends on the problem setting. Let's consider the problem of
> > > multi-class classification and my earlier suggestion "Prediction =
> > > rowIndexMax(Prob)". Let the labels be {cancer, sore throat, birth
> > > defect, fever, normal}, and say that for a given test example
> > > GLM-predict.dml outputs the following probabilities: {cancer: 0.2, sore
> > > throat: 0.15, birth defect: 0.15, fever: 0.2, normal: 0.3}. Then
> > > according to "Prediction = rowIndexMax(Prob)", we should output the
> > > label "normal" and send the patient home ... right? No. In this case, a
> > > 20% probability of cancer is just way too high for a doctor to send the
> > > patient home. In this setting, the doctor might say to the data
> > > scientist: based on my domain knowledge of the prevalence of cancer in
> > > the general public, I suggest that any probability of cancer over
> > > "threshold" should always be flagged as cancer; otherwise, output the
> > > label with the highest probability. Using this suggestion, the data
> > > scientist modifies the DML as follows:
> > > zeroOneMat = ppred(prob[,cancerColID], threshold, ">")
> > > prediction = zeroOneMat*cancerColID + (1-zeroOneMat)*rowIndexMax(prob)
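
For what it is worth, the same rule applied to a single row of probabilities can be sketched in plain Scala. This is only an illustration and not SystemML code: ThresholdSketch, rowIndexMax and predict are hypothetical names, the column index and threshold passed in are whatever the caller chooses, and breaking ties in favour of the first maximum is a choice of this sketch rather than a statement about the DML built-in.

```scala
// Hypothetical plain-Scala illustration of the thresholded prediction rule.
object ThresholdSketch {
  // One-based index of the largest entry in a single row of probabilities.
  // Ties go to the first maximum (an assumption of this sketch).
  def rowIndexMax(probs: Array[Double]): Int = probs.indexOf(probs.max) + 1

  // If the probability in column cancerColID (one-based) exceeds the
  // threshold, flag that class; otherwise fall back to rowIndexMax.
  def predict(probs: Array[Double], cancerColID: Int, threshold: Double): Int =
    if (probs(cancerColID - 1) > threshold) cancerColID else rowIndexMax(probs)

  def main(args: Array[String]): Unit = {
    // {cancer, sore throat, birth defect, fever, normal} from the example above
    val probs = Array(0.2, 0.15, 0.15, 0.2, 0.3)
    println(rowIndexMax(probs))       // plain rowIndexMax
    println(predict(probs, 1, 0.1))   // with a made-up threshold of 0.1
  }
}
```

With the probabilities from the example, plain rowIndexMax picks column 5 ("normal"), while a cancer threshold below 0.2 flags column 1 instead.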
> > >
> > > This also shows the usefulness of "Declarative Machine Learning" :)
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > >
> > > From: Sourav Mazumder <sourav.mazumde...@gmail.com>
> > > To: dev@systemml.incubator.apache.org
> > > Date: 12/09/2015 01:15 PM
> > > Subject: Re: Using GLM-predict
> > > ------------------------------
> > >
> > >
> > >
> > > Hi Niketan,
> > >
> > > Firstly, to answer your questions -
> > >
> > > 1. Yes, dependent variables are nothing but labels.
> > > 2. The values of the dependent variable are not 1 to totalNumOfClasses.
> > > The values can be any double. For example, say in a weather data set you
> > > have fields like lat, long, height (from sea level), precipitation,
> > > pressure, temperature. Now you can create a model where temperature is
> > > the dependent variable and the others are features (the hypothesis is
> > > that temperature is some function of pressure, precipitation, height,
> > > latitude and longitude).
> > >
> > > I'm not sure about the correlation between step 2 and step 3 in your
> > > mail. In step 3, does one have to pass the 'ID' column (created in step
> > > 2) as varName while calling registerInput(String varName, DataFrame df,
> > > containsID)?
> > >
> > > However, the unique id in the typical case can be a string. Can't that
> > > be used as is instead? Otherwise one has to first convert the original
> > > unique id to an integer to create an additional unique id column, and
> > > later on that integer unique id has to be mapped back.
> > >
> > > I was basically hoping for some sort of API where one can pass the
> > > original data frame and specify from it the columns to be used as
> > > features and the column to be used as the label. This model can work
> > > well for both creating the model and getting the prediction.
> > >
> > > Regards,
> > > Sourav
> > >
> > > On Wed, Dec 9, 2015 at 12:53 PM, Niketan Pansare <npan...@us.ibm.com>
> > > wrote:
> > >
> > > > Hi Sourav,
> > > >
> > > > A couple of questions to make sure we are on the same page: does the
> > > > "dependent variable (double)" represent the class labels? Are the
> > > > values of the class labels from 1 to numClasses (i.e. one-based)?
> > > >
> > > > Here are a few comments regarding correlating IDs:
> > > >
> > > > To convert an unordered collection (i.e. a DataFrame) into an ordered
> > > > collection (a "Matrix"), we add a special column "ID" which represents
> > > > the one-based row index. Please perform the following steps:
> > > > 1. Accept recent changes from
> > > > https://github.com/apache/incubator-systemml and use the generated jar.
> > > >
> > > > 2. Map the unique id in DF1 to an int (1 to number of rows) and call
> > > > that column 'ID'.
> > > >
> > > > 3. Use the following variant of registerInput for both X (both for
> > > > training and predicting) and Y:
> > > > registerInput(String varName, DataFrame df, boolean containsID)
> > > >
> > > > As a side note: instead of separate double columns, you can represent
> > > > them using VectorUDT and use our converter
> > > > "JavaPairRDD<MatrixIndexes, MatrixBlock>
> > > > vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame inputDF,
> > > > MatrixCharacteristics mcOut, boolean containsID, String
> > > > vectorColumnName)"
> > > >
> > > > Thanks,
> > > >
> > > > Niketan Pansare
> > > > IBM Almaden Research Center
> > > > E-mail: npansar At us.ibm.com
> > > >
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > >
> > > > From: Sourav Mazumder <sourav.mazumde...@gmail.com>
> > > > To: dev@systemml.incubator.apache.org
> > > > Date: 12/09/2015 11:15 AM
> > > > Subject: Re: Using GLM-predict
> > > > ------------------------------
> > > >
> > > >
> > > >
> > > > Hi Niketan,
> > > >
> > > > The code you provided works fine. The use of getMatrixCharacteristics
> > > > solves the basic execution problem.
> > > >
> > > > However, question #3 is probably not yet resolved. Let me explain the
> > > > use case scenario I'm trying to build.
> > > >
> > > > 1. Say I have a data frame (DF1) with a unique id (string), a bunch of
> > > > columns (say 4) which are to be used as features (double), and a
> > > > column for the dependent variable (double).
> > > > 2. When I created the model, I created a data frame (DF2) from DF1
> > > > using only the feature columns and passed that as X. The column with
> > > > the dependent value was passed as Y.
> > > > 3. For calling GLM-predict I'm using another data frame (DF3) of the
> > > > same structure but with different unique ids (essentially different
> > > > records/rows). From that data frame I first create another data frame
> > > > (DF4) containing only the columns representing the features. Then I
> > > > send DF4 to GLM-predict.
> > > > 4. The response I get from GLM-predict is the 'means'. Then I use the
> > > > inline predict script, which returns another data frame (DF5) with ID
> > > > and predicted values.
> > > >
> > > > The question is: how do I correlate the ID I'm getting from DF5 with
> > > > the unique id of the data frame DF3?
> > > >
> > > > Regards,
> > > > Sourav
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <npan...@us.ibm.com>
> > > > wrote:
> > > >
> > > > > Hi Sourav,
> > > > >
> > > > > >> 1. In the GLM-predict.dml I could see 'means' is the output
> > > > > >> variable. In my understanding it is the same as the probability
> > > > > >> matrix you mentioned in your mail (to be used to compute the
> > > > > >> prediction). Am I right?
> > > > > Yes, that's correct.
> > > > >
> > > > > >> 2. From GLM.dml I get the 'betas' as output using
> > > > > >> outputs.getBinaryBlockedRDD("beta_out"). I pass the same to
> > > > > >> GLM-predict.dml as B.
> > > > >
> > > > > Can you try this?
> > > > > // Get output from GLM
> > > > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > > > // This way you don't have to worry about dimensions.
> > > > > val betaMC = outputs.getMatrixCharacteristics("beta_out")
> > > > > // -----------------------------------------
> > > > > val Xin = ... // DataFrame/RDD of values (or even a text/csv file) you want to predict
> > > > > // -----------------------------------------
> > > > > // Execute GLM-predict
> > > > > ml.reset()
> > > > > // Please read
> > > > > // https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> > > > > // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> > > > > // "dfam": family of distribution?
> > > > > val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
> > > > > ml.registerInput("X", Xin)
> > > > > ml.registerInput("B_full", beta, betaMC)
> > > > > ml.registerOutput("means")
> > > > > val outputsPredict = ml.execute(
> > > > >   "/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > > > >   cmdLineParamsPredict)
> > > > > val prob = outputsPredict.getBinaryBlockedRDD("means")
> > > > > val probMC = outputsPredict.getMatrixCharacteristics("means")
> > > > > // -----------------------------------------
> > > > > // Get predicted label
> > > > > ml.reset()
> > > > > ml.registerInput("Prob", prob, probMC)
> > > > > ml.registerOutput("Prediction")
> > > > > val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> > > > >   + "Prediction = rowIndexMax(Prob); "
> > > > >   + "write(Prediction, \"tempOut\", \"csv\")")
> > > > > val pred = outputsLabels.getDF(sqlContext, "Prediction").withColumnRenamed("C1", "prediction")
> > > > > // -----------------------------------------
> > > > >
> > > > >
> > > > > >> 3. Say I get back the prediction matrix as an output (from
> > > > > >> predictions = rowIndexMax(means);). Can I then add that as a
> > > > > >> column to my original data frame (the one from which I created
> > > > > >> the feature vector for the original model)? My concern is whether
> > > > > >> adding it back will preserve the right order, so that the key for
> > > > > >> the feature vector and the predicted value remain matched. If
> > > > > >> not, how can I achieve that?
> > > > > In the above example, 'pred' is a DataFrame with a column 'ID' which
> > > > > provides the row ID.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Niketan Pansare
> > > > > IBM Almaden Research Center
> > > > > E-mail: npansar At us.ibm.com
> > > > >
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > > >
> > > > > From: Sourav Mazumder <sourav.mazumde...@gmail.com>
> > > > > To: dev@systemml.incubator.apache.org, Niketan
> > > Pansare/Almaden/IBM@IBMUS
> > > > > Date: 12/08/2015 10:53 PM
> > > > > Subject: Re: Using GLM-predict
> > > > > ------------------------------
> > > > >
> > > > >
> > > > >
> > > > > Hi Niketan,
> > > > >
> > > > > Thanks again for the detailed inputs.
> > > > >
> > > > > Some more follow up Qs -
> > > > >
> > > > > 1. In the GLM-predict.dml I could see 'means' is the output variable.
> > > > > In my understanding it is the same as the probability matrix you
> > > > > mentioned in your mail (to be used to compute the prediction). Am I
> > > > > right?
> > > > >
> > > > > 2. From GLM.dml I get the 'betas' as output using
> > > > > outputs.getBinaryBlockedRDD("beta_out"). I pass the same to
> > > > > GLM-predict.dml as B. The following statements are used to register B:
> > > > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > > > // I have four feature vectors, so I get 4 coefficients
> > > > > ml.registerInput("B", beta, 1, 4)
> > > > >
> > > > > However, when I execute GLM-predict.dml I get the following error.
> > > > >
> > > > > val outputs = ml.execute(
> > > > >   "/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > > > >   cmdLineParams)
> > > > >
> > > > > 15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
> > > > > 15/12/09 05:32:47 ERROR Expression: ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > > > > com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > > > >
> > > > > In line 117 we have the following statement: X = read (fileX);
> > > > >
> > > > > 3. Say I get back the prediction matrix as an output (from
> > > > > predictions = rowIndexMax(means);). Can I then add that as a column
> > > > > to my original data frame (the one from which I created the feature
> > > > > vector for the original model)? My concern is whether adding it back
> > > > > will preserve the right order, so that the key for the feature vector
> > > > > and the predicted value remain matched. If not, how can I achieve
> > > > > that?
> > > > >
> > > > > Regards,
> > > > > Sourav
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <
> npan...@us.ibm.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Sourav,
> > > > > >
> > > > > > For some reason, I didn't get your email of "Tue, 08 Dec 2015
> > > > > > 12:56:38 -0800" (which I noticed in the archive:
> > > > > > https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208).
> > > > > >
> > > > > > >> Not sure how exactly I can modify the GLM-predict.dml to get
> > > > > > >> some prediction to start with.
> > > > > > There are two options here:
> > > > > > 1. Modify GLM-predict.dml as suggested by Shirish (the better
> > > > > > approach with respect to the SystemML optimizer), or
> > > > > >
> > > > > > 2. Run a new script on the output of GLM-predict. Please see:
> > > > > > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > > > > > If you choose to go with option 2, you might also want to read the
> > > > > > documentation of the following two built-in functions:
> > > > > > a. rowIndexMax (see
> > > > > > http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions)
> > > > > > b. ppred
> > > > > >
> > > > > > >> Can you give me some idea how from here I can calculate the
> > > > > > >> predicted value of the label using some value of probability
> > > > > > >> threshold?
> > > > > > A very simple way to predict the label given the probability matrix:
> > > > > > # Predicts the label with the highest probability.
> > > > > > # This assumes one-based labels.
> > > > > > Prediction = rowIndexMax(Prob)
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Niketan Pansare
> > > > > > IBM Almaden Research Center
> > > > > > E-mail: npansar At us.ibm.com
> > > > > >
> > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > > > >
> > > > > > From: Shirish Tatikonda <shirish.tatiko...@gmail.com>
> > > > > > To: dev@systemml.incubator.apache.org
> > > > > > Date: 12/08/2015 12:49 PM
> > > > > > Subject: Re: Using GLM-predict
> > > > > > ------------------------------
> > > > > >
> > > > > >
> > > > > >
> > > > > > Hi Sourav,
> > > > > >
> > > > > > Yes, GLM-predict.dml gives out only the probabilities. You can put a
> > > > > > threshold on the resulting probabilities to get the actual class
> > > > > > labels -- for example, prob > 0.5 is positive and <= 0.5 is negative.
> > > > > >
> > > > > > The exact value of the threshold typically depends on the data and
> > > > > > the application. Different thresholds yield different classifiers
> > > > > > with different performance (precision, recall, etc.). You can find
> > > > > > the best threshold for the given data set by finding a value that
> > > > > > gives the desired classifier performance (for example, a threshold
> > > > > > that gives roughly equal precision and recall). Such an optimization
> > > > > > is obviously done during the training phase using a held-out test
> > > > > > set.
> > > > > >
> > > > > > If you wish, you can also modify the DML script to perform this
> > > > > > entire process.
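
As a rough plain-Scala sketch of the threshold search described above (an illustration only, not a DML script and not anything from SystemML): ThresholdSearchSketch, precisionRecall and balancedThreshold are hypothetical names, the candidate thresholds are supplied by the caller and must be non-empty, and skipping thresholds that yield zero precision or recall is an assumption of this sketch so that the degenerate "predict nothing" case is not selected.

```scala
// Hypothetical plain-Scala illustration of picking a threshold on held-out data.
object ThresholdSearchSketch {
  // Precision and recall when scores strictly above t are called positive.
  def precisionRecall(scores: Seq[Double], labels: Seq[Boolean], t: Double): (Double, Double) = {
    val predicted = scores.map(_ > t)
    val tp = predicted.zip(labels).count { case (p, y) => p && y }
    val predictedPos = predicted.count(identity)
    val actualPos = labels.count(identity)
    val precision = if (predictedPos == 0) 0.0 else tp.toDouble / predictedPos
    val recall = if (actualPos == 0) 0.0 else tp.toDouble / actualPos
    (precision, recall)
  }

  // Among the candidate thresholds, pick the one whose precision and recall
  // are closest to each other; thresholds with zero precision or recall are
  // ranked last.
  def balancedThreshold(scores: Seq[Double], labels: Seq[Boolean],
                        candidates: Seq[Double]): Double =
    candidates.minBy { t =>
      val (p, r) = precisionRecall(scores, labels, t)
      if (p == 0.0 || r == 0.0) Double.PositiveInfinity else math.abs(p - r)
    }
}
```

The scores would be the probabilities from GLM-predict on the held-out rows and the labels the known outcomes for those rows; other criteria (for example a minimum recall) could be swapped in for the balance criterion.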
> > > > > >
> > > > > > Shirish
> > > > > >
> > > > > >
> > > > > > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > > > > > sourav.mazumde...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have used GLM.dml to create a model using some sample data. It
> > > > > > > returns to me the matrix of Beta, B.
> > > > > > >
> > > > > > > Now I want to use this matrix of Beta on a new set of data points
> > > > > > > and generate the predicted value of the dependent
> > > > > > > variable/observation.
> > > > > > >
> > > > > > > When I checked GLM-predict, I could see that one can pass the
> > > > > > > feature vector for the new data set and also the matrix of beta.
> > > > > > >
> > > > > > > But I could not see any way to get the predicted value of the
> > > > > > > dependent variable/observation. The output parameter only supports
> > > > > > > a matrix of predicted means/probabilities.
> > > > > > >
> > > > > > > Is there a way one can get the predicted value of the dependent
> > > > > > > variable/observation from GLM-predict?
> > > > > > >
> > > > > > > Regards,
> > > > > > > Sourav
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> >
>
>
>
