Best range of parameters for grid search?

2016-08-24 Thread Adamantios Corais
I would like to run a naive grid-search implementation with MLlib, but 
I am a bit confused about choosing the 'best' range of parameters. 
Obviously, I do not want to waste too many resources on a combination 
of parameters that will probably not give a better model. Any suggestions 
from your experience?


val intercept   : List[Boolean]  = List(false)
val classes     : List[Int]      = List(2)
val validate    : List[Boolean]  = List(true)
val tolerance   : List[Double]   = List(0.001, 0.01, 0.1, 0.0001, 0.001, 0.01, 0.1, 1.0)
val gradient    : List[Gradient] = List(new LogisticGradient(), new LeastSquaresGradient(), new HingeGradient())

val corrections : List[Int]      = List(5, 10, 15)
val iters       : List[Int]      = List(1, 10, 100, 1000, 1)
val regparam    : List[Double]   = List(0.0, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0)
val updater     : List[Updater]  = List(new SimpleUpdater(), new L1Updater(), new SquaredL2Updater())


val combinations = for (a <- intercept;
b <- classes;
c <- validate;
d <- tolerance;
e <- gradient;
f <- corrections;
g <- iters;
h <- regparam;
i <- updater) yield (a,b,c,d,e,f,g,h,i)

for ((interceptS, classesS, validateS, toleranceS, gradientS,
      correctionsS, itersS, regParamS, updaterS) <- combinations.take(3)) {

  val lr: LogisticRegressionWithLBFGS = new LogisticRegressionWithLBFGS().
    setIntercept(addIntercept = interceptS).
    setNumClasses(numClasses = classesS).
    setValidateData(validateData = validateS)

  lr.optimizer.
    setConvergenceTol(tolerance = toleranceS).
    setGradient(gradient = gradientS).
    setNumCorrections(corrections = correctionsS).
    setNumIterations(iters = itersS).
    setRegParam(regParam = regParamS).
    setUpdater(updater = updaterS)

}
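
For what it's worth, here is a minimal sketch of how each combination could be scored so that unpromising regions of the grid can be dropped early. It assumes hypothetical `training` and `validation` RDD[LabeledPoint] splits (not shown in the code above) and uses area under ROC as the metric:

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Score one trained model on a held-out split.
def auc(model: LogisticRegressionModel, validation: RDD[LabeledPoint]): Double = {
  model.clearThreshold()  // return raw scores instead of 0/1 labels
  val scoreAndLabels = validation.map(p => (model.predict(p.features), p.label))
  new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
}

// Inside the loop above, after configuring `lr`:
//   val model = lr.run(training)
//   val score = auc(model, validation)
// and keep the (parameters, score) pair with the highest score.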





Re: Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Adamantios Corais

Great.

I like your second solution. But how can I make sure that cvModel holds 
the best model overall (as opposed to the last one that was tried out 
by the grid search)?


In addition, do you have an idea of how to collect the average error of 
each grid-search combination (here 1x1x1)?




On 12/08/2016 08:59 PM, Bryan Cutler wrote:
You will need to cast bestModel to include the MLWritable trait.  The 
class Model does not mix it in by default.  For instance:


cvModel.bestModel.asInstanceOf[MLWritable].save("/my/path")

Alternatively, you could save the CV model directly, which takes care 
of this


cvModel.save("/my/path")
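
Regarding the two follow-up questions: cvModel.bestModel is refit on the full training set with the best-scoring parameter map (not the last one tried), and the per-combination cross-validation metrics line up with the parameter grid. A minimal sketch, assuming Spark 1.6+ (where CrossValidatorModel exposes avgMetrics) and the paramGrid / cvModel values from the code below:

import org.apache.spark.ml.util.MLWritable

// The evaluator's metric (areaUnderROC by default for BinaryClassificationEvaluator),
// averaged over the folds, one entry per parameter combination in the grid.
paramGrid.zip(cvModel.avgMetrics).foreach { case (params, metric) =>
  println(s"$params -> $metric")
}

// Persisting the winner, as suggested above:
cvModel.bestModel.asInstanceOf[MLWritable].save("/my/path")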

On Fri, Aug 12, 2016 at 9:17 AM, Adamantios Corais 
<adamantios.cor...@gmail.com> wrote:


Hi,

Assuming that I have run the following pipeline and have got the
best logistic regression model, how can I then save that model for
later use? The following command throws an error:

cvModel.bestModel.save("/my/path")

Also, is it possible to get the error (a collection of) for each
combination of parameters?

I am using Spark 1.6.2.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

val lr = new LogisticRegression()

val pipeline = new Pipeline().
setStages(Array(lr))

val paramGrid = new ParamGridBuilder().
addGrid(lr.elasticNetParam , Array(0.1)).
addGrid(lr.maxIter , Array(10)).
addGrid(lr.regParam , Array(0.1)).
build()

val cv = new CrossValidator().
setEstimator(pipeline).
setEvaluator(new BinaryClassificationEvaluator).
setEstimatorParamMaps(paramGrid).
setNumFolds(2)

val cvModel = cv.
fit(training)








Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Adamantios Corais

Hi,

Assuming that I have run the following pipeline and have got the best logistic 
regression model, how can I then save that model for later use? The following 
command throws an error:

cvModel.bestModel.save("/my/path")

Also, is it possible to get the error (a collection of) for each combination of 
parameters?

I am using Spark 1.6.2.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder , CrossValidator}

val lr = new LogisticRegression()

val pipeline = new Pipeline().
setStages(Array(lr))

val paramGrid = new ParamGridBuilder().
addGrid(lr.elasticNetParam , Array(0.1)).
addGrid(lr.maxIter , Array(10)).
addGrid(lr.regParam , Array(0.1)).
build()

val cv = new CrossValidator().
setEstimator(pipeline).
setEvaluator(new BinaryClassificationEvaluator).
setEstimatorParamMaps(paramGrid).
setNumFolds(2)

val cvModel = cv.
fit(training)





Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Adamantios Corais
Thanks Sean. As far as I can see, the probabilities are NOT normalized; the
denominator isn't implemented in either v1.1.0 or v1.5.0 (by denominator,
I refer to the probability of the feature vector X). So, for a given lambda, how
do I compute the denominator? FYI:
https://github.com/apache/spark/blob/v1.5.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
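
For reference, a minimal sketch of the missing normalization step, assuming logScores(i) holds log P(class i) + log P(x | class i) (i.e. the brzPi + brzTheta * x quantity discussed below); the denominator is just the log-sum-exp of these scores:

// Plain Scala, no Breeze required; a sketch rather than MLlib API.
def posteriors(logScores: Array[Double]): Array[Double] = {
  val maxScore = logScores.max                          // subtract the max for numerical stability
  val exps = logScores.map(s => math.exp(s - maxScore))
  val logDenom = maxScore + math.log(exps.sum)          // log P(x), the normalizer
  logScores.map(s => math.exp(s - logDenom))            // posteriors, now summing to 1
}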


*// Adamantios*



On Thu, Sep 10, 2015 at 7:03 PM, Sean Owen  wrote:

> The log probabilities are unlikely to be very large, though the
> probabilities may be very small. The direct answer is to exponentiate
> brzPi + brzTheta * testData.toBreeze -- apply exp(x).
>
> I have forgotten whether the probabilities are normalized already
> though. If not you'll have to normalize to get them to sum to 1 and be
> real class probabilities. This is better done in log space though.
>
> On Thu, Sep 10, 2015 at 5:12 PM, Adamantios Corais
>  wrote:
> > great. so, provided that model.theta represents the log-probabilities and
> > (hence the result of brzPi + brzTheta * testData.toBreeze is a big number
> > too), how can I get back the non-log-probabilities which - apparently -
> are
> > bounded between 0.0 and 1.0?
> >
> >
> > // Adamantios
> >
> >
> >
> > On Tue, Sep 1, 2015 at 12:57 PM, Sean Owen  wrote:
> >>
> >> (pedantic: it's the log-probabilities)
> >>
> >> On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang 
> wrote:
> >> > Actually
> >> > brzPi + brzTheta * testData.toBreeze
> >> > is the probabilities of the input Vector on each class, however it's a
> >> > Breeze Vector.
> >> > Pay attention the index of this Vector need to map to the
> corresponding
> >> > label index.
> >> >
> >> > 2015-08-28 20:38 GMT+08:00 Adamantios Corais
> >> > :
> >> >>
> >> >> Hi,
> >> >>
> >> >> I am trying to change the following code so as to get the
> probabilities
> >> >> of
> >> >> the input Vector on each class (instead of the class itself with the
> >> >> highest
> >> >> probability). I know that this is already available as part of the
> most
> >> >> recent release of Spark but I have to use Spark 1.1.0.
> >> >>
> >> >> Any help is appreciated.
> >> >>
> >> >>> override def predict(testData: Vector): Double = {
> >> >>> labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
> >> >>>   }
> >> >>
> >> >>
> >> >>>
> >> >>>
> >> >>>
> https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
> >> >>
> >> >>
> >> >> // Adamantios
> >> >>
> >> >>
> >> >
> >
> >
>


Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Adamantios Corais
Great. So, provided that *model.theta* represents the log-probabilities (and
hence the result of *brzPi + brzTheta * testData.toBreeze* is a big number
too), how can I get back the *non-*log-probabilities, which - apparently -
are bounded between *0.0 and 1.0*?



*// Adamantios*



On Tue, Sep 1, 2015 at 12:57 PM, Sean Owen  wrote:

> (pedantic: it's the log-probabilities)
>
> On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang  wrote:
> > Actually
> > brzPi + brzTheta * testData.toBreeze
> > is the probabilities of the input Vector on each class, however it's a
> > Breeze Vector.
> > Pay attention the index of this Vector need to map to the corresponding
> > label index.
> >
> > 2015-08-28 20:38 GMT+08:00 Adamantios Corais <
> adamantios.cor...@gmail.com>:
> >>
> >> Hi,
> >>
> >> I am trying to change the following code so as to get the probabilities
> of
> >> the input Vector on each class (instead of the class itself with the
> highest
> >> probability). I know that this is already available as part of the most
> >> recent release of Spark but I have to use Spark 1.1.0.
> >>
> >> Any help is appreciated.
> >>
> >>> override def predict(testData: Vector): Double = {
> >>> labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
> >>>   }
> >>
> >>
> >>>
> >>>
> https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
> >>
> >>
> >> // Adamantios
> >>
> >>
> >
>


How to compute the probability of each class in Naive Bayes

2015-08-28 Thread Adamantios Corais
Hi,

I am trying to change the following code so as to get the probabilities of
the input Vector on each class (instead of the class itself with the
highest probability). I know that this is already available as part of the
most recent release of Spark but I have to use Spark 1.1.0.

Any help is appreciated.

override def predict(testData: Vector): Double = {
> labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
>   }


https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
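
Since Spark 1.1.0 does not expose per-class scores, here is a rough sketch of how they could be recomputed outside the model, assuming (as in the linked source) that the 1.1.0 NaiveBayesModel exposes labels, pi and theta as public fields; the scores are un-normalized log-posteriors, not probabilities:

import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.linalg.Vector

def classLogScores(model: NaiveBayesModel, testData: Vector): Array[(Double, Double)] = {
  val x = testData.toArray
  model.labels.zipWithIndex.map { case (label, i) =>
    // log P(class i) + log P(x | class i), i.e. the quantity behind brzPi + brzTheta * x
    val logScore = model.pi(i) + model.theta(i).zip(x).map { case (t, xi) => t * xi }.sum
    (label, logScore)
  }
}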



*// Adamantios*


How to determine a good set of parameters for a ML grid search task?

2015-08-28 Thread Adamantios Corais
I have a sparse dataset of size 775946 x 845372. I would like to perform a
grid search in order to tune the parameters of my LogisticRegressionWithSGD
model. I have noticed that building each model takes about 300 to
400 seconds, which means that trying all possible combinations of
parameters would take about 24 hours. Most importantly though, I am
not sure if the following combinations make sense at all. So, how should I
pick these parameters more wisely, and in a way that takes less time?

>   val numIterations = Seq(100 , 500 , 1000 , 5000 , 1 , 5 , 10 , 50)
>   val stepSizes = Seq(10 , 50 , 100 , 500 , 1000 , 5000 , 1 , 5)
>   val miniBatchFractions = Seq(1.0)
>   val updaters = Seq(new SimpleUpdater , new SquaredL2Updater , new L1Updater)


Any advice is appreciated.
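
For what it's worth, a sketch of one common way to cut such a search down: draw a modest number of random combinations (log-uniform over step size and regularization), evaluate them on a small subsample of the data first, and only refine around the best region. The training and evaluate names below are placeholders, not MLlib API:

import scala.util.Random

val rng = new Random(42)
def logUniform(min: Double, max: Double): Double =
  math.exp(math.log(min) + rng.nextDouble() * (math.log(max) - math.log(min)))

// 20 random (step size, regularization) pairs instead of a full cartesian grid.
val candidates = (1 to 20).map { _ =>
  (logUniform(1e-4, 1e2), logUniform(1e-6, 1e1))
}

// val sample = training.sample(withReplacement = false, 0.05, 42L)
// val best = candidates.map { case (step, reg) => ((step, reg), evaluate(sample, step, reg)) }
//                      .maxBy(_._2)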


*// Adamantios*


Re: How to binarize data in spark

2015-08-07 Thread Adamantios Corais
I have ended up with the following piece of code, but it turns out to be
really slow... Any other ideas, given that I can only use MLlib 1.2?

// test11: rows of (id, class, product); test12: the full list of known products
val data = test11.map(x => ((x(0), x(1)), x(2))).groupByKey().map(x =>
  (x._1, x._2.toArray)).map { x =>
  // one dense slot per known product
  var lt: Array[Double] = new Array[Double](test12.size)
  val id = x._1._1
  val cl = x._1._2
  val dt = x._2
  var i = -1
  // set 1.0 for every product this (id, class) pair has seen
  test12.foreach { y => i += 1; lt(i) = if (dt contains y) 1.0 else 0.0 }
  val vs = Vectors.dense(lt)
  (id, cl, vs)
}
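
A sketch of a sparser alternative under the same assumptions (test11 yields (id, class, product) rows and test12 is the local list of all products): building sparse vectors directly avoids allocating one dense array of test12.size doubles per key.

import org.apache.spark.mllib.linalg.Vectors

val productIndex = test12.zipWithIndex.toMap          // product -> column index
val bIndex = sc.broadcast(productIndex)

val sparseData = test11.map(x => ((x(0), x(1)), x(2)))
  .groupByKey()
  .map { case ((id, cl), products) =>
    val indices = products.map(bIndex.value).toArray.distinct.sorted
    val values  = Array.fill(indices.length)(1.0)
    (id, cl, Vectors.sparse(bIndex.value.size, indices, values))
  }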



*// Adamantios*



On Fri, Aug 7, 2015 at 8:36 AM, Yanbo Liang  wrote:

> I think you want to flatten the 1M products to a vector of 1M elements, of
> course mostly are zero.
> It looks like HashingTF
> <https://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf>
> can help you.
>
> 2015-08-07 11:02 GMT+08:00 praveen S :
>
>> Use StringIndexer in MLib1.4 :
>>
>> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/ml/feature/StringIndexer.html
>>
>> On Thu, Aug 6, 2015 at 8:49 PM, Adamantios Corais <
>> adamantios.cor...@gmail.com> wrote:
>>
>>> I have a set of data based on which I want to create a classification
>>> model. Each row has the following form:
>>>
>>> user1,class1,product1
>>>> user1,class1,product2
>>>> user1,class1,product5
>>>> user2,class1,product2
>>>> user2,class1,product5
>>>> user3,class2,product1
>>>> etc
>>>
>>>
>>> There are about 1M users, 2 classes, and 1M products. What I would like
>>> to do next is create the sparse vectors (something already supported by
>>> MLlib) BUT in order to apply that function I have to create the dense 
>>> vectors
>>> (with the 0s), first. In other words, I have to binarize my data. What's
>>> the easiest (or most elegant) way of doing that?
>>>
>>>
>>> *// Adamantios*
>>>
>>>
>>>
>>
>


How to binarize data in spark

2015-08-06 Thread Adamantios Corais
I have a set of data based on which I want to create a classification
model. Each row has the following form:

user1,class1,product1
> user1,class1,product2
> user1,class1,product5
> user2,class1,product2
> user2,class1,product5
> user3,class2,product1
> etc


There are about 1M users, 2 classes, and 1M products. What I would like to
do next is create the sparse vectors (something already supported by MLlib)
BUT in order to apply that function I have to create the dense vectors
(with the 0s), first. In other words, I have to binarize my data. What's
the easiest (or most elegant) way of doing that?


*// Adamantios*


Cannot build "learning spark" project

2015-04-06 Thread Adamantios Corais
Hi,

I am trying to build this project
https://github.com/databricks/learning-spark with mvn package. This should
work out of the box but unfortunately it doesn't. In fact, I get the
following error:

mvn pachage -X
> Apache Maven 3.0.5
> Maven home: /usr/share/maven
> Java version: 1.7.0_76, vendor: Oracle Corporation
> Java home: /usr/lib/jvm/java-7-oracle/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "3.13.0-45-generic", arch: "amd64", family:
> "unix"
> [INFO] Error stacktraces are turned on.
> [DEBUG] Reading global settings from /usr/share/maven/conf/settings.xml
> [DEBUG] Reading user settings from /home/adam/.m2/settings.xml
> [DEBUG] Using local repository at /home/adam/.m2/repository
> [DEBUG] Using manager EnhancedLocalRepositoryManager with priority 10 for
> /home/adam/.m2/repository
> [INFO] Scanning for projects...
> [DEBUG] Extension realms for project
> com.oreilly.learningsparkexamples:java:jar:0.0.2: (none)
> [DEBUG] Looking up lifecyle mappings for packaging jar from
> ClassRealm[plexus.core, parent: null]
> [ERROR] The build could not read 1 project -> [Help 1]
> org.apache.maven.project.ProjectBuildingException: Some problems were
> encountered while processing the POMs:
> [ERROR] 'dependencies.dependency.artifactId' for
> org.scalatest:scalatest_${scala.binary.version}:jar with value
> 'scalatest_${scala.binary.version}' does not match a valid id pattern. @
> line 101, column 19
> at
> org.apache.maven.project.DefaultProjectBuilder.build(DefaultProjectBuilder.java:363)
> at org.apache.maven.DefaultMaven.collectProjects(DefaultMaven.java:636)
> at
> org.apache.maven.DefaultMaven.getProjectsForMavenReactor(DefaultMaven.java:585)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:234)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> [ERROR]
> [ERROR]   The project com.oreilly.learningsparkexamples:java:0.0.2
> (/home/adam/learning-spark/learning-spark-master/pom.xml) has 1 error
> [ERROR] 'dependencies.dependency.artifactId' for
> org.scalatest:scalatest_${scala.binary.version}:jar with value
> 'scalatest_${scala.binary.version}' does not match a valid id pattern. @
> line 101, column 19
> [ERROR]
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException


As a further step, I would like to know how to build it against DataStax
Enterprise 4.6.2.

Any help is appreciated!


*// Adamantios*


Re: How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Adamantios Corais
This doesn't solve my problem... Apparently, my problem is that from time
to time I accidentally press CTRL + C (instead of CTRL + ALT + V for
copying commands into the shell) and that results in closing my shell. In
order to solve this I was wondering if I could just deactivate the CTRL + C
combination altogether! Any ideas?


*// Adamantios*



On Fri, Mar 13, 2015 at 7:37 PM, Marcelo Vanzin  wrote:

> You can type ":quit".
>
> On Fri, Mar 13, 2015 at 10:29 AM, Adamantios Corais
>  wrote:
> > Hi,
> >
> > I want change the default combination of keys that exit the Spark shell
> > (i.e. CTRL + C) to something else, such as CTRL + H?
> >
> > Thank you in advance.
> >
> > // Adamantios
> >
> >
> >
>
>
>
> --
> Marcelo
>


How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Adamantios Corais
Hi,

I want to change the default combination of keys that exits the Spark shell
(i.e. CTRL + C) to something else, such as CTRL + H.

Thank you in advance.


*// Adamantios*


Spark (SQL) as OLAP engine

2015-02-03 Thread Adamantios Corais
Hi,

After some research I have decided that Spark (SQL) would be ideal for
building an OLAP engine. My goal is to push aggregated data (to Cassandra
or other low-latency data storage) and then be able to project the results
on a web page (web service). New data will be added (aggregated) once a
day, only. On the other hand, the web service must be able to run some
fixed(?) queries (either on Spark or Spark SQL) at anytime and plot the
results with D3.js. Note that I can already achieve similar speeds while in
REPL mode by caching the data. Therefore, I believe that my problem must be
re-phrased as follows: "How can I automatically cache the data once a day
and make them available on a web service that is capable of running any
Spark or Spark (SQL)  statement in order to plot the results with D3.js?"
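
For reference, a minimal sketch of the "cache once a day" part, assuming a 1.4+-style DataFrame API and purely illustrative names: a scheduled job re-reads the day's aggregates and re-caches them in a long-running SparkContext, which the web service (for example a thin HTTP layer or a REST job server such as spark-jobserver) then queries.

import org.apache.spark.sql.SQLContext

def refreshDailyAggregates(sqlContext: SQLContext, path: String): Unit = {
  val aggregates = sqlContext.read.parquet(path)      // today's pre-aggregated output
  aggregates.registerTempTable("daily_aggregates")
  sqlContext.cacheTable("daily_aggregates")           // keep it hot for low-latency queries
}

// The web service can then serve fixed queries for D3.js, e.g.:
// sqlContext.sql("SELECT dim, SUM(metric) AS total FROM daily_aggregates GROUP BY dim")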

Note that I have already some experience in Spark (+Spark SQL) as well as
D3.js but not at all with OLAP engines (at least in their traditional form).

Any ideas or suggestions?


*// Adamantios*


Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread Adamantios Corais
Hi,

I am using Spark 0.9.1 and I am looking for a proper viz tool that
supports that specific version. As far as I have seen, all relevant tools
(e.g. spark-notebook, zeppelin-project etc.) only support 1.1 or 1.2; there is
no mention of older versions of Spark. Any ideas or suggestions?


*// Adamantios*


Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Adamantios Corais
Hi again,

As Jimmy said, any thoughts about Luigi and/or any other tools? So far it
seems that Oozie is the best and only choice here. Is that right?

On Mon, Nov 10, 2014 at 8:43 PM, Jimmy McErlain 
wrote:

> I have used Oozie for all our workflows with Spark apps, but you will have
> to use a Java action as the workflow element. I am interested in anyone's
> experience with Luigi and/or any other tools.
>
>
> On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais <
> adamantios.cor...@gmail.com> wrote:
>
>> I have some previous experience with Apache Oozie while I was developing
>> in Apache Pig. Now, I am working explicitly with Apache Spark and I am
>> looking for a tool with similar functionality. Is Oozie recommended? What
>> about Luigi? What do you use \ recommend?
>>
>
>
>
> --
>
>
> "Nothing under the sun is greater than education. By educating one person
> and sending him/her into the society of his/her generation, we make a
> contribution extending a hundred generations to come."
> -Jigoro Kano, Founder of Judo-
>


which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Adamantios Corais
I have some previous experience with Apache Oozie from when I was developing in
Apache Pig. Now I am working explicitly with Apache Spark, and I am looking
for a tool with similar functionality. Is Oozie recommended? What about
Luigi? What do you use / recommend?


Re: return probability \ confidence instead of actual class

2014-10-11 Thread Adamantios Corais
Thank you Sean. I'll try to do it externally as you suggested; however, can
you please give me some hints on how to do that? In fact, where can I find
the 1.2 implementation you just mentioned? Thanks!
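
For reference, a sketch of the "roll your own" route: compute the raw decision value w.dot(x) + b from the trained model's public weights and intercept, and optionally squash it into (0, 1) if a probability-like score is easier to threshold. The logistic squashing is an ad-hoc choice here (Platt scaling would fit it properly), not something MLlib does for SVMs; newer releases also let you call model.clearThreshold() so that predict() returns the raw margin directly, as mentioned later in this thread.

import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.linalg.Vector

// Raw margin of the separating hyperplane for one example.
def decisionValue(model: SVMModel, x: Vector): Double =
  model.weights.toArray.zip(x.toArray).map { case (w, xi) => w * xi }.sum + model.intercept

// Ad-hoc squashing into (0, 1); not a calibrated probability.
def confidence(model: SVMModel, x: Vector): Double =
  1.0 / (1.0 + math.exp(-decisionValue(model, x)))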




On Wed, Oct 8, 2014 at 12:58 PM, Sean Owen  wrote:

> Plain old SVMs don't produce an estimate of class probabilities;
> predict_proba() does some additional work to estimate class
> probabilities from the SVM output. Spark does not implement this right
> now.
>
> Spark implements the equivalent of decision_function (the wTx + b bit)
> but does not expose it, and instead gives you predict(), which gives 0
> or 1 depending on whether the decision function exceeds the specified
> threshold.
>
> Yes you can roll your own just like you did to calculate the decision
> function from weights and intercept. I suppose it would be nice to
> expose it (do I hear a PR?) but it's not hard to do externally. You'll
> have to do this anyway if you're on anything earlier than 1.2.
>
> On Wed, Oct 8, 2014 at 10:17 AM, Adamantios Corais
>  wrote:
> > ok let me rephrase my question once again. python-wise I am preferring
> > .predict_proba(X) instead of .decision_function(X) since it is easier
> for me
> > to interpret the results. as far as I can see, the latter functionality
> is
> > already implemented in Spark (well, in version 0.9.2 for example I have
> to
> > compute the dot product on my own otherwise I get 0 or 1) but the former
> is
> > not implemented (yet!). what should I do \ how to implement that one in
> > Spark as well? what are the required inputs here and how does the formula
> > look like?
> >
> > On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen  wrote:
> >>
> >> It looks like you are directly computing the SVM decision function in
> >> both cases:
> >>
> >> val predictions2 = m_users_double.map{point=>
> >>   point.zip(weights).map(a=> a._1 * a._2).sum + intercept
> >> }.cache()
> >>
> >> clf.decision_function(T)
> >>
> >> This does not give you +1/-1 in SVMs (well... not for most points,
> >> which will be outside the margin around the separating hyperplane).
> >>
> >> You can use the predict() function in SVMModel -- which will give you
> >> 0 or 1 (rather than +/- 1 but that's just differing convention)
> >> depending on the sign of the decision function. I don't know if this
> >> was in 0.9.
> >>
> >> At the moment I assume you saw small values of the decision function
> >> in scikit because of the radial basis function.
>


Re: return probability \ confidence instead of actual class

2014-10-08 Thread Adamantios Corais
OK, let me rephrase my question once again. Python-wise, I prefer
.predict_proba(X) over .decision_function(X) since it is easier for
me to interpret the results. As far as I can see, the latter functionality
is already implemented in Spark (well, in version 0.9.2 for example I have
to compute the dot product on my own, otherwise I get 0 or 1) but the former
is not implemented (yet!). What should I do / how do I implement that one in
Spark as well? What are the required inputs here and what does the formula
look like?

On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen  wrote:

> It looks like you are directly computing the SVM decision function in
> both cases:
>
> val predictions2 = m_users_double.map{point=>
>   point.zip(weights).map(a=> a._1 * a._2).sum + intercept
> }.cache()
>
> clf.decision_function(T)
>
> This does not give you +1/-1 in SVMs (well... not for most points,
> which will be outside the margin around the separating hyperplane).
>
> You can use the predict() function in SVMModel -- which will give you
> 0 or 1 (rather than +/- 1 but that's just differing convention)
> depending on the sign of the decision function. I don't know if this
> was in 0.9.
>
> At the moment I assume you saw small values of the decision function
> in scikit because of the radial basis function.
>
> On Tue, Oct 7, 2014 at 7:45 PM, Sunny Khatri  wrote:
> > Not familiar with scikit SVM implementation ( and I assume you are using
> > linearSVC). To figure out an optimal decision boundary based on the
> scores
> > obtained, you can use an ROC curve varying your thresholds.
> >
>


Re: return probability \ confidence instead of actual class

2014-10-07 Thread Adamantios Corais
Well, apparently, the above Python set-up is wrong. Please consider the
following set-up, which DOES use a 'linear' kernel... And the question remains
the same: how should I interpret the Spark results (or why are the Spark
results NOT bounded between -1 and +1)?
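
As a side note, a sketch only (not MLlib API): the raw SVM output w.dot(x) + b is a margin whose scale depends on the learned weights, so it is not bounded in [-1, +1]. Dividing by the weight norm gives a signed geometric distance to the separating hyperplane, which is easier to compare across differently scaled models.

import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.linalg.Vector

def signedDistance(model: SVMModel, x: Vector): Double = {
  val w = model.weights.toArray
  val margin = w.zip(x.toArray).map { case (wi, xi) => wi * xi }.sum + model.intercept
  val norm = math.sqrt(w.map(wi => wi * wi).sum)
  if (norm == 0.0) margin else margin / norm      // signed distance to the hyperplane
}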

On Mon, Oct 6, 2014 at 8:35 PM, Sunny Khatri  wrote:

> One difference I can find is that you may have different kernel functions for your
> training: in Spark, you end up using a linear kernel, whereas for scikit you
> are using an rbf kernel. That can explain the difference in the coefficients
> you are getting.
>
> On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais <
> adamantios.cor...@gmail.com> wrote:
>
>> Hi again,
>>
>> Finally, I found the time to play around with your suggestions.
>> Unfortunately, I noticed some unusual behavior in the MLlib results, which
>> is more obvious when I compare them against their scikit-learn equivalent.
>> Note that I am currently using spark 0.9.2. Long story short: I find it
>> difficult to interpret the result: scikit-learn SVM always returns a value
>> between 0 and 1 which makes it easy for me to set-up a threshold in order
>> to keep only the most significant classifications (this is the case for
>> both short and long input vectors). On the other hand, Spark MLlib makes it
>> impossible to interpret the results; results are hardly ever bounded
>> between -1 and +1 and hence it is impossible to choose a good cut-off value
>> - results are of no practical use. And here is the strangest thing ever:
>> although - it seems that - MLlib does NOT generate the right weights and
>> intercept, when I feed the MLlib with the weights and intercept from
>> scikit-learn the results become pretty accurate Any ideas about what is
>> happening? Any suggestion is highly appreciated.
>>
>> PS: to make things easier I have quoted both of my implementations as well
>> as results, below.
>>
>> //
>>
>> SPARK (short input):
>> training_error: Double = 0.0
>> res2: Array[Double] = Array(-1.4420684459128205E-19,
>> -1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749,
>> 0.7498, 0.7498, 0.7498)
>>
>> SPARK (long input):
>> training_error: Double = 0.0
>> res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241,
>> -0.782207630902241, 0.9522394329769612, 2.6866864968561632,
>> 2.6866864968561632, 2.6866864968561632)
>>
>> PYTHON (short input):
>> array([[-1.0001],
>>[-1.0001],
>>[-1.0001],
>>[-0.],
>>[ 1.0001],
>>[ 1.0001],
>>[ 1.0001]])
>>
>> PYTHON (long input):
>> array([[-1.0001],
>>[-1.0001],
>>[-1.0001],
>>[-0.],
>>[ 1.0001],
>>[ 1.0001],
>>[ 1.0001]])
>>
>> //
>>
>> import analytics.MSC
>>
>> import java.util.Calendar
>> import java.text.SimpleDateFormat
>> import scala.collection.mutable
>> import scala.collection.JavaConversions._
>> import org.apache.spark.SparkContext._
>> import org.apache.spark.mllib.classification.SVMWithSGD
>> import org.apache.spark.mllib.regression.LabeledPoint
>> import org.apache.spark.mllib.optimization.L1Updater
>> import com.datastax.bdp.spark.connector.CassandraConnector
>> import com.datastax.bdp.spark.SparkContextCassandraFunctions._
>>
>> val sc = MSC.sc
>> val lg = MSC.logger
>>
>> //val s_users_double_2 = Seq(
>> //  (0.0,Seq(0.0, 0.0, 0.0)),
>> //  (0.0,Seq(0.0, 0.0, 0.0)),
>> //  (0.0,Seq(0.0, 0.0, 0.0)),
>> //  (1.0,Seq(1.0, 1.0, 1.0)),
>> //  (1.0,Seq(1.0, 1.0, 1.0)),
>> //  (1.0,Seq(1.0, 1.0, 1.0))
>> //)
>> val s_users_double_2 = Seq(
>> (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>> (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>> (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),

Re: return probability \ confidence instead of actual class

2014-10-06 Thread Adamantios Corais
5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5],
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
]

clf = svm.SVC()
clf.fit(X, Y)
svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
clf.decision_function(T)

///




On Thu, Sep 25, 2014 at 2:25 AM, Sunny Khatri  wrote:

> For multi-class you can use the same SVMWithSGD (for binary
> classification) with a One-vs-All approach, constructing the respective training
> corpora with one class i as positive samples and the rest of the classes
> as negative ones, and then use the same method provided by Aris as a measure
> of how far class i is from the decision boundary.
>
> On Wed, Sep 24, 2014 at 4:06 PM, Aris  wrote:
>
>> Greetings, Adamantios Corais... if that is indeed your name.
>>
>> Just to follow up on Liquan, you might be interested in removing the
>> thresholds and then treating the predictions as a probability from 0..1
>> inclusive. SVM with the linear kernel is a straightforward linear
>> classifier -- so with model.clearThreshold() you can just get the
>> raw predicted scores, removing the threshold which simply translates them
>> into a positive/negative class.
>>
>> API is here
>> http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>
>> Enjoy!
>> Aris
>>
>> On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei  wrote:
>>
>>> HI Adamantios,
>>>
>>> For your first question, after you train the SVM, you get a model with a
>>> vector of weights w and an intercept b; points x such that w.dot(x) + b = 1
>>> or w.dot(x) + b = -1 lie on the decision boundary. The
>>> quantity w.dot(x) + b for point x is a confidence measure of the
>>> classification.
>>>
>>> Code wise, suppose you trained your model via
>>> val model = SVMWithSGD.train(...)
>>>
>>> and you can set a threshold by calling
>>>
>>> model.setThreshold(your threshold here)
>>>
>>> to set the threshold that separate positive predictions from negative
>>> predictions.
>>>
>>> For more info, please take a look at
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>
>>> For your second question, SVMWithSGD only supports binary
>>> classification.
>>>
>>> Hope this helps,
>>>
>>> Liquan
>>>
>>> On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais <
>>> adamantios.cor...@gmail.com> wrote:
>>>
>>>> Nobody?
>>>>
>>>> If that's not supported already, can please, at least, give me a few
>>>> hints on how to implement it?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais <
>>>> adamantios.cor...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am working with the SVMWithSGD classification algorithm on Spark. It
>>>>> works fine for me, however, I would like to recognize the instances that
>>>>> are classified with a high confidence from those with a low one. How do we
>>>>> define the threshold here? Ultimately, I want to keep only those for which
>>>>> the algorithm is very *very* certain about its its decision! How to do
>>>>> that? Is this feature supported already by any MLlib algorithm? What if I
>>>>> had multiple categories?
>>>>>
>>>>> Any input is highly appreciated!
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Liquan Pei
>>> Department of Physics
>>> University of Massachusetts Amherst
>>>
>>
>>
>


Re: return probability \ confidence instead of actual class

2014-09-21 Thread Adamantios Corais
Nobody?

If that's not supported already, can you please, at least, give me a few hints
on how to implement it?

Thanks!


On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais <
adamantios.cor...@gmail.com> wrote:

> Hi,
>
> I am working with the SVMWithSGD classification algorithm on Spark. It
> works fine for me, however, I would like to recognize the instances that
> are classified with a high confidence from those with a low one. How do we
> define the threshold here? Ultimately, I want to keep only those for which
> the algorithm is very *very* certain about its its decision! How to do
> that? Is this feature supported already by any MLlib algorithm? What if I
> had multiple categories?
>
> Any input is highly appreciated!
>


return probability \ confidence instead of actual class

2014-09-19 Thread Adamantios Corais
Hi,

I am working with the SVMWithSGD classification algorithm on Spark. It
works fine for me; however, I would like to distinguish the instances that
are classified with a high confidence from those with a low one. How do we
define the threshold here? Ultimately, I want to keep only those for which
the algorithm is very *very* certain about its decision! How do I do
that? Is this feature supported already by any MLlib algorithm? What if I
had multiple categories?

Any input is highly appreciated!