Best range of parameters for grid search?

2016-08-24 Thread Adamantios Corais
I would like to run a naive implementation of grid search with MLlib but 
I am a bit confused about choosing the 'best' range of parameters. 
Obviously, I do not want to waste too many resources on a combination 
of parameters that will probably not give a better model. Any suggestions 
from your experience?


val intercept   : List[Boolean]  = List(false)
val classes     : List[Int]      = List(2)
val validate    : List[Boolean]  = List(true)
val tolerance   : List[Double]   = List(0.001 , 0.01 , 0.1 , 0.0001 , 0.001 , 0.01 , 0.1 , 1.0)
val gradient    : List[Gradient] = List(new LogisticGradient() , new LeastSquaresGradient() , new HingeGradient())

val corrections : List[Int]      = List(5 , 10 , 15)
val iters       : List[Int]      = List(1 , 10 , 100 , 1000 , 10000)
val regparam    : List[Double]   = List(0.0 , 0.0001 , 0.001 , 0.01 , 0.1 , 1.0 , 10.0 , 100.0)
val updater     : List[Updater]  = List(new SimpleUpdater() , new L1Updater() , new SquaredL2Updater())


val combinations = for (a <- intercept;
b <- classes;
c <- validate;
d <- tolerance;
e <- gradient;
f <- corrections;
g <- iters;
h <- regparam;
i <- updater) yield (a,b,c,d,e,f,g,h,i)

for ((interceptS , classesS , validateS , toleranceS , gradientS ,
      correctionsS , itersS , regParamS , updaterS) <- combinations.take(3)) {

  val lr : LogisticRegressionWithLBFGS = new LogisticRegressionWithLBFGS().
    setIntercept(addIntercept = interceptS).
    setNumClasses(numClasses = classesS).
    setValidateData(validateData = validateS)

  lr.
    optimizer.
    setConvergenceTol(tolerance = toleranceS).
    setGradient(gradient = gradientS).
    setNumCorrections(corrections = correctionsS).
    setNumIterations(iters = itersS).
    setRegParam(regParam = regParamS).
    setUpdater(updater = updaterS)

}
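
If the goal is to avoid spending resources on combinations that will not give a better model, one option is to fit every candidate, score it on a held-out set, and keep only the winner. Below is a hedged sketch, not something from this thread: the train/validation RDD[LabeledPoint] splits and the small helper that applies one grid point to a fresh estimator are both assumptions.

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Fit every configuration, score it on held-out data, and keep the best by AUC,
// so model objects for weak parameter combinations are never retained.
def bestByAUC(train: RDD[LabeledPoint],
              validation: RDD[LabeledPoint],
              configs: Seq[LogisticRegressionWithLBFGS => Unit]): (Double, LogisticRegressionModel) = {
  configs.map { configure =>
    val lr = new LogisticRegressionWithLBFGS()
    configure(lr)                                         // apply one grid point
    val model = lr.run(train)
    model.clearThreshold()                                // raw scores, needed for the ROC metric
    val scoreAndLabels = validation.map(p => (model.predict(p.features), p.label))
    val auc = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
    (auc, model)
  }.maxBy(_._1)
}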





Re: Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Adamantios Corais

Great.

I like your second solution. But how can I make sure that cvModel holds 
the best model overall (as opposed to the last one that was tried out 
by the grid search)?


In addition, do you have an idea how to collect the average error for 
each point of the grid search (here a 1x1x1 grid)?




On 12/08/2016 08:59 μμ, Bryan Cutler wrote:
You will need to cast bestModel to include the MLWritable trait.  The 
class Model does not mix it in by default.  For instance:


cvModel.bestModel.asInstanceOf[MLWritable].save("/my/path")

Alternatively, you could save the CV model directly, which takes care 
of this


cvModel.save("/my/path")
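
For reference, a hedged sketch addressing both questions above, assuming Spark 1.6's CrossValidatorModel API (an assumption, not stated in this thread): bestModel already holds the model refit with the parameter combination that scored best on average across the folds (not the last one tried), and avgMetrics lines up index-by-index with the param maps built into paramGrid below.

// Hedged sketch against Spark 1.6: avgMetrics(i) is the cross-validated metric
// (here, area under ROC) for paramGrid(i).
paramGrid.zip(cvModel.avgMetrics).foreach { case (params, metric) =>
  println(s"$params -> average metric = $metric")
}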

On Fri, Aug 12, 2016 at 9:17 AM, Adamantios Corais 
<adamantios.cor...@gmail.com> wrote:


Hi,

Assuming that I have run the following pipeline and have got the
best logistic regression model. How can I then save that model for
later use? The following command throws an error:

cvModel.bestModel.save("/my/path")

Also, is it possible to get the error (a collection of) for each
combination of parameters?

I am using spark 1.6.2

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder , CrossValidator}

val lr = new LogisticRegression()

val pipeline = new Pipeline().
setStages(Array(lr))

val paramGrid = new ParamGridBuilder().
addGrid(lr.elasticNetParam , Array(0.1)).
addGrid(lr.maxIter , Array(10)).
addGrid(lr.regParam , Array(0.1)).
build()

val cv = new CrossValidator().
setEstimator(pipeline).
setEvaluator(new BinaryClassificationEvaluator).
setEstimatorParamMaps(paramGrid).
setNumFolds(2)

val cvModel = cv.
fit(training)








Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Adamantios Corais

Hi,

Assume that I have run the following pipeline and have obtained the best logistic 
regression model. How can I then save that model for later use? The following 
command throws an error:

cvModel.bestModel.save("/my/path")

Also, is it possible to get the error (a collection of) for each combination of 
parameters?

I am using spark 1.6.2

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder , CrossValidator}

val lr = new LogisticRegression()

val pipeline = new Pipeline().
setStages(Array(lr))

val paramGrid = new ParamGridBuilder().
addGrid(lr.elasticNetParam , Array(0.1)).
addGrid(lr.maxIter , Array(10)).
addGrid(lr.regParam , Array(0.1)).
build()

val cv = new CrossValidator().
setEstimator(pipeline).
setEvaluator(new BinaryClassificationEvaluator).
setEstimatorParamMaps(paramGrid).
setNumFolds(2)

val cvModel = cv.
fit(training)





Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Adamantios Corais
Great. So, provided that *model.theta* represents the log-probabilities (and
hence the result of *brzPi + brzTheta * testData.toBreeze* is a big number
too), how can I get back the *non*-log-probabilities, which - apparently -
are bounded between *0.0 and 1.0*?



*// Adamantios*



On Tue, Sep 1, 2015 at 12:57 PM, Sean Owen <so...@cloudera.com> wrote:

> (pedantic: it's the log-probabilities)
>
> On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang <yblia...@gmail.com> wrote:
> > Actually
> > brzPi + brzTheta * testData.toBreeze
> > is the probabilities of the input Vector on each class, however it's a
> > Breeze Vector.
> > Pay attention the index of this Vector need to map to the corresponding
> > label index.
> >
> > 2015-08-28 20:38 GMT+08:00 Adamantios Corais <
> adamantios.cor...@gmail.com>:
> >>
> >> Hi,
> >>
> >> I am trying to change the following code so as to get the probabilities
> of
> >> the input Vector on each class (instead of the class itself with the
> highest
> >> probability). I know that this is already available as part of the most
> >> recent release of Spark but I have to use Spark 1.1.0.
> >>
> >> Any help is appreciated.
> >>
> >>> override def predict(testData: Vector): Double = {
> >>> labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
> >>>   }
> >>
> >>
> >>>
> >>>
> https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
> >>
> >>
> >> // Adamantios
> >>
> >>
> >
>


Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Adamantios Corais
Thanks Sean. As far as I can see, the probabilities are NOT normalized; the
denominator isn't implemented in either v1.1.0 or v1.5.0 (by denominator,
I refer to the probability of feature X). So, for a given lambda, how can I
compute the denominator? FYI:
https://github.com/apache/spark/blob/v1.5.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala


*// Adamantios*



On Thu, Sep 10, 2015 at 7:03 PM, Sean Owen <so...@cloudera.com> wrote:

> The log probabilities are unlikely to be very large, though the
> probabilities may be very small. The direct answer is to exponentiate
> brzPi + brzTheta * testData.toBreeze -- apply exp(x).
>
> I have forgotten whether the probabilities are normalized already
> though. If not you'll have to normalize to get them to sum to 1 and be
> real class probabilities. This is better done in log space though.
>
> On Thu, Sep 10, 2015 at 5:12 PM, Adamantios Corais
> <adamantios.cor...@gmail.com> wrote:
> > great. so, provided that model.theta represents the log-probabilities and
> > (hence the result of brzPi + brzTheta * testData.toBreeze is a big number
> > too), how can I get back the non-log-probabilities which - apparently -
> are
> > bounded between 0.0 and 1.0?
> >
> >
> > // Adamantios
> >
> >
> >
> > On Tue, Sep 1, 2015 at 12:57 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> (pedantic: it's the log-probabilities)
> >>
> >> On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang <yblia...@gmail.com>
> wrote:
> >> > Actually
> >> > brzPi + brzTheta * testData.toBreeze
> >> > is the probabilities of the input Vector on each class, however it's a
> >> > Breeze Vector.
> >> > Pay attention the index of this Vector need to map to the
> corresponding
> >> > label index.
> >> >
> >> > 2015-08-28 20:38 GMT+08:00 Adamantios Corais
> >> > <adamantios.cor...@gmail.com>:
> >> >>
> >> >> Hi,
> >> >>
> >> >> I am trying to change the following code so as to get the
> probabilities
> >> >> of
> >> >> the input Vector on each class (instead of the class itself with the
> >> >> highest
> >> >> probability). I know that this is already available as part of the
> most
> >> >> recent release of Spark but I have to use Spark 1.1.0.
> >> >>
> >> >> Any help is appreciated.
> >> >>
> >> >>> override def predict(testData: Vector): Double = {
> >> >>> labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
> >> >>>   }
> >> >>
> >> >>
> >> >>>
> >> >>>
> >> >>>
> https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
> >> >>
> >> >>
> >> >> // Adamantios
> >> >>
> >> >>
> >> >
> >
> >
>
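
For reference, a minimal sketch of the normalization Sean describes, done in log space for numerical stability. The logScores array is an assumption: it is taken to already hold brzPi(i) + brzTheta(i, ::) * x for each class i.

// Hedged sketch: turn per-class log scores into probabilities that sum to 1.
// Subtracting the max keeps exp() in a safe range; dividing by the sum normalizes.
def toProbabilities(logScores: Array[Double]): Array[Double] = {
  val max = logScores.max
  val exps = logScores.map(s => math.exp(s - max))
  val sum = exps.sum
  exps.map(_ / sum)
}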


How to determine a good set of parameters for a ML grid search task?

2015-08-28 Thread Adamantios Corais
I have a sparse dataset of size 775946 x 845372. I would like to perform a
grid search in order to tune the parameters of my LogisticRegressionWithSGD
model. I have noticed that building each model takes about 300 to
400 seconds, which means that in order to try all possible combinations of
parameters I would have to wait for about 24 hours. Most importantly though, I am
not sure whether the following combinations make sense at all. So, how should I
pick those parameters more wisely, and in a way that keeps the waiting time
down?

  val numIterations = Seq(100 , 500 , 1000 , 5000 , 10000 , 50000 , 100000 , 500000)
  val stepSizes = Seq(10 , 50 , 100 , 500 , 1000 , 5000 , 10000 , 50000)
  val miniBatchFractions = Seq(1.0)
  val updaters = Seq(new SimpleUpdater , new SquaredL2Updater , new L1Updater)


Any advice is appreciated.


*// Adamantios*
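
For what it's worth, a hedged sketch of one common compromise; every value below is an illustrative assumption, not a recommendation for this dataset. Keep each parameter on a coarse log scale and evaluate a random, budgeted subset of the Cartesian product instead of the full grid, so the total runtime stays bounded no matter how many ranges are added.

import scala.util.Random

// Illustrative log-spaced ranges (assumptions).
val regParams  = Seq(1e-4, 1e-3, 1e-2, 1e-1, 1.0)
val stepSizes  = Seq(1e-2, 1e-1, 1.0, 10.0)
val iterations = Seq(100, 500, 1000)

// Full Cartesian product, then a fixed budget of randomly chosen combinations.
val allCombinations = for (r <- regParams; s <- stepSizes; n <- iterations) yield (r, s, n)
val budgeted = Random.shuffle(allCombinations.toList).take(20)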


How to compute the probability of each class in Naive Bayes

2015-08-28 Thread Adamantios Corais
Hi,

I am trying to change the following code so as to get the probabilities of
the input Vector on each class (instead of the class itself with the
highest probability). I know that this is already available as part of the
most recent release of Spark but I have to use Spark 1.1.0.

Any help is appreciated.

override def predict(testData: Vector): Double = {
  labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
}


https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala



*// Adamantios*


Re: How to binarize data in spark

2015-08-07 Thread Adamantios Corais
I have ended up with the following piece of code but it turns out to be
really slow... Any other ideas, given that I can only use MLlib 1.2?

import org.apache.spark.mllib.linalg.Vectors

val data = test11.map(x => ((x(0) , x(1)) , x(2))).groupByKey().map(x =>
  (x._1 , x._2.toArray)).map{ x =>
  val lt : Array[Double] = new Array[Double](test12.size)
  val id = x._1._1
  val cl = x._1._2
  val dt = x._2
  var i = -1
  test12.foreach{ y => i += 1; lt(i) = if (dt contains y) 1.0 else 0.0 }
  val vs = Vectors.dense(lt)
  (id , cl , vs)
}



*// Adamantios*



On Fri, Aug 7, 2015 at 8:36 AM, Yanbo Liang yblia...@gmail.com wrote:

 I think you want to flatten the 1M products into a vector of 1M elements, most of
 which are of course zero.
 It looks like HashingTF
 https://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf
 can help you.
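
For what it's worth, a hedged sketch of that suggestion; the rows RDD of (user, class, product) strings and the bucket count are assumptions, not from the thread. Each user's product list is hashed straight into a sparse vector, so the dense 1M-element arrays are never built.

import org.apache.spark.SparkContext._            // pair-RDD operations on Spark 1.x
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def userVectors(rows: RDD[(String, String, String)]): RDD[(String, String, Vector)] = {
  val tf = new HashingTF(1 << 20)                 // 2^20 hash buckets; collisions are possible
  rows.map { case (user, cls, product) => ((user, cls), product) }
      .groupByKey()
      .map { case ((user, cls), products) => (user, cls, tf.transform(products)) }
}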

 2015-08-07 11:02 GMT+08:00 praveen S mylogi...@gmail.com:

 Use StringIndexer in MLlib 1.4:

 https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/ml/feature/StringIndexer.html

 On Thu, Aug 6, 2015 at 8:49 PM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 I have a set of data based on which I want to create a classification
 model. Each row has the following form:

 user1,class1,product1
 user1,class1,product2
 user1,class1,product5
 user2,class1,product2
 user2,class1,product5
 user3,class2,product1
 etc


 There are about 1M users, 2 classes, and 1M products. What I would like
 to do next is create the sparse vectors (something already supported by
 MLlib) BUT in order to apply that function I have to create the dense 
 vectors
 (with the 0s), first. In other words, I have to binarize my data. What's
 the easiest (or most elegant) way of doing that?


 *// Adamantios*







How to binarize data in spark

2015-08-06 Thread Adamantios Corais
I have a set of data based on which I want to create a classification
model. Each row has the following form:

user1,class1,product1
 user1,class1,product2
 user1,class1,product5
 user2,class1,product2
 user2,class1,product5
 user3,class2,product1
 etc


There are about 1M users, 2 classes, and 1M products. What I would like to
do next is create the sparse vectors (something already supported by MLlib)
BUT in order to apply that function I have to create the dense vectors
(with the 0s), first. In other words, I have to binarize my data. What's
the easiest (or most elegant) way of doing that?


*// Adamantios*
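
For what it's worth, a hedged alternative sketch; the rows RDD of (user, class, product) strings is an assumption. It indexes the ~1M distinct products once and then builds each user's 0/1 vector directly as a sparse vector, skipping the dense intermediate entirely.

import org.apache.spark.SparkContext._            // pair-RDD operations on Spark 1.x
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def binarize(rows: RDD[(String, String, String)]): RDD[(String, String, Vector)] = {
  val productIndex = rows.map(_._3).distinct().zipWithIndex()
    .mapValues(_.toInt).collectAsMap()            // assumes ~1M entries fit on the driver
  val numProducts = productIndex.size
  rows.map { case (user, cls, product) => ((user, cls), productIndex(product)) }
      .groupByKey()
      .map { case ((user, cls), idxs) =>
        (user, cls, Vectors.sparse(numProducts, idxs.toSeq.distinct.map(i => (i, 1.0))))
      }
}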


Cannot build learning spark project

2015-04-06 Thread Adamantios Corais
Hi,

I am trying to build this project
https://github.com/databricks/learning-spark with mvn package.This should
work out of the box but unfortunately it doesn't. In fact, I get the
following error:

mvn pachage -X
 Apache Maven 3.0.5
 Maven home: /usr/share/maven
 Java version: 1.7.0_76, vendor: Oracle Corporation
 Java home: /usr/lib/jvm/java-7-oracle/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: linux, version: 3.13.0-45-generic, arch: amd64, family:
 unix
 [INFO] Error stacktraces are turned on.
 [DEBUG] Reading global settings from /usr/share/maven/conf/settings.xml
 [DEBUG] Reading user settings from /home/adam/.m2/settings.xml
 [DEBUG] Using local repository at /home/adam/.m2/repository
 [DEBUG] Using manager EnhancedLocalRepositoryManager with priority 10 for
 /home/adam/.m2/repository
 [INFO] Scanning for projects...
 [DEBUG] Extension realms for project
 com.oreilly.learningsparkexamples:java:jar:0.0.2: (none)
 [DEBUG] Looking up lifecyle mappings for packaging jar from
 ClassRealm[plexus.core, parent: null]
 [ERROR] The build could not read 1 project - [Help 1]
 org.apache.maven.project.ProjectBuildingException: Some problems were
 encountered while processing the POMs:
 [ERROR] 'dependencies.dependency.artifactId' for
 org.scalatest:scalatest_${scala.binary.version}:jar with value
 'scalatest_${scala.binary.version}' does not match a valid id pattern. @
 line 101, column 19
 at
 org.apache.maven.project.DefaultProjectBuilder.build(DefaultProjectBuilder.java:363)
 at org.apache.maven.DefaultMaven.collectProjects(DefaultMaven.java:636)
 at
 org.apache.maven.DefaultMaven.getProjectsForMavenReactor(DefaultMaven.java:585)
 at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:234)
 at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
 at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
 at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
 at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
 at
 org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
 [ERROR]
 [ERROR]   The project com.oreilly.learningsparkexamples:java:0.0.2
 (/home/adam/learning-spark/learning-spark-master/pom.xml) has 1 error
 [ERROR] 'dependencies.dependency.artifactId' for
 org.scalatest:scalatest_${scala.binary.version}:jar with value
 'scalatest_${scala.binary.version}' does not match a valid id pattern. @
 line 101, column 19
 [ERROR]
 [ERROR]
 [ERROR] For more information about the errors and possible solutions,
 please read the following articles:
 [ERROR] [Help 1]
 http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException


As a further step I would like to know how to build it against DataStax
Enterprise 4.6.2

Any help is appreciated!


*// Adamantios*


How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Adamantios Corais
Hi,

I would like to change the default key combination that exits the Spark shell
(i.e. CTRL + C) to something else, such as CTRL + H. How can I do that?

Thank you in advance.


*// Adamantios*


Spark (SQL) as OLAP engine

2015-02-03 Thread Adamantios Corais
Hi,

After some research I have decided that Spark (SQL) would be ideal for
building an OLAP engine. My goal is to push aggregated data (to Cassandra
or another low-latency data store) and then be able to project the results
on a web page (web service). New data will be added (aggregated) only once a
day. On the other hand, the web service must be able to run some
fixed(?) queries (either on Spark or Spark SQL) at any time and plot the
results with D3.js. Note that I can already achieve similar speeds while in
REPL mode by caching the data. Therefore, I believe that my problem should be
re-phrased as follows: how can I automatically cache the data once a day
and make them available to a web service that is capable of running any
Spark or Spark SQL statement in order to plot the results with D3.js?

Note that I have already some experience in Spark (+Spark SQL) as well as
D3.js but not at all with OLAP engines (at least in their traditional form).

Any ideas or suggestions?


*// Adamantios*


Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread Adamantios Corais
Hi,

I am using Spark 0.9.1 and I am looking for a proper viz tool that
supports that specific version. As far as I have seen, all relevant tools
(e.g. spark-notebook, zeppelin-project etc.) only support 1.1 or 1.2, with
no mention of older versions of Spark. Any ideas or suggestions?


*// Adamantios*


which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Adamantios Corais
I have some previous experience with Apache Oozie while I was developing in
Apache Pig. Now, I am working explicitly with Apache Spark and I am looking
for a tool with similar functionality. Is Oozie recommended? What about
Luigi? What do you use \ recommend?


Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Adamantios Corais
Hi again,

As Jimmy said, any thoughts about Luigi and/or any other tools? So far it
seems that Oozie is the best and only choice here. Is that right?

On Mon, Nov 10, 2014 at 8:43 PM, Jimmy McErlain jimmy.mcerl...@gmail.com
wrote:

 I have used Oozie for all our workflows with Spark apps but you will have
 to use a java event as the workflow element.   I am interested in anyones
 experience with Luigi and/or any other tools.


 On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 I have some previous experience with Apache Oozie while I was developing
 in Apache Pig. Now, I am working explicitly with Apache Spark and I am
 looking for a tool with similar functionality. Is Oozie recommended? What
 about Luigi? What do you use \ recommend?




 --


 Nothing under the sun is greater than education. By educating one person
 and sending him/her into the society of his/her generation, we make a
 contribution extending a hundred generations to come.
 -Jigoro Kano, Founder of Judo-



Re: return probability \ confidence instead of actual class

2014-10-11 Thread Adamantios Corais
Thank you Sean. I'll try to do it externally as you suggested; however, can
you please give me some hints on how to do that? In fact, where can I find
the 1.2 implementation you just mentioned? Thanks!




On Wed, Oct 8, 2014 at 12:58 PM, Sean Owen so...@cloudera.com wrote:

 Plain old SVMs don't produce an estimate of class probabilities;
 predict_proba() does some additional work to estimate class
 probabilities from the SVM output. Spark does not implement this right
 now.

 Spark implements the equivalent of decision_function (the wTx + b bit)
 but does not expose it, and instead gives you predict(), which gives 0
 or 1 depending on whether the decision function exceeds the specified
 threshold.

 Yes you can roll your own just like you did to calculate the decision
 function from weights and intercept. I suppose it would be nice to
 expose it (do I hear a PR?) but it's not hard to do externally. You'll
 have to do this anyway if you're on anything earlier than 1.2.

 On Wed, Oct 8, 2014 at 10:17 AM, Adamantios Corais
 adamantios.cor...@gmail.com wrote:
  ok let me rephrase my question once again. python-wise I am preferring
  .predict_proba(X) instead of .decision_function(X) since it is easier
 for me
  to interpret the results. as far as I can see, the latter functionality
 is
  already implemented in Spark (well, in version 0.9.2 for example I have
 to
  compute the dot product on my own otherwise I get 0 or 1) but the former
 is
  not implemented (yet!). what should I do \ how to implement that one in
  Spark as well? what are the required inputs here and how does the formula
  look like?
 
  On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen so...@cloudera.com wrote:
 
  It looks like you are directly computing the SVM decision function in
  both cases:
 
  val predictions2 = m_users_double.map{ point =>
    point.zip(weights).map(a => a._1 * a._2).sum + intercept
  }.cache()
 
  clf.decision_function(T)
 
  This does not give you +1/-1 in SVMs (well... not for most points,
  which will be outside the margin around the separating hyperplane).
 
  You can use the predict() function in SVMModel -- which will give you
  0 or 1 (rather than +/- 1 but that's just differing convention)
  depending on the sign of the decision function. I don't know if this
  was in 0.9.
 
  At the moment I assume you saw small values of the decision function
  in scikit because of the radial basis function.



Re: return probability \ confidence instead of actual class

2014-10-08 Thread Adamantios Corais
OK, let me rephrase my question once again. Python-wise, I prefer
.predict_proba(X) over .decision_function(X) since it is easier for
me to interpret the results. As far as I can see, the latter functionality
is already implemented in Spark (well, in version 0.9.2, for example, I have
to compute the dot product on my own, otherwise I get 0 or 1) but the former
is not implemented (yet!). What should I do / how can I implement that one in
Spark as well? What are the required inputs here and what does the formula
look like?

On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen so...@cloudera.com wrote:

 It looks like you are directly computing the SVM decision function in
 both cases:

 val predictions2 = m_users_double.map{ point =>
   point.zip(weights).map(a => a._1 * a._2).sum + intercept
 }.cache()

 clf.decision_function(T)

 This does not give you +1/-1 in SVMs (well... not for most points,
 which will be outside the margin around the separating hyperplane).

 You can use the predict() function in SVMModel -- which will give you
 0 or 1 (rather than +/- 1 but that's just differing convention)
 depending on the sign of the decision function. I don't know if this
 was in 0.9.

 At the moment I assume you saw small values of the decision function
 in scikit because of the radial basis function.

 On Tue, Oct 7, 2014 at 7:45 PM, Sunny Khatri sunny.k...@gmail.com wrote:
  Not familiar with scikit SVM implementation ( and I assume you are using
  linearSVC). To figure out an optimal decision boundary based on the
 scores
  obtained, you can use an ROC curve varying your thresholds.
 



Re: return probability \ confidence instead of actual class

2014-10-06 Thread Adamantios Corais
, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
]

clf = svm.SVC()
clf.fit(X, Y)
svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
clf.decision_function(T)

///




On Thu, Sep 25, 2014 at 2:25 AM, Sunny Khatri sunny.k...@gmail.com wrote:

 For multi-class you can use the same SVMWithSGD (for binary
 classification) with One-vs-All approach constructing respective training
 corpuses consisting one Class i as positive samples and Rest of the classes
 as negative one, and then use the same method provided by Aris as a measure
 of how far Class i is from the decision boundary.

 On Wed, Sep 24, 2014 at 4:06 PM, Aris arisofala...@gmail.com wrote:

 Greetings Adamantios Corais - if that is indeed your name...

 Just to follow up on Liquan, you might be interested in removing the
 threshold, and then treating the predictions as a probability from 0..1
 inclusive. SVM with the linear kernel is a straightforward linear
 classifier -- so with model.clearThreshold() you can just get the
 raw predicted scores, removing the threshold which simply translates them
 into a positive/negative class.

 API is here
 http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel

 Enjoy!
 Aris

 On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei liquan...@gmail.com wrote:

 HI Adamantios,

 For your first question, after you train the SVM, you get a model with a
 vector of weights w and an intercept b; points x such that w.dot(x) + b = 1
 and w.dot(x) + b = -1 lie on the decision boundary. The
 quantity w.dot(x) + b for point x is a confidence measure of the
 classification.

 Code wise, suppose you trained your model via
 val model = SVMWithSGD.train(...)

 and you can set a threshold by calling

 model.setThreshold(your threshold here)

 to set the threshold that separate positive predictions from negative
 predictions.

 For more info, please take a look at
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
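
For reference, a minimal sketch of getting raw confidence scores rather than 0/1 labels; the training and test RDD[LabeledPoint] names are assumptions. Clearing the threshold makes predict() return the raw margin w.dot(x) + b, which can then be ranked or thresholded however you like.

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def confidenceScores(training: RDD[LabeledPoint], test: RDD[LabeledPoint]): RDD[(Double, Double)] = {
  val model = SVMWithSGD.train(training, 100)     // 100 iterations; illustrative
  model.clearThreshold()                          // predict() now returns the raw margin
  test.map(p => (model.predict(p.features), p.label))
}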

 For your second question, SVMWithSGD only supports binary
 classification.

 Hope this helps,

 Liquan

 On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 Nobody?

 If that's not supported already, can please, at least, give me a few
 hints on how to implement it?

 Thanks!


 On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 Hi,

 I am working with the SVMWithSGD classification algorithm on Spark. It
 works fine for me, however, I would like to recognize the instances that
 are classified with a high confidence from those with a low one. How do we
 define the threshold here? Ultimately, I want to keep only those for which
 the algorithm is very *very* certain about its its decision! How to do
 that? Is this feature supported already by any MLlib algorithm? What if I
 had multiple categories?

 Any input is highly appreciated!





 --
 Liquan Pei
 Department of Physics
 University of Massachusetts Amherst






Re: return probability \ confidence instead of actual class

2014-09-22 Thread Adamantios Corais
Nobody?

If that's not supported already, can you please at least give me a few hints
on how to implement it?

Thanks!


On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais 
adamantios.cor...@gmail.com wrote:

 Hi,

 I am working with the SVMWithSGD classification algorithm on Spark. It
 works fine for me, however, I would like to recognize the instances that
 are classified with a high confidence from those with a low one. How do we
 define the threshold here? Ultimately, I want to keep only those for which
 the algorithm is very *very* certain about its its decision! How to do
 that? Is this feature supported already by any MLlib algorithm? What if I
 had multiple categories?

 Any input is highly appreciated!



return probability \ confidence instead of actual class

2014-09-19 Thread Adamantios Corais
Hi,

I am working with the SVMWithSGD classification algorithm on Spark. It
works fine for me; however, I would like to distinguish the instances that
are classified with a high confidence from those with a low one. How do we
define the threshold here? Ultimately, I want to keep only those for which
the algorithm is very *very* certain about its decision! How can I do
that? Is this feature supported already by any MLlib algorithm? What if I
had multiple categories?

Any input is highly appreciated!