Best range of parameters for grid search?

2016-08-24 Thread Adamantios Corais
I would like to run a naive implementation of grid search with MLlib but I am a bit confused about choosing the 'best' range of parameters. Apparently, I do not want to waste too many resources on a combination of parameters that will probably not give a better model. Any suggestions from your
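A common way to avoid wasting resources is a coarse-to-fine search over log-spaced values: few, widely spaced candidates per parameter first, then refine around the winner. A minimal sketch in plain Python (the parameter names and ranges are illustrative, not MLlib defaults):

```python
import itertools

def log_range(start_exp, stop_exp, base=10.0):
    """Log-spaced values, e.g. 1e-4 .. 1e2 for a regularization parameter."""
    return [base ** e for e in range(start_exp, stop_exp + 1)]

# Coarse first pass: a handful of widely spaced values per parameter.
reg_params = log_range(-4, 2)   # 1e-4, 1e-3, ..., 1e2  (7 values)
step_sizes = log_range(-2, 0)   # 0.01, 0.1, 1.0        (3 values)
grid = list(itertools.product(reg_params, step_sizes))
print(len(grid))  # 21 candidate models instead of thousands
```

Once the coarse pass finds a promising region, a second, finer grid around that region is usually far cheaper than one dense grid over the full range.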

Re: Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Adamantios Corais
takes care of this cvModel.save("/my/path") On Fri, Aug 12, 2016 at 9:17 AM, Adamantios Corais <adamantios.cor...@gmail.com> wrote: Hi, Assuming that I have run the following pipeline and have got the best logistic regression model

Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Adamantios Corais
Hi, Assuming that I have run the following pipeline and have got the best logistic regression model. How can I then save that model for later use? The following command throws an error: cvModel.bestModel.save("/my/path") Also, is it possible to get the error (a collection of) for each
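The workaround suggested in the replies is to save the whole cvModel rather than bestModel. Outside Spark, the generic pattern — pick the best model by validation error, persist it, and reload it later — can be sketched with the standard library (the parameter/error values here are hypothetical):

```python
import os
import pickle
import tempfile

# Hypothetical grid-search results: (regularization value, validation error).
results = [(0.01, 0.34), (0.1, 0.21), (1.0, 0.27)]
best_param, best_err = min(results, key=lambda r: r[1])

# Persist the winning configuration for later use.
path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
with open(path, "wb") as f:
    pickle.dump({"reg_param": best_param, "val_error": best_err}, f)

# ... later, in another process:
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored["reg_param"])  # 0.1
```

As for the second question (the error for each combination): in more recent Spark versions, `CrossValidatorModel.avgMetrics` exposes the averaged metric for every parameter combination tried.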

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Adamantios Corais
the input Vector on each class, however it's a Breeze Vector. Pay attention: the index of this Vector needs to map to the corresponding label index. 2015-08-28 20:38 GMT+08:00 Adamantios Corais <adamantios.cor...@gmail.com>:

Re: How to compute the probability of each class in Naive Bayes

2015-09-10 Thread Adamantios Corais
On Thu, Sep 10, 2015 at 5:12 PM, Adamantios Corais <adamantios.cor...@gmail.com> wrote: great. so, provided that model.theta represents the log-probabilities and (hence the result of brzPi + brzTheta * testData.toBreeze is a big number too), how
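The quantity brzPi + brzTheta * testData.toBreeze is a per-class log-posterior up to a normalizing constant; turning those large-magnitude log scores into probabilities is a log-sum-exp normalization. A plain-Python sketch with toy numbers (two classes, two features):

```python
import math

def class_probabilities(log_prior, log_theta, x):
    """Posterior P(c|x) from Naive Bayes log-priors and log-likelihood weights.

    score_c = log pi_c + sum_i x_i * log theta_{c,i}
    (the same quantity as brzPi + brzTheta * x in the thread).
    Normalizing via log-sum-exp avoids overflow/underflow on the raw scores.
    """
    scores = [lp + sum(xi * lt for xi, lt in zip(x, row))
              for lp, row in zip(log_prior, log_theta)]
    m = max(scores)                      # subtract the max before exponentiating
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = class_probabilities(
    log_prior=[math.log(0.5), math.log(0.5)],
    log_theta=[[math.log(0.8), math.log(0.2)],
               [math.log(0.3), math.log(0.7)]],
    x=[1.0, 0.0],
)
print(probs)  # sums to 1; class 0 is more likely for this input
```

The max-subtraction is what makes this safe when the raw log scores are very large in magnitude, which is exactly the concern raised in the thread.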

How to determine a good set of parameters for a ML grid search task?

2015-08-28 Thread Adamantios Corais
I have a sparse dataset of size 775946 x 845372. I would like to perform a grid search in order to tune the parameters of my LogisticRegressionWithSGD model. I have noticed that the building of each model takes about 300 to 400 seconds. That means that in order to try all possible combinations of
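At 300–400 seconds per model, the total cost of an exhaustive search is simply the number of parameter combinations times the per-model time, so a quick back-of-the-envelope calculation shows why a coarse grid matters (the grid sizes below are illustrative):

```python
def grid_cost(n_values_per_param, n_params, secs_per_model=350, n_folds=1):
    """Total models and wall-clock hours for an exhaustive grid search."""
    n_models = n_values_per_param ** n_params * n_folds
    return n_models, n_models * secs_per_model / 3600.0

# e.g. 10 candidate values for each of 3 parameters, with 3-fold CV:
n, hours = grid_cost(10, 3, n_folds=3)
print(n, round(hours, 1))  # 3000 models, ~291.7 hours at 350 s each
```

Dropping to 4 log-spaced values per parameter cuts that to 4**3 * 3 = 192 models, which is why coarse-to-fine searches are the usual answer to this question.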

How to compute the probability of each class in Naive Bayes

2015-08-28 Thread Adamantios Corais
Hi, I am trying to change the following code so as to get the probabilities of the input Vector on each class (instead of the class itself with the highest probability). I know that this is already available as part of the most recent release of Spark but I have to use Spark 1.1.0. Any help is

Re: How to binarize data in spark

2015-08-07 Thread Adamantios Corais
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/ml/feature/StringIndexer.html On Thu, Aug 6, 2015 at 8:49 PM, Adamantios Corais adamantios.cor...@gmail.com wrote: I have a set of data based on which I want to create a classification model. Each row has the following form: user1

How to binarize data in spark

2015-08-06 Thread Adamantios Corais
I have a set of data based on which I want to create a classification model. Each row has the following form: user1,class1,product1 user1,class1,product2 user1,class1,product5 user2,class1,product2 user2,class1,product5 user3,class2,product1 etc There are about 1M users, 2 classes, and 1M
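The usual encoding for this layout is one binary indicator per product per user — the result of indexing the product column (what StringIndexer does in Spark ML) and then one-hot encoding it. A plain-Python sketch on the sample rows from the question:

```python
from collections import defaultdict

rows = [
    ("user1", "class1", "product1"),
    ("user1", "class1", "product2"),
    ("user1", "class1", "product5"),
    ("user2", "class1", "product2"),
    ("user2", "class1", "product5"),
    ("user3", "class2", "product1"),
]

# Assign every distinct product a column index.
products = sorted({p for _, _, p in rows})
index = {p: i for i, p in enumerate(products)}

# One row per user: a class label plus a 0/1 indicator per product.
features = defaultdict(lambda: [0] * len(products))
labels = {}
for user, cls, product in rows:
    features[user][index[product]] = 1
    labels[user] = cls

print(products)           # ['product1', 'product2', 'product5']
print(features["user1"])  # [1, 1, 1]
print(features["user2"])  # [0, 1, 1]
```

With ~1M products the dense lists above would be impractical; at that scale the same idea is expressed as sparse vectors of (index, 1.0) pairs, which is also how MLlib represents them.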

Cannot build learning spark project

2015-04-06 Thread Adamantios Corais
Hi, I am trying to build this project https://github.com/databricks/learning-spark with mvn package. This should work out of the box but unfortunately it doesn't. In fact, I get the following error: mvn pachage -X Apache Maven 3.0.5 Maven home: /usr/share/maven Java version: 1.7.0_76, vendor:

How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Adamantios Corais
Hi, I want to change the default combination of keys that exits the Spark shell (i.e. CTRL + C) to something else, such as CTRL + H. Thank you in advance. *// Adamantios*

Spark (SQL) as OLAP engine

2015-02-03 Thread Adamantios Corais
Hi, After some research I have decided that Spark (SQL) would be ideal for building an OLAP engine. My goal is to push aggregated data (to Cassandra or other low-latency data storage) and then be able to project the results on a web page (web service). New data will be added (aggregated) once a
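The core of the proposed design is a periodic group-by rollup whose precomputed results a low-latency store (Cassandra here) serves to the web page. The rollup step itself can be sketched with the standard library (the event schema and dimensions below are hypothetical):

```python
from collections import Counter

# Hypothetical raw events: (day, country, amount).
events = [
    ("2015-02-01", "GR", 10.0),
    ("2015-02-01", "GR", 5.0),
    ("2015-02-01", "DE", 7.0),
    ("2015-02-02", "GR", 2.0),
]

# Daily rollup keyed by (day, country) -- this is what the batch job would
# write to the store, so the web service only ever reads precomputed rows.
rollup = Counter()
for day, country, amount in events:
    rollup[(day, country)] += amount

print(rollup[("2015-02-01", "GR")])  # 15.0
```

In the Spark SQL version this loop becomes a GROUP BY over the new batch of data, appended to the store on each run; the key design choice is that all aggregation happens at write time, keeping the read path trivially fast.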

Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread Adamantios Corais
Hi, I am using Spark 0.9.1 and I am looking for a proper viz tool that supports that specific version. As far as I have seen, all relevant tools (e.g. spark-notebook, zeppelin-project etc) only support 1.1 or 1.2; no mentions of older versions of Spark. Any ideas or suggestions? *//

which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Adamantios Corais
I have some previous experience with Apache Oozie while I was developing in Apache Pig. Now, I am working explicitly with Apache Spark and I am looking for a tool with similar functionality. Is Oozie recommended? What about Luigi? What do you use \ recommend?

Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Adamantios Corais
will have to use a java event as the workflow element. I am interested in anyone's experience with Luigi and/or any other tools. On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais adamantios.cor...@gmail.com wrote: I have some previous experience with Apache Oozie while I was developing

Re: return probability \ confidence instead of actual class

2014-10-11 Thread Adamantios Corais
to do externally. You'll have to do this anyway if you're on anything earlier than 1.2. On Wed, Oct 8, 2014 at 10:17 AM, Adamantios Corais adamantios.cor...@gmail.com wrote: ok let me rephrase my question once again. python-wise I am preferring .predict_proba(X) instead of .decision_function(X

Re: return probability \ confidence instead of actual class

2014-10-08 Thread Adamantios Corais
ok let me rephrase my question once again. Python-wise, I prefer .predict_proba(X) to .decision_function(X) since it is easier for me to interpret the results. As far as I can see, the latter functionality is already implemented in Spark (well, in version 0.9.2 for example I have to
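A decision_function-style raw margin can be squashed into a predict_proba-style value in (0, 1) with a logistic mapping (the idea behind Platt scaling). A sketch in plain Python — note the squashing constants below are illustrative, not fitted:

```python
import math

def margin(w, b, x):
    """Raw SVM decision value w.x + b -- what decision_function returns."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def to_probability(m, a=-1.0, beta=0.0):
    """Platt-style squashing of a margin into (0, 1).

    In real Platt scaling, a and beta are fitted on held-out data;
    the defaults here are purely illustrative.
    """
    return 1.0 / (1.0 + math.exp(a * m + beta))

m = margin([0.5, -0.25], 0.1, [2.0, 1.0])  # 0.5*2 - 0.25*1 + 0.1 = 0.85
print(round(to_probability(m), 3))
```

With uncalibrated constants this only preserves the ranking of the margins; to get calibrated probabilities, a and beta must be fitted against held-out labels.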

Re: return probability \ confidence instead of actual class

2014-10-06 Thread Adamantios Corais
Adamantios Corais adamantios.cor...@gmail.com wrote: Nobody? If that's not supported already, can someone please at least give me a few hints on how to implement it? Thanks! On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais adamantios.cor...@gmail.com wrote: Hi, I am working

Re: return probability \ confidence instead of actual class

2014-09-22 Thread Adamantios Corais
Nobody? If that's not supported already, can someone please at least give me a few hints on how to implement it? Thanks! On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais adamantios.cor...@gmail.com wrote: Hi, I am working with the SVMWithSGD classification algorithm on Spark. It works fine

return probability \ confidence instead of actual class

2014-09-19 Thread Adamantios Corais
Hi, I am working with the SVMWithSGD classification algorithm on Spark. It works fine for me, however, I would like to recognize the instances that are classified with a high confidence from those with a low one. How do we define the threshold here? Ultimately, I want to keep only those for which
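With the model's built-in 0/1 threshold cleared (clearThreshold() on an MLlib SVMModel), predict returns the raw decision value instead of a class, and "high confidence" becomes a cutoff on that value's distance from zero. A toy sketch (the scores and cutoff below are hypothetical):

```python
# Hypothetical raw decision values from an SVM whose 0/1 threshold has been
# cleared, so each value is a signed distance from the separating plane.
scores = [2.3, -0.1, 0.4, -1.8, 0.05]

CUTOFF = 1.0  # margin magnitude treated as "confident" -- a tuning choice
confident = [(i, s) for i, s in enumerate(scores) if abs(s) >= CUTOFF]
print(confident)  # [(0, 2.3), (3, -1.8)]
```

The sign still gives the predicted class; only the magnitude is used to decide which predictions to keep, which answers the "define the threshold" part of the question: it is a cutoff on |raw score|, typically tuned on held-out data.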