Re: K Means Clustering Explanation

2018-03-02 Thread Matt Hicks
Thanks Alessandro and Christoph.  I appreciate the feedback, but I'm still
having issues determining how to actually accomplish this with the API.
Can anyone point me to an example in code showing how to accomplish this?  





On Fri, Mar 2, 2018 2:37 AM, Alessandro Solimando <alessandro.solima...@gmail.com> wrote:
Hi Matt,
Similarly to what Christoph does, I first derive the cluster id for the
elements of my original dataset, and then I use a classification algorithm
(cluster ids being the classes here).
For this method to be useful you need a "human-readable" model; tree-based
models are generally a good choice (e.g., a Decision Tree).

However, since those models tend to be verbose, you still need a way to
summarize them to facilitate readability (there must be some literature on this
topic, although I have no pointers to provide).
Hth,
Alessandro
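
In code, a minimal sketch of this approach with Spark ML could look like the
following (names such as df, the column names, k = 10, and maxDepth = 4 are
placeholders, not values from the thread):

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.functions.col

// Assumes `df` already has a "features" vector column (e.g. built with VectorAssembler).
val kmeans = new KMeans().setK(10).setFeaturesCol("features").setPredictionCol("cluster")
val kmeansModel = kmeans.fit(df)

// Label every row with its cluster id, cast to double so it can serve as a class label.
val labeled = kmeansModel.transform(df)
  .withColumn("clusterLabel", col("cluster").cast("double"))

// Fit a shallow decision tree that predicts the cluster id from the same features;
// its splits give a human-readable description of what distinguishes each cluster.
val tree = new DecisionTreeClassifier()
  .setLabelCol("clusterLabel")
  .setFeaturesCol("features")
  .setMaxDepth(4)
val treeModel = tree.fit(labeled)

println(treeModel.toDebugString)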



On 1 March 2018 at 21:59, Christoph Brücke <carabo...@gmail.com>  wrote:
Hi Matt,
I see. You could use the trained model to predict the cluster id for each
training point. Then you can create a dataset with your original input data and
the associated cluster id for each data point. Finally, group this dataset by
cluster id and aggregate over the original 5 features, e.g., take the mean for
numerical data or the most frequent value for categorical data.
The exact aggregation is use-case dependent.
I hope this helps,
Christoph
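
A rough sketch of what Christoph describes, assuming `model` is the fitted
KMeansModel (with the default "prediction" output column) and `df` is the
original training data with hypothetical numeric columns f1 ... f5:

import org.apache.spark.sql.functions.{avg, count}

// Attach the cluster id to every input row.
val withCluster = model.transform(df)

// Summarize each cluster over the original features (mean for numeric columns;
// for categorical columns you would use the most frequent value instead).
val clusterSummary = withCluster
  .groupBy("prediction")
  .agg(
    count("*").as("size"),
    avg("f1").as("avg_f1"),
    avg("f2").as("avg_f2"),
    avg("f3").as("avg_f3"),
    avg("f4").as("avg_f4"),
    avg("f5").as("avg_f5")
  )

clusterSummary.orderBy("prediction").show(truncate = false)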

On 01.03.2018 21:40, "Matt Hicks" <m...@outr.com> wrote:
Thanks for the response, Christoph.
I'm converting large amounts of data into clustering training data, and I'm
having a hard time reasoning about how to reverse the clusters (in code) back to
the original format to properly understand the dominant values in each cluster.
For example, if I have five fields of data and I've trained ten clusters, I'd
like to output the five fields of data as represented by each of the ten
clusters.





On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke <carabo...@gmail.com> wrote:
Hi Matt,
the clusters are defined by their centroids / cluster centers. All the points
belonging to a certain cluster are closer to its centroid than to the centroid
of any other cluster.
What I typically do is convert the cluster centers back to the original input
format, or, if that is not possible, use the point nearest to the cluster center
as a representation of the whole cluster.
Can you be a little bit more specific about your use-case?
Best,
Christoph
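
For reference, a sketch of both options in Spark ML, assuming `kmeansModel` is a
fitted org.apache.spark.ml.clustering.KMeansModel and `df` has a "features"
column (the variable names are placeholders):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, udf}

// Option 1: look at the centroids themselves. They live in feature space, so any
// scaling/encoding has to be reversed by hand to get back to the original input format.
kmeansModel.clusterCenters.zipWithIndex.foreach { case (center, id) =>
  println(s"cluster $id centroid: $center")
}

// Option 2: pick, per cluster, the real data point closest to the centroid.
val centers = kmeansModel.clusterCenters
val sqDistToCenter = udf { (features: Vector, cluster: Int) =>
  Vectors.sqdist(features, centers(cluster))
}
val byCluster = Window.partitionBy("prediction").orderBy("dist")
val representatives = kmeansModel.transform(df)
  .withColumn("dist", sqDistToCenter(col("features"), col("prediction")))
  .withColumn("rank", row_number().over(byCluster))
  .filter(col("rank") === 1)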
On 01.03.2018 20:53, "Matt Hicks" <m...@outr.com> wrote:
I'm using K Means clustering for a project right now, and it's working very
well.  However, I'd like to determine from the clusters what information
distinctions define each cluster so I can explain the "reasons" data fits into a
specific cluster.
Is there a proper way to do this in Spark ML?

K Means Clustering Explanation

2018-03-01 Thread Matt Hicks
I'm using K Means clustering for a project right now, and it's working very
well.  However, I'd like to determine from the clusters what information
distinctions define each cluster so I can explain the "reasons" data fits into a
specific cluster.
Is there a proper way to do this in Spark ML?

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-16 Thread Matt Hicks
If I try to use LogisticRegression with only positive training examples, it
always gives me positive results:

// Positive Only
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
private def positiveOnly(): Unit = {
  val training = spark.createDataFrame(Seq(
    (1.0, Vectors.dense(0.0, 1.1, 0.1)),
    (1.0, Vectors.dense(0.0, 1.0, -1.0)),
    (1.0, Vectors.dense(0.2, 1.3, 1.0)),
    (1.0, Vectors.dense(0.1, 1.2, -0.5))
  )).toDF("label", "features")
  val lr = new LogisticRegression()
  lr.setMaxIter(10).setRegParam(0.01)
  val model = lr.fit(training)
  val test = spark.createDataFrame(Seq(
    (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
    (0.0, Vectors.dense(3.0, 2.0, -0.1)),
    (1.0, Vectors.dense(0.0, 2.2, -1.5))
  )).toDF("label", "features")
  model.transform(test)
    .select("features", "label", "probability", "prediction")
    .collect()
    .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
      println(s"($features, $label) -> prob=$prob, prediction=$prediction")
    }
}



The results look like this:
[info] ([-1.0,1.5,1.3], 1.0) -> prob=[0.0,1.0], prediction=1.0
[info] ([3.0,2.0,-0.1], 0.0) -> prob=[0.0,1.0], prediction=1.0
[info] ([0.0,2.2,-1.5], 1.0) -> prob=[0.0,1.0], prediction=1.0





On Tue, Jan 16, 2018 8:51 AM, Matt Hicks <m...@outr.com> wrote:
Hi Hari, I'm not sure I understand.  I apologize, I'm still pretty new to Spark
and Spark ML.  Can you point me to some example code or documentation that would
more fully represent this?
Thanks  





On Tue, Jan 16, 2018 2:54 AM, hosur narahari <hnr1...@gmail.com> wrote:
You can make use of the probability vector from Spark classification. When you
run a Spark classification model for prediction, along with assigning a class,
Spark also gives a probability vector (the probability that the example belongs
to each individual class). So just take the probability corresponding to the
donor class; that is the probability that the person will become a donor.
Best Regards,
Hari
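
A minimal sketch of this in Scala, assuming `lrModel` is a fitted binary
LogisticRegressionModel trained with label 1.0 meaning "donor", and `contacts`
is a DataFrame of new contacts with a "features" column (both names are
placeholders):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// The "probability" column is a vector with one entry per class;
// index 1 holds the probability of the positive ("donor") class.
val donorProbability = udf((probability: Vector) => probability(1))

val scored = lrModel.transform(contacts)
  .withColumn("donorProbability", donorProbability(col("probability")))

scored.select("features", "donorProbability").show(truncate = false)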
On 15 Jan 2018 11:51 p.m., "Matt Hicks" <m...@outr.com> wrote:
I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-16 Thread Matt Hicks
Hi Hari, I'm not sure I understand.  I apologize, I'm still pretty new to
Spark and Spark ML.  Can you point me to some example code or documentation that
would more fully represent this?
Thanks  





On Tue, Jan 16, 2018 2:54 AM, hosur narahari <hnr1...@gmail.com> wrote:
You can make use of the probability vector from Spark classification. When you
run a Spark classification model for prediction, along with assigning a class,
Spark also gives a probability vector (the probability that the example belongs
to each individual class). So just take the probability corresponding to the
donor class; that is the probability that the person will become a donor.
Best Regards,
Hari
On 15 Jan 2018 11:51 p.m., "Matt Hicks" <m...@outr.com> wrote:
I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Matt Hicks
Is it fair to assume this is what I need? https://github.com/ispras/pu4spark  





On Mon, Jan 15, 2018 1:55 PM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
As far as I know, Spark does not implement such algorithms. In case the dataset
is small,
http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html
might be of interest to you.
Jörn Franke <jornfra...@gmail.com> wrote on Mon, 15 Jan 2018 at 20:04:
I think you are looking more for unsupervised learning algorithms, e.g.,
clustering. Depending on the characteristics, different clusters might be
created, e.g., donor or non-donor. Most likely you will also find more clusters
(e.g., would donate but has a disease preventing it, or is too old). You can
verify which clusters make sense for your approach, so I recommend trying not
only two clusters but multiple, and seeing which number is more statistically
significant.
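
One way to sketch the "try several cluster counts" idea in Spark ML (assuming
Spark 2.3+ for ClusteringEvaluator, which scores a clustering by silhouette, and
a DataFrame `df` with a "features" column; both are assumptions, not details
from the thread):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val evaluator = new ClusteringEvaluator()

// Fit k-means for a range of k and compare silhouette scores; higher is better.
(2 to 10).foreach { k =>
  val model = new KMeans().setK(k).setSeed(42L).fit(df)
  val silhouette = evaluator.evaluate(model.transform(df))
  println(s"k=$k silhouette=$silhouette")
}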
On 15. Jan 2018, at 19:21, Matt Hicks <m...@outr.com> wrote:

I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com




[Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Matt Hicks
I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com