Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-10 Thread Sean Owen
Scaling can mean scaling factors up or down so that they're all on a comparable scale. It certainly changes the sum of squared errors, but you can't compare this metric across scaled and unscaled data, exactly because one is on a totally different scale and will have quite different absolute
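Sean's point can be seen numerically. Below is a minimal NumPy sketch (standing in for Spark's StandardScaler and the KMeans cost, not actual Spark API calls): after standardizing every column to zero mean and unit variance, the sum of squared errors around the mean is exactly n × d whatever the original units were, so comparing it with the unscaled SSE tells you nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on wildly different scales (e.g. metres vs millimetres).
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])

def sse(X, center):
    # Sum of squared distances of every row to a single centre.
    return float(((X - center) ** 2).sum())

# Standardise each column to zero mean, unit variance
# (the same transformation StandardScaler applies).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

sse_raw = sse(X, X.mean(axis=0))
sse_scaled = sse(X_scaled, X_scaled.mean(axis=0))
print(sse_raw, sse_scaled)  # raw SSE is dominated by the large-scale feature
```

The scaled SSE here is n × d = 200 by construction, while the raw SSE is on the order of 10^8 because of the second feature's scale.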

Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-10 Thread Rohit Chaddha
Hi Sean, So basically I am trying to cluster a number of elements (it's a domain object called PItem) based on the quality factors of these items. These elements have 112 quality factors each. Now the issue is that when I am scaling the factors using StandardScaler I get a Sum of Squared Errors

Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-09 Thread Sean Owen
Fewer features doesn't necessarily mean better predictions, because indeed you are subtracting data. It might, because when done well you subtract more noise than signal. It is usually done to make data sets smaller or more tractable or to improve explainability. But you have an unsupervised

Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Robin East
Another approach is to use L1 regularisation, e.g. http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression. This adds a penalty term to the regression equation to reduce model complexity. When you use L1 (as opposed to, say, L2) this tends to
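To illustrate why L1 (unlike L2) tends to drive irrelevant coefficients to exactly zero, here is a plain NumPy sketch of the lasso solved by proximal gradient descent (ISTA), not Spark's own implementation. The data is synthetic: only two of the five factors actually influence the response, and the soft-thresholding step zeroes the rest out.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
true_w = np.array([2.0, 0.0, 0.0, -1.5, 0.0])  # only factors 0 and 3 matter
y = X @ true_w + 0.01 * rng.normal(size=n)

def lasso_ista(X, y, lam=0.1, iters=500):
    # Proximal gradient (ISTA): a gradient step on the least-squares loss,
    # then soft-thresholding -- the operation that produces exact zeros.
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant
    for _ in range(iters):
        g = X.T @ (X @ w - y)
        w = w - step * g
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam * len(y), 0.0)
    return w

w = lasso_ista(X, y)
print(np.round(w, 3))  # the three irrelevant coefficients end up exactly zero
```

The surviving coefficients are slightly shrunk toward zero (roughly by lam), which is the usual lasso bias; the point here is the exact sparsity pattern.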

Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Rohit Chaddha
@Peyman - do any of the clustering algorithms have "feature importance" or "feature selection" ability? I can't seem to pinpoint it. On Tue, Aug 9, 2016 at 8:49 AM, Peyman Mohajerian wrote: > You can try 'feature Importances' or 'feature selection' depending on what > else

Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Peyman Mohajerian
You can try 'featureImportances' or feature selection, depending on what else you want to do with the remaining features. Let's say you are trying to do classification: some of the Spark libraries then have a model parameter called 'featureImportances' that tells you which
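Spark's tree-ensemble models (e.g. RandomForestClassifier in spark.ml) do expose a featureImportances vector based on impurity decrease. As a library-free stand-in for the general idea of ranking factors by relevance to a label, here is a NumPy sketch using a simple univariate screen (absolute correlation with the label); this is not how Spark computes importances, just an illustration of ranking.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 4))
# The label depends only on factors 0 and 2 (factor 2 twice as strongly).
y = (X[:, 0] + 2 * X[:, 2] > 0).astype(float)

# Cheap univariate screen: score each factor by |corr(feature, label)|.
# (Spark's tree featureImportances measure impurity decrease instead.)
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]
print(ranking)  # informative factors 2 and 0 come first
```

A caveat worth keeping in mind: univariate screens miss interactions between factors, which is one reason model-based importances are usually preferred.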

Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Rohit Chaddha
I would rather have fewer features, to make better inferences on the data based on the smaller number of factors. Any suggestions, Sean? On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen wrote: > Yes, that's exactly what PCA is for, as Sivakumaran noted. Do you > really want to select

Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Tony Lane
There must be an algorithmic way to figure out which of these factors contribute the least and remove them from the analysis. I am hoping someone can throw some light on this. On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S wrote: > Not an expert here, but the first step

Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Sivakumaran S
Not an expert here, but the first step would be to devote some time and identify which of these 112 factors are actually causative. Some domain knowledge of the data may be required. Then you can start off with PCA. HTH, Regards, Sivakumaran S > On 08-Aug-2016, at 3:01 PM, Tony Lane
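For reference, PCA itself is only a few lines once the data is centred. This NumPy sketch (via SVD, standing in for Spark's mllib/ml PCA) builds 5 observed factors out of just 2 underlying sources, so the first two principal components capture essentially all of the variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
s = rng.normal(size=(n, 2))              # 2 genuine underlying sources
# 5 observed factors: three of them are (noisy) combinations of the sources,
# i.e. largely redundant.
X = np.column_stack([
    s[:, 0],
    s[:, 1],
    s[:, 0] + 0.01 * rng.normal(size=n),
    s[:, 1] + 0.01 * rng.normal(size=n),
    s[:, 0] - s[:, 1] + 0.01 * rng.normal(size=n),
])

Xc = X - X.mean(axis=0)                  # PCA requires centred data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / (S**2).sum()          # variance ratio per component
print(np.round(explained, 3))            # first 2 components carry ~all of it

k = 2
X_reduced = Xc @ Vt[:k].T                # project onto the top-k components
```

The rows of Vt are the loadings, which is also where you would look to see how much each original factor contributes to each component.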

Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Tony Lane
Great question, Rohit. I am in my early days of ML as well and it would be great if we get some ideas on this from other experts in this group. I know we can reduce dimensions by using PCA, but I think that does not allow us to understand which of the original factors we are using in the end. -

Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Rohit Chaddha
I have a data set where each data point has 112 factors. I want to remove the factors which are not relevant, say reduce from these 112 to 20 factors, and then cluster the data points using these 20 factors. How do I do this, and how do I figure out which of the 20 factors are useful
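One simple, interpretable answer to "remove the redundant factors" (as opposed to PCA, which mixes factors together) is a correlation filter: keep a factor only if it is not highly correlated with one already kept, then cluster on the survivors. This NumPy sketch is an illustration of that idea, not a Spark API; the 0.95 threshold is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
base = rng.normal(size=(n, 3))           # 3 genuinely distinct factors
# Append 2 redundant factors: near-copies of columns 0 and 1.
X = np.column_stack([
    base,
    2 * base[:, 0] + 0.01 * rng.normal(size=n),
    base[:, 1] + 0.01 * rng.normal(size=n),
])

def drop_redundant(X, threshold=0.95):
    # Greedy filter: keep a column only if its absolute correlation with
    # every already-kept column is below the threshold.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

kept = drop_redundant(X)
print(kept)  # the redundant columns 3 and 4 are dropped
X_selected = X[:, kept]  # this is what you would feed to KMeans
```

Because the kept columns are original factors, the resulting clusters remain explainable in terms of the original 112 quality measures, which was the concern raised about PCA earlier in the thread.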