Scaling can mean scaling factors up or down so that they're all on a
comparable scale. It certainly changes the sum of squared errors, but
you can't compare this metric across scaled and unscaled data, exactly
because one is on a totally different scale and will have quite
different absolute values.
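The point about SSE not being comparable can be seen with a small sketch (pure Python, toy numbers of my own, not the poster's data): after standardizing, the SSE around the mean of a column is always n, no matter what scale the raw data was on.

```python
# Toy illustration: SSE around the mean for a raw vs. a standardized
# column lives on entirely different scales, so the two numbers are
# not comparable.
def sse(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col)

def standardize(col):
    m = sum(col) / len(col)
    sd = (sum((x - m) ** 2 for x in col) / len(col)) ** 0.5
    return [(x - m) / sd for x in col]

raw = [1.0, 2.0, 3.0, 4.0]          # e.g. one quality factor
scaled = standardize(raw)

print(sse(raw))                      # 5.0
print(round(sse(scaled), 6))         # 4.0 -- always n for a standardized column
```

Multiply `raw` by 1000 and the first SSE explodes while the second stays at 4.0, which is exactly why the metric can't be compared across scaled and unscaled runs.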
Hi Sean,
So basically I am trying to cluster a number of elements (it's a domain
object called PItem) based on the quality factors of these items.
These elements have 112 quality factors each.
Now the issue is that when I scale the factors using StandardScaler I
get a Sum of Squared Errors
Fewer features don't necessarily mean better predictions, because you
are, after all, discarding data. It might help, because when done well you
discard more noise than signal. It is usually done to make data sets
smaller, more tractable, or to improve explainability.
But you have an unsupervised
Another approach is to use L1 regularisation, e.g.
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression.
This adds a penalty term to the regression equation to reduce model
complexity. When you use L1 (as opposed to, say, L2) this tends to drive
some coefficients to exactly zero, which effectively selects features.
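A minimal sketch (pure Python, names my own, not Spark's API) of why L1 behaves differently from L2: the soft-thresholding update at the heart of lasso coordinate descent snaps small coefficients to exactly zero, which is what makes L1 act as feature selection rather than mere shrinkage.

```python
# Soft-thresholding: the per-coefficient update used by L1/lasso
# coordinate descent. Coefficients whose unpenalized value falls
# within [-lam, lam] become exactly 0.0; larger ones are shrunk by lam.
def soft_threshold(w, lam):
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

weights = [0.9, -0.05, 0.3, 0.02]
penalized = [soft_threshold(w, 0.1) for w in weights]
print(penalized)  # the small weights become exactly 0.0
```

L2, by contrast, only scales coefficients down, so they stay nonzero and no features are actually dropped.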
@Peyman - do any of the clustering algorithms have a "feature importance"
or "feature selection" ability? I can't seem to pinpoint one.
On Tue, Aug 9, 2016 at 8:49 AM, Peyman Mohajerian wrote:
> You can try 'feature Importances' or 'feature selection' depending on what
> else
You can try 'featureImportances' or 'feature selection', depending on what
else you want to do with the remaining features. For example, if you are
trying to do classification, then some of the Spark libraries have a model
attribute called 'featureImportances' that tells you which
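Spark's tree-ensemble models (e.g. RandomForestClassificationModel) expose `featureImportances` as a vector of scores, one per feature. Given such a vector, the selection step itself is simple; here is a sketch in plain Python (function name and scores are my own, for illustration):

```python
# Given an importance score per feature (e.g. from a tree model's
# featureImportances vector), return the indices of the k highest-scoring
# features, in ascending index order.
def top_k_features(importances, k):
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])

importances = [0.02, 0.40, 0.01, 0.35, 0.22]  # made-up scores
print(top_k_features(importances, 3))  # [1, 3, 4]
```

With 112 factors you would keep, say, the top 20 indices and re-run the clustering on those columns only.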
I would rather have fewer features, to make better inferences on the data
based on the smaller number of factors.
Any suggestions Sean ?
On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen wrote:
> Yes, that's exactly what PCA is for as Sivakumaran noted. Do you
> really want to select
There must be an algorithmic way to figure out which of these factors
contribute the least and remove them from the analysis.
I am hoping someone can throw some insight on this.
On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S wrote:
> Not an expert here, but the first step
Not an expert here, but the first step would be to devote some time to
identifying which of these 112 factors are actually causative. Some domain
knowledge of the data may be required. Then you can start off with PCA.
HTH,
Regards,
Sivakumaran S
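The PCA step can be sketched minimally in 2-D (pure Python with toy numbers of my own; in practice you would use a library such as scikit-learn or Spark ML): center the data, form the covariance matrix, and take its leading eigenvector as the first principal component.

```python
import math

# Minimal 2-D PCA sketch: the first principal component is the top
# eigenvector of the covariance matrix of the centered data.
def first_pc(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    a = sum(x * x for x, _ in centered) / n   # var(x)
    b = sum(x * y for x, y in centered) / n   # cov(x, y)
    c = sum(y * y for _, y in centered) / n   # var(y)
    # Top eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]].
    lam = (a + c + math.sqrt((a - c) ** 2 + 4 * b * b)) / 2
    # (b, lam - a) is a corresponding eigenvector (unless b is ~0).
    vx, vy = (b, lam - a) if abs(b) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)

# Points lying near the line y = 2x: the first PC should point along it.
pts = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
pc = first_pc(pts)
print(pc)  # a unit vector close to (1, 2) normalized
```

As noted earlier in the thread, though, the resulting components are linear mixes of all 112 factors, so PCA reduces dimensions without telling you which original factors to keep.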
> On 08-Aug-2016, at 3:01 PM, Tony Lane
Great question, Rohit. I am in my early days of ML as well, and it would be
great to get some ideas on this from other experts in this group.
I know we can reduce dimensions by using PCA, but I think that does not
let us understand which of the original factors we are using in the
end.
-
I have a data set where each data point has 112 factors.
I want to remove the factors that are not relevant, say reducing from 112
down to 20 factors, and then cluster the data points using those 20
factors.
How do I do this, and how do I figure out which 20 factors are
useful?
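One simple unsupervised first cut (my own sketch, not something proposed in the thread): rank the factors by variance on comparably scaled data and keep the k most variable ones before clustering, since a near-constant factor cannot separate clusters.

```python
# Keep the k highest-variance columns of a row-major data set.
# Returns the reduced rows plus the kept column indices.
def variances(rows):
    n = len(rows)
    out = []
    for col in zip(*rows):
        m = sum(col) / n
        out.append(sum((x - m) ** 2 for x in col) / n)
    return out

def keep_top_k(rows, k):
    v = variances(rows)
    keep = sorted(sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k])
    return [[row[i] for i in keep] for row in rows], keep

rows = [[1.0, 5.0, 0.0],
        [2.0, 5.0, 10.0],
        [3.0, 5.0, 0.0]]
reduced, kept = keep_top_k(rows, 2)
print(kept)  # [0, 2] -- the constant middle column is dropped
```

Unlike PCA, this keeps original, interpretable factors, which matches the goal of understanding which of the 112 survive.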