Hi Sean,

So basically I am trying to cluster a number of elements (it's a domain object called PItem) based on the quality factors of these items. Each of these elements has 112 quality factors.
Now the issue is that when I scale the factors using StandardScaler I get a Sum of Squared Errors = 13300, whereas when I don't use scaling the Sum of Squared Errors = 5.

I was always of the opinion that factors on different scales should always be normalized, but I am confused by the results above, and I am wondering which factors should be removed to get a meaningful result (maybe with 5% less accuracy).

Will appreciate any help here.

-Rohit

On Tue, Aug 9, 2016 at 12:55 PM, Sean Owen <so...@cloudera.com> wrote:

> Fewer features doesn't necessarily mean better predictions, because indeed
> you are subtracting data. It might, because when done well you subtract
> more noise than signal. It is usually done to make data sets smaller or
> more tractable, or to improve explainability.
>
> But you have an unsupervised clustering problem, where talking about
> feature importance doesn't make as much sense. Important to what? There is
> no target variable.
>
> PCA will not 'improve' clustering per se, but can make it faster.
> You may want to specify what you are actually trying to optimize.
>
> On Tue, Aug 9, 2016, 03:23 Rohit Chaddha <rohitchaddha1...@gmail.com> wrote:
>
>> I would rather have fewer features, to make better inferences on the data
>> based on the smaller number of factors.
>> Any suggestions, Sean?
>>
>> On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Yes, that's exactly what PCA is for, as Sivakumaran noted. Do you
>>> really want to select features, or just obtain a lower-dimensional
>>> representation of them, with less redundancy?
>>>
>>> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane <tonylane....@gmail.com> wrote:
>>>
>>> > There must be an algorithmic way to figure out which of these factors
>>> > contribute the least and remove them from the analysis.
>>> > I am hoping someone can throw some insight on this.
>>> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S <siva.kuma...@me.com> wrote:
>>> >
>>> >> Not an expert here, but the first step would be to devote some time and
>>> >> identify which of these 112 factors are actually causative. Some domain
>>> >> knowledge of the data may be required. Then, you can start off with PCA.
>>> >>
>>> >> HTH,
>>> >>
>>> >> Regards,
>>> >>
>>> >> Sivakumaran S
>>> >>
>>> >> On 08-Aug-2016, at 3:01 PM, Tony Lane <tonylane....@gmail.com> wrote:
>>> >>
>>> >> Great question, Rohit. I am in my early days of ML as well, and it would be
>>> >> great if we get some ideas on this from other experts in this group.
>>> >>
>>> >> I know we can reduce dimensions by using PCA, but I think that does not
>>> >> allow us to understand which factors from the original we are using in the
>>> >> end.
>>> >>
>>> >> - Tony L.
>>> >>
>>> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha <rohitchaddha1...@gmail.com> wrote:
>>> >>>
>>> >>> I have a data set where each data point has 112 factors.
>>> >>>
>>> >>> I want to remove the factors which are not relevant, say reduce from
>>> >>> these 112 to 20 factors, and then cluster the data points using these
>>> >>> 20 factors.
>>> >>>
>>> >>> How do I do this, and how do I figure out which of the 20 factors are
>>> >>> useful for analysis?
>>> >>>
>>> >>> I see SVD and PCA implementations, but I am not sure if these tell you
>>> >>> which elements are removed and which remain.
>>> >>>
>>> >>> Can someone please help me understand what to do here.
>>> >>>
>>> >>> thanks,
>>> >>> -Rohit
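The confusion about SSE = 13300 vs. SSE = 5 at the top of the thread comes down to units: k-means SSE is a sum of squared distances in whatever units the features are in, so scaling the features rescales the SSE, and the two numbers are not comparable. A minimal sketch of this effect (using scikit-learn rather than Spark MLlib, purely for illustration; the data, feature count, and k are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 200 points with two features on wildly different scales: [0, 1] and [0, 1000]
X = np.column_stack([rng.random(200), rng.random(200) * 1000])

# SSE (inertia) on the raw data is dominated by the large-scale feature
raw_sse = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).inertia_

# After StandardScaler every feature has unit variance, so the SSE lives
# on a completely different (and not comparable) scale
X_scaled = StandardScaler().fit_transform(X)
scaled_sse = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled).inertia_

print(raw_sse, scaled_sse)
```

A tiny raw-data SSE can simply mean one small-scale feature dominates nothing and one large-scale feature dominates everything; it does not mean the unscaled clustering is better. To compare clusterings across preprocessing choices, a scale-free criterion (e.g. a silhouette score) is the usual tool.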
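On Tony's point that PCA "does not allow us to understand which factors from the original we are using": PCA indeed returns combinations of factors, not a subset, but the component loadings show which original factors dominate each retained component, which is as close as PCA gets to naming factors. A hedged sketch with synthetic data (scikit-learn; sizes and the injected correlation are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))                    # stand-in for the 112 factors
X[:, 0] = 5 * X[:, 1] + rng.normal(size=300)      # make factors 0 and 1 strongly correlated

pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_)              # variance captured per component

# For each component, list the original factors with the largest |loading|
for i, comp in enumerate(pca.components_):
    top = np.argsort(np.abs(comp))[::-1][:3]
    print(f"component {i}: top original factors {top.tolist()}")
```

Here the first component loads mostly on the correlated pair (factors 0 and 1), so reading the loadings recovers some interpretability even though the components themselves are mixtures.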
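If the goal really is to keep a named subset of the original factors without a target variable, one simple algorithmic baseline (not mentioned in the thread; thresholds, data, and library choice here are all illustrative assumptions) is to drop near-constant factors and then drop one of each highly correlated pair, tracking the surviving column indices throughout:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))                     # stand-in for the 112 factors
X[:, 3] = 0.001 * rng.normal(size=300)            # a near-constant factor
X[:, 5] = X[:, 2] + 0.01 * rng.normal(size=300)   # a redundant (correlated) factor

# 1) Drop near-constant columns, remembering which original indices survive
vt = VarianceThreshold(threshold=1e-3)
X_v = vt.fit_transform(X)
kept = list(np.flatnonzero(vt.get_support()))

# 2) Among survivors, drop the later column of each pair with |corr| > 0.95
corr = np.corrcoef(StandardScaler().fit_transform(X_v), rowvar=False)
drop = set()
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if j not in drop and abs(corr[i, j]) > 0.95:
            drop.add(j)
kept = [k for idx, k in enumerate(kept) if idx not in drop]
print("surviving original factor indices:", kept)
```

Unlike PCA, this keeps actual original factors, so the result is directly interpretable; the trade-off is that it only removes obviously uninformative or redundant factors, not factors that are merely unhelpful for the clustering objective.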