Another approach is to use L1 regularisation, e.g.
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression.
This adds a penalty term to the regression objective to reduce model
complexity. Using L1 (as opposed to, say, L2) tends to promote sparsity
in the coefficients, i.e. some of the coefficients are pushed to zero,
effectively deselecting those features from the model.
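
For example, with the DataFrame-based API a rough (untested) sketch could look
like this, assuming a DataFrame 'training' with 'features' and 'label' columns:

    import org.apache.spark.ml.regression.LinearRegression

    // 'training' is assumed to be a DataFrame with "features" and "label" columns.
    // elasticNetParam = 1.0 gives a pure L1 penalty (Lasso); 0.0 would be ridge (L2).
    val lasso = new LinearRegression()
      .setElasticNetParam(1.0)
      .setRegParam(0.1)            // penalty strength; tune via cross-validation
    val model = lasso.fit(training)
    // coefficients driven to exactly 0.0 correspond to deselected features
    println(model.coefficients)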


> On 9 Aug 2016, at 04:19, Peyman Mohajerian <mohaj...@gmail.com> wrote:
> 
> You can try 'feature importances' or 'feature selection'; depending on what
> else you want to do with the remaining features, either is a possibility. Let's
> say you are trying to do classification: some of the Spark models have
> an attribute called 'featureImportances' that tells you which features
> are most dominant in your classification, and you can then run your model again
> with the smaller set of features.
> The two approaches are quite different: what I'm suggesting involves training
> (supervised learning) in the context of a target function, whereas with SVD you
> are doing unsupervised learning.
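> 
> For the featureImportances route, a rough (untested) sketch with a random
> forest could look something like this, assuming a DataFrame 'data' with
> 'features' and 'label' columns:
> 
>     import org.apache.spark.ml.classification.RandomForestClassifier
> 
>     // 'data' is assumed to be a DataFrame with "features" and "label" columns.
>     val rf = new RandomForestClassifier()
>       .setLabelCol("label")
>       .setFeaturesCol("features")
>     val model = rf.fit(data)
>     // one importance score per feature; keep the highest-scoring ones
>     println(model.featureImportances)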
> 
>> On Mon, Aug 8, 2016 at 7:23 PM, Rohit Chaddha <rohitchaddha1...@gmail.com> 
>> wrote:
>> I would rather have fewer features, to make better inferences on the data
>> based on the smaller number of factors.
>> Any suggestions, Sean?
>> 
>>> On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen <so...@cloudera.com> wrote:
>>> Yes, that's exactly what PCA is for, as Sivakumaran noted. Do you
>>> really want to select features or just obtain a lower-dimensional
>>> representation of them, with less redundancy?
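>>> 
>>> For reference, a minimal PCA sketch (untested) in the DataFrame API looks
>>> roughly like this; note that the k output columns are linear combinations
>>> of the original 112 features, not a subset of them:
>>> 
>>>     import org.apache.spark.ml.feature.PCA
>>> 
>>>     // 'data' is assumed to be a DataFrame with a "features" column.
>>>     val pca = new PCA()
>>>       .setInputCol("features")
>>>       .setOutputCol("pcaFeatures")
>>>       .setK(20)                  // keep 20 components
>>>     val reduced = pca.fit(data).transform(data)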
>>> 
>>> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane <tonylane....@gmail.com> wrote:
>>> > There must be an algorithmic way to figure out which of these factors
>>> > contribute the least and remove them from the analysis.
>>> > I am hoping someone can shed some light on this.
>>> >
>>> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S <siva.kuma...@me.com> wrote:
>>> >>
>>> >> Not an expert here, but the first step would be to devote some time to
>>> >> identifying which of these 112 factors are actually causative. Some domain
>>> >> knowledge of the data may be required. Then, you can start off with PCA.
>>> >>
>>> >> HTH,
>>> >>
>>> >> Regards,
>>> >>
>>> >> Sivakumaran S
>>> >>
>>> >> On 08-Aug-2016, at 3:01 PM, Tony Lane <tonylane....@gmail.com> wrote:
>>> >>
>>> >> Great question, Rohit. I am in my early days of ML as well, and it would
>>> >> be great if we could get some ideas on this from other experts in this
>>> >> group.
>>> >>
>>> >> I know we can reduce dimensions by using PCA, but I think that does not
>>> >> allow us to understand which of the original factors we are using in the
>>> >> end.
>>> >>
>>> >> - Tony L.
>>> >>
>>> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha 
>>> >> <rohitchaddha1...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>>
>>> >>> I have a data-set where each data-point has 112 factors.
>>> >>>
>>> >>> I want to remove the factors which are not relevant: say, reduce from
>>> >>> these 112 down to 20 factors, and then do clustering of the data-points
>>> >>> using those 20 factors.
>>> >>>
>>> >>> How do I do this, and how do I figure out which 20 factors are useful
>>> >>> for the analysis?
>>> >>>
>>> >>> I see SVD and PCA implementations, but I am not sure whether these tell
>>> >>> me which elements are removed and which remain.
>>> >>>
>>> >>> Can someone please help me understand what to do here?
>>> >>>
>>> >>> Thanks,
>>> >>> -Rohit
>>> >>>
>>> >>
>>> >>
>>> >
> 
