Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
Great, thanks both of you. I was struggling with this issue as well.

-Rohit

On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar wrote:
> Thanks Nick. I also ran into this issue.
> VG, one workaround is to drop the NaN rows from the predictions
> (df.na.drop()) and then use that dataset for the evaluator. In real life,
> probably detect the NaN and recommend the most popular items over some
> window.
> HTH.
> Cheers
>
> On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath wrote:
>> It seems likely that you're running into
>> https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the
>> test dataset in the train/test split contains users or items that were
>> not in the training set. Hence the model doesn't have computed factors
>> for those ids, and ALS 'transform' currently returns NaN for those ids.
>> This in turn results in NaN for the evaluator result.
>>
>> I have a PR open on that issue that will hopefully address this soon.
>>
>> On Sun, 24 Jul 2016 at 17:49 VG wrote:
>>> Ping. Does anyone have some suggestions/advice for me?
>>> It will be really helpful.
>>>
>>> VG
>>>
>>> On Sun, Jul 24, 2016 at 12:19 AM, VG wrote:
>>>> Sean, I did this just to test the model. When I split my data into
>>>> 80% training and 20% test I get a root-mean-square error = NaN, so I
>>>> am wondering where I might be going wrong.
>>>>
>>>> Regards,
>>>> VG
>>>>
>>>> On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen wrote:
>>>>> No, that's certainly not to be expected. ALS works by computing a
>>>>> much lower-rank representation of the input. It would not reproduce
>>>>> the input exactly, and you don't want it to -- this would be
>>>>> seriously overfit. This is why in general you don't evaluate a model
>>>>> on the training set.
>>>>>
>>>>> On Sat, Jul 23, 2016 at 7:37 PM, VG wrote:
>>>>>> I am trying to run ml.ALS to compute some recommendations.
>>>>>>
>>>>>> Just to test, I am using the same dataset for training the ALSModel
>>>>>> and for predicting the results based on the model.
>>>>>>
>>>>>> When I evaluate the result using RegressionEvaluator I get a
>>>>>> root-mean-square error = 1.5544064263236066
>>>>>>
>>>>>> I think this should be 0. Any suggestions on what might be going
>>>>>> wrong?
>>>>>>
>>>>>> Regards,
>>>>>> Vipul
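Krishna's workaround can be sketched in Java against the 2.0 ml API. This is only an illustration: the column names "rating" and "prediction" and the surrounding class are assumptions, not code from the thread.

```java
import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class AlsNanWorkaround {
    // predictions = alsModel.transform(testData)
    public static double rmseIgnoringNaN(Dataset<Row> predictions) {
        // ALS 'transform' returns NaN for users/items that were not in the
        // training set (SPARK-14489); drop those rows before evaluating,
        // otherwise the evaluator itself returns NaN.
        Dataset<Row> cleaned = predictions.na().drop(new String[] {"prediction"});

        RegressionEvaluator evaluator = new RegressionEvaluator()
                .setMetricName("rmse")
                .setLabelCol("rating")
                .setPredictionCol("prediction");
        return evaluator.evaluate(cleaned);
    }
}
```

Dropping rows does skew the metric (the hardest cases are removed), which is why Krishna suggests a fallback recommendation for those users in production.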
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
Hi Krishna,

Great .. I had no idea about this. I tried your suggestion using na.drop() and got an rmse = 1.5794048211812495.
Any suggestions on how this can be reduced and the model improved?

Regards,
Rohit

On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar wrote:
> Thanks Nick. I also ran into this issue.
> VG, one workaround is to drop the NaN rows from the predictions
> (df.na.drop()) and then use that dataset for the evaluator. In real life,
> probably detect the NaN and recommend the most popular items over some
> window.
> HTH.
> Cheers
Is RowMatrix missing in org.apache.spark.ml package?
It is present in mllib but I don't seem to find it in the ml package. Any suggestions please?

-Rohit
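As of 2.0, RowMatrix exists only in the RDD-based API (org.apache.spark.mllib.linalg.distributed); there is no DataFrame-based equivalent in org.apache.spark.ml. A minimal Java sketch, assuming an existing JavaSparkContext `jsc` (the vector values are made up for illustration):

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

// Build an RDD of mllib Vectors; each Vector is one row of the matrix.
JavaRDD<Vector> rows = jsc.parallelize(Arrays.asList(
        Vectors.dense(1.0, 2.0, 3.0),
        Vectors.dense(4.0, 5.0, 6.0)));

// RowMatrix wraps a Scala RDD, hence the .rdd() conversion.
RowMatrix mat = new RowMatrix(rows.rdd());
long m = mat.numRows();  // 2
long n = mat.numCols();  // 3
```

If your data lives in a Dataset, you would need to convert it back to an RDD of mllib Vectors first to use RowMatrix.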
Spark 2.0 -- spark warehouse relative path in absolute URI error
I upgraded from 2.0.0-preview to 2.0.0 and I started getting the following error:

Caused by: java.net.URISyntaxException: Relative path in absolute URI:
file:C:/ibm/spark-warehouse

Any ideas how to fix this?

-Rohit
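One workaround discussed on the related JIRAs is to set spark.sql.warehouse.dir explicitly to a well-formed file URI before the first SparkSession is created, instead of letting Spark derive it from the working directory. A hedged sketch of this configuration (the path and app name are placeholders; I have not verified it on every Windows setup):

```java
import org.apache.spark.sql.SparkSession;

SparkSession session = SparkSession.builder()
        .appName("example")
        .master("local[*]")
        // Explicit absolute file URI avoids the malformed default
        // "file:C:/..." that triggers the URISyntaxException on Windows.
        .config("spark.sql.warehouse.dir", "file:///C:/ibm/spark-warehouse")
        .getOrCreate();
```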
ClassTag variable in broadcast in spark 2.0 ? how to use
In Spark 2.0 there is an additional parameter of type ClassTag in the broadcast method of SparkContext.

What is this variable, and how do I broadcast now?

Here is my existing code with 2.0.0-preview:
Broadcast> b = jsc.broadcast(u.collectAsMap());

What changes need to be made in 2.0 for this?
Broadcast> b = jsc.broadcast(u.collectAsMap(), *??* );

Please help

Rohit
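For reference, the ClassTag overload belongs to the Scala SparkContext; JavaSparkContext.broadcast does not take one, which is why the preview code worked. If you do need to call the Scala method from Java, a ClassTag can be built via ClassTag$.MODULE$. A sketch, assuming `sc` is a SparkContext and `u` a JavaPairRDD (the element types here are illustrative):

```java
import java.util.Map;

import org.apache.spark.SparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

// Build a ClassTag from the runtime class (unchecked with respect to
// generics, but sufficient for broadcast):
ClassTag<Map<Integer, Double>> tag = ClassTag$.MODULE$.apply(Map.class);
Broadcast<Map<Integer, Double>> b = sc.broadcast(u.collectAsMap(), tag);
```

In practice, sticking with JavaSparkContext (as the follow-up in this thread concludes) is the simpler option.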
Re: ClassTag variable in broadcast in spark 2.0 ? how to use
My bad. Please ignore this question. I had accidentally reverted to SparkContext, which was causing the issue.

On Thu, Jul 28, 2016 at 11:36 PM, Rohit Chaddha wrote:
> In Spark 2.0 there is an additional parameter of type ClassTag in the
> broadcast method of SparkContext.
>
> What is this variable, and how do I broadcast now?
>
> Here is my existing code with 2.0.0-preview:
> Broadcast> b = jsc.broadcast(u.collectAsMap());
>
> What changes need to be made in 2.0 for this?
> Broadcast> b = jsc.broadcast(u.collectAsMap(), *??* );
>
> Please help
>
> Rohit
Re: Spark 2.0 -- spark warehouse relative path in absolute URI error
Hello Sean,

I have tried both file:/ and file:///
But it does not work and gives the same error.

-Rohit

On Thu, Jul 28, 2016 at 11:51 PM, Sean Owen wrote:
> IIRC that was fixed, in that this is actually an invalid URI. Use
> file:/C:/... I think.
>
> On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha wrote:
> > I upgraded from 2.0.0-preview to 2.0.0 and I started getting the
> > following error:
> >
> > Caused by: java.net.URISyntaxException: Relative path in absolute URI:
> > file:C:/ibm/spark-warehouse
> >
> > Any ideas how to fix this?
> >
> > -Rohit
Re: Spark 2.0 -- spark warehouse relative path in absolute URI error
I am simply trying to do
session.read().json("file:///C:/data/a.json");

In 2.0.0-preview it was working fine with
sqlContext.read().json("C:/data/a.json");

-Rohit

On Fri, Jul 29, 2016 at 12:03 AM, Sean Owen wrote:
> Hm, file:///C:/... doesn't work? That should certainly be an absolute
> URI with an absolute path. What exactly is your input value for this
> property?
>
> On Thu, Jul 28, 2016 at 11:28 AM, Rohit Chaddha wrote:
> > Hello Sean,
> >
> > I have tried both file:/ and file:///
> > But it does not work and gives the same error.
> >
> > -Rohit
Re: Spark 2.0 -- spark warehouse relative path in absolute URI error
Sean,

I saw some JIRA tickets, and it looks like this is still an open bug (rather than an improvement, as marked in JIRA):

https://issues.apache.org/jira/browse/SPARK-15893
https://issues.apache.org/jira/browse/SPARK-15899

I am experimenting, but do you know of any solution off the top of your head?

On Fri, Jul 29, 2016 at 12:06 AM, Rohit Chaddha wrote:
> I am simply trying to do
> session.read().json("file:///C:/data/a.json");
>
> In 2.0.0-preview it was working fine with
> sqlContext.read().json("C:/data/a.json");
>
> -Rohit
Re: Spark 2.0 -- spark warehouse relative path in absolute URI error
After looking at the comments - I am not sure what the proposed fix is?

On Fri, Jul 29, 2016 at 12:47 AM, Sean Owen wrote:
> Ah, right. This wasn't actually resolved. Yeah, your input on 15899
> would be welcome. See if the proposed fix helps.
>
> On Thu, Jul 28, 2016 at 11:52 AM, Rohit Chaddha wrote:
> > Sean,
> >
> > I saw some JIRA tickets, and it looks like this is still an open bug
> > (rather than an improvement, as marked in JIRA):
> >
> > https://issues.apache.org/jira/browse/SPARK-15893
> > https://issues.apache.org/jira/browse/SPARK-15899
> >
> > I am experimenting, but do you know of any solution off the top of
> > your head?
calling dataset.show on a custom object - displays toString() value as first column and blank for rest
I have a custom object called A and a corresponding Dataset. When I call the datasetA.show() method I get the following:

+---+---+----+------+---+
| id| da|like|values|uid|
+---+---+----+------+---+
|A.toString()...|
|A.toString()...|
|A.toString()...|
|A.toString()...|
|A.toString()...|
|A.toString()...|

That is, A.toString() is called and displayed as the value of the first column, and all the remaining columns are blank.

Any suggestions on what should be done to fix this?

- Rohit
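One common cause (an assumption here, since the message doesn't show how the Dataset was built) is creating the Dataset with a Kryo or Java-serialization encoder, which stores the whole object opaquely rather than as columns. If A is a Java bean (no-arg constructor plus getters/setters), Encoders.bean lets Spark map each property to its own column:

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

// Hypothetical: A is a bean with id, da, like, values, uid properties,
// and session is an existing SparkSession.
Dataset<A> datasetA = session.createDataset(
        Arrays.asList(new A(), new A()),
        Encoders.bean(A.class));

datasetA.show();  // each bean property should now appear as its own column
```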
build error - failing tests - error while building spark 2.0 trunk from github
--- T E S T S ---
Running org.apache.spark.api.java.OptionalSuite
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.052 sec - in org.apache.spark.api.java.OptionalSuite
Running org.apache.spark.JavaAPISuite
Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 23.537 sec <<< FAILURE! - in org.apache.spark.JavaAPISuite
wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.331 sec  <<< FAILURE!
java.lang.AssertionError: expected: but was:
    at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1087)
Running org.apache.spark.JavaJdbcRDDSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.799 sec - in org.apache.spark.JavaJdbcRDDSuite
Running org.apache.spark.launcher.SparkLauncherSuite
Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.04 sec <<< FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite
testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite)  Time elapsed: 0.03 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
    at org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:110)
Running org.apache.spark.memory.TaskMemoryManagerSuite
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec - in org.apache.spark.memory.TaskMemoryManagerSuite
Running org.apache.spark.shuffle.sort.PackedRecordPointerSuite
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec - in org.apache.spark.shuffle.sort.PackedRecordPointerSuite
Running org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.103 sec - in org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
Running org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.199 sec - in org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
Running org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.67 sec - in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.97 sec - in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.583 sec - in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.533 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.606 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0

Results :

Failed tests:
  JavaAPISuite.wholeTextFiles:1087 expected: but was:
  SparkLauncherSuite.testChildProcLauncher:110 expected:<0> but was:<1>

Tests run: 189, Failures: 2, Errors: 0, Skipped: 0
Calling KmeansModel predict method
The predict method takes a Vector object. I am unable to figure out how to construct this Spark Vector object to get predictions from my model.

Does anyone have some code in Java for this?

Thanks
Rohit
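In the RDD-based API, the vector is built with the org.apache.spark.mllib.linalg.Vectors factory. A minimal sketch (the values and the already-trained `model` variable are assumptions):

```java
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Dense vector: order and length must match the features the model
// was trained on.
Vector point = Vectors.dense(1.0, 0.5, 3.2);
int cluster = model.predict(point);  // index of the nearest cluster center

// Sparse alternative: total size, then indices and values of the non-zeros.
Vector sparsePoint = Vectors.sparse(112, new int[] {0, 7},
        new double[] {1.0, 2.5});
```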
Machine learning question (using spark) - removing redundant factors while doing clustering
I have a data-set where each data-point has 112 factors.

I want to remove the factors which are not relevant, and say reduce to 20 factors out of these 112, and then do clustering of data-points using these 20 factors.

How do I do this, and how do I figure out which of the 20 factors are useful for analysis?

I see SVD and PCA implementations, but I am not sure if these tell you which elements are removed and which remain.

Can someone please help me understand what to do here?

thanks,
-Rohit
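For concreteness, a PCA sketch in the DataFrame-based API (Java). The column names and k are assumptions; note that PCA does not *select* 20 of the 112 original factors - each output component is a linear combination of all 112, which is exactly the distinction the replies below discuss:

```java
import org.apache.spark.ml.feature.PCA;
import org.apache.spark.ml.feature.PCAModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// data is assumed to have a Vector column "features" with 112 entries.
PCAModel pca = new PCA()
        .setInputCol("features")
        .setOutputCol("pcaFeatures")
        .setK(20)                 // keep 20 principal components
        .fit(data);

Dataset<Row> reduced = pca.transform(data);  // adds the "pcaFeatures" column
```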
Re: Machine learning question (using spark) - removing redundant factors while doing clustering
I would rather have fewer features, to make better inferences on the data based on the smaller number of factors.
Any suggestions, Sean?

On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen wrote:
> Yes, that's exactly what PCA is for, as Sivakumaran noted. Do you
> really want to select features, or just obtain a lower-dimensional
> representation of them, with less redundancy?
>
> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane wrote:
> > There must be an algorithmic way to figure out which of these factors
> > contribute the least and remove them from the analysis.
> > I am hoping someone can throw some insight on this.
> >
> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S wrote:
> >> Not an expert here, but the first step would be to devote some time
> >> and identify which of these 112 factors are actually causative. Some
> >> domain knowledge of the data may be required. Then, you can start off
> >> with PCA.
> >>
> >> HTH,
> >>
> >> Regards,
> >>
> >> Sivakumaran S
> >>
> >> On 08-Aug-2016, at 3:01 PM, Tony Lane wrote:
> >>
> >> Great question Rohit. I am in my early days of ML as well and it
> >> would be great if we get some idea on this from other experts in this
> >> group.
> >>
> >> I know we can reduce dimensions by using PCA, but I think that does
> >> not allow us to understand which factors from the original we are
> >> using in the end.
> >>
> >> - Tony L.
> >>
> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha wrote:
> >>> I have a data-set where each data-point has 112 factors.
> >>>
> >>> I want to remove the factors which are not relevant, and say reduce
> >>> to 20 factors out of these 112, and then do clustering of
> >>> data-points using these 20 factors.
> >>>
> >>> How do I do this, and how do I figure out which of the 20 factors
> >>> are useful for analysis?
> >>>
> >>> I see SVD and PCA implementations, but I am not sure if these tell
> >>> you which elements are removed and which remain.
> >>>
> >>> Can someone please help me understand what to do here?
> >>>
> >>> thanks,
> >>> -Rohit
Re: Machine learning question (using spark) - removing redundant factors while doing clustering
@Peyman - do any of the clustering algorithms have "feature importance" or "feature selection" ability? I can't seem to pinpoint one.

On Tue, Aug 9, 2016 at 8:49 AM, Peyman Mohajerian wrote:
> You can try 'feature importances' or 'feature selection', depending on
> what else you want to do with the remaining features. Let's say you are
> trying to do classification; then some of the Spark libraries have a
> model parameter called 'featureImportances' that tells you which
> feature(s) are more dominant in your classification, and you can then
> run your model again with the smaller set of features.
> The two approaches are quite different: what I'm suggesting involves
> training (supervised learning) in the context of a target function,
> while with SVD you are doing unsupervised learning.
>
> On Mon, Aug 8, 2016 at 7:23 PM, Rohit Chaddha wrote:
> > I would rather have fewer features, to make better inferences on the
> > data based on the smaller number of factors.
> > Any suggestions, Sean?
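Peyman's supervised approach might look like the following in Java; everything here (the label/feature column names, the choice of random forest, the `train` dataset) is an assumption for illustration:

```java
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.linalg.Vector;

// Train against some target label, then inspect the per-feature weights.
RandomForestClassificationModel model = new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .fit(train);

// One weight per input factor, summing to 1; keep the top-ranked factors
// and retrain on the smaller set.
Vector importances = model.featureImportances();
```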
Re: Machine learning question (using spark) - removing redundant factors while doing clustering
Hi Sean,

So basically I am trying to cluster a number of elements (a domain object called PItem) based on the quality factors of these items. These elements have 112 quality factors each.

Now the issue is that when I scale the factors using StandardScaler I get a sum of squared errors = 13300. When I don't use scaling, the sum of squared errors = 5.

I was always of the opinion that factors on different scales should be normalized, but I am confused by the results above, and I am wondering what factors should be removed to get a meaningful result (maybe with 5% less accuracy).

Will appreciate any help here.

-Rohit

On Tue, Aug 9, 2016 at 12:55 PM, Sean Owen wrote:
> Fewer features doesn't necessarily mean better predictions, because
> indeed you are subtracting data. It might, because when done well you
> subtract more noise than signal. It is usually done to make data sets
> smaller or more tractable, or to improve explainability.
>
> But you have an unsupervised clustering problem, where talking about
> feature importance doesn't make as much sense. Important to what? There
> is no target variable.
>
> PCA will not 'improve' clustering per se, but can make it faster.
> You may want to specify what you are actually trying to optimize.
>
> On Tue, Aug 9, 2016, 03:23 Rohit Chaddha wrote:
> > I would rather have fewer features, to make better inferences on the
> > data based on the smaller number of factors.
> > Any suggestions, Sean?
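One thing worth noting on the 13300 vs 5 comparison: k-means cost is measured in the units of the feature space, so the SSE of scaled and unscaled runs is not directly comparable. A sketch of scaling followed by clustering in the DataFrame-based API (the column names and k are assumptions):

```java
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.feature.StandardScaler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Standardize each of the 112 factors to zero mean and unit variance.
Dataset<Row> scaled = new StandardScaler()
        .setInputCol("features")
        .setOutputCol("scaledFeatures")
        .setWithMean(true)
        .setWithStd(true)
        .fit(data)
        .transform(data);

KMeansModel model = new KMeans()
        .setFeaturesCol("scaledFeatures")
        .setK(10)                        // k chosen for illustration
        .fit(scaled);

// SSE here is in *scaled* units, so it cannot be compared with the
// SSE of a model fit on the raw features.
double cost = model.computeCost(scaled);
```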