Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
Great, thanks both of you. I was struggling with this issue as well.

-Rohit

On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar wrote:
> Thanks Nick. I also ran into this issue.
> VG, one workaround is to drop the NaN rows from the predictions
> (df.na.drop()) and then use that dataset for the evaluator. In real life,
> probably detect the NaN and recommend the most popular items over some
> window.
> HTH.
> Cheers
>
> On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath wrote:
>> It seems likely that you're running into
>> https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the
>> test dataset in the train/test split contains users or items that were
>> not in the training set. Hence the model doesn't have computed factors
>> for those ids, and ALS 'transform' currently returns NaN for those ids.
>> This in turn results in NaN for the evaluator result.
>>
>> I have a PR open on that issue that will hopefully address this soon.
>>
>> On Sun, 24 Jul 2016 at 17:49 VG wrote:
>>> Ping. Does anyone have some suggestions/advice for me?
>>> It will be really helpful.
>>>
>>> VG
>>>
>>> On Sun, Jul 24, 2016 at 12:19 AM, VG wrote:
>>>> Sean, I did this just to test the model. When I split my data into
>>>> 80% training and 20% test I get a root-mean-square error = NaN, so I
>>>> am wondering where I might be going wrong.
>>>>
>>>> Regards,
>>>> VG
>>>>
>>>> On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen wrote:
>>>>> No, that's certainly not to be expected. ALS works by computing a
>>>>> much lower-rank representation of the input. It would not reproduce
>>>>> the input exactly, and you don't want it to -- this would be
>>>>> seriously overfit. This is why in general you don't evaluate a model
>>>>> on the training set.
>>>>>
>>>>> On Sat, Jul 23, 2016 at 7:37 PM, VG wrote:
>>>>>> I am trying to run ml.ALS to compute some recommendations.
>>>>>>
>>>>>> Just to test, I am using the same dataset for training the ALSModel
>>>>>> and for predicting the results based on the model.
>>>>>>
>>>>>> When I evaluate the result using RegressionEvaluator I get a
>>>>>> root-mean-square error = 1.5544064263236066
>>>>>>
>>>>>> I think this should be 0. Any suggestions on what might be going
>>>>>> wrong?
>>>>>>
>>>>>> Regards,
>>>>>> Vipul
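Krishna's workaround can be sketched in Java against the 2.0 ml API. This is only an illustration: the column names "rating" and "prediction" and the surrounding class are assumptions, not code from the thread.

```java
import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class AlsNanWorkaround {
    // predictions = alsModel.transform(testData)
    public static double rmseIgnoringNaN(Dataset<Row> predictions) {
        // ALS 'transform' returns NaN for users/items that were not in the
        // training set (SPARK-14489); drop those rows before evaluating,
        // otherwise the evaluator itself returns NaN.
        Dataset<Row> cleaned = predictions.na().drop(new String[] {"prediction"});

        RegressionEvaluator evaluator = new RegressionEvaluator()
                .setMetricName("rmse")
                .setLabelCol("rating")
                .setPredictionCol("prediction");
        return evaluator.evaluate(cleaned);
    }
}
```

Dropping rows does skew the metric (the hardest cases are removed), which is why Krishna suggests a fallback recommendation for those users in production.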
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
Hi Krishna,

Great .. I had no idea about this. I tried your suggestion using na.drop() and got an rmse = 1.5794048211812495.
Any suggestions on how this can be reduced and the model improved?

Regards,
Rohit

On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar wrote:
> Thanks Nick. I also ran into this issue.
> VG, one workaround is to drop the NaN rows from the predictions
> (df.na.drop()) and then use that dataset for the evaluator. In real life,
> probably detect the NaN and recommend the most popular items over some
> window.
> HTH.
> Cheers
Is RowMatrix missing in org.apache.spark.ml package?
It is present in mllib but I don't seem to find it in the ml package. Any suggestions please?

-Rohit
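As of 2.0, RowMatrix exists only in the RDD-based API (org.apache.spark.mllib.linalg.distributed); there is no DataFrame-based equivalent in org.apache.spark.ml. A minimal Java sketch, assuming an existing JavaSparkContext `jsc` (the vector values are made up for illustration):

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

// Build an RDD of mllib Vectors; each Vector is one row of the matrix.
JavaRDD<Vector> rows = jsc.parallelize(Arrays.asList(
        Vectors.dense(1.0, 2.0, 3.0),
        Vectors.dense(4.0, 5.0, 6.0)));

// RowMatrix wraps a Scala RDD, hence the .rdd() conversion.
RowMatrix mat = new RowMatrix(rows.rdd());
long m = mat.numRows();  // 2
long n = mat.numCols();  // 3
```

If your data lives in a Dataset, you would need to convert it back to an RDD of mllib Vectors first to use RowMatrix.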
Spark 2.0 -- spark warehouse relative path in absolute URI error
I upgraded from 2.0.0-preview to 2.0.0 and I started getting the following error:

Caused by: java.net.URISyntaxException: Relative path in absolute URI:
file:C:/ibm/spark-warehouse

Any ideas how to fix this?

-Rohit
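One workaround discussed on the related JIRAs is to set spark.sql.warehouse.dir explicitly to a well-formed file URI before the first SparkSession is created, instead of letting Spark derive it from the working directory. A hedged sketch of this configuration (the path and app name are placeholders; I have not verified it on every Windows setup):

```java
import org.apache.spark.sql.SparkSession;

SparkSession session = SparkSession.builder()
        .appName("example")
        .master("local[*]")
        // Explicit absolute file URI avoids the malformed default
        // "file:C:/..." that triggers the URISyntaxException on Windows.
        .config("spark.sql.warehouse.dir", "file:///C:/ibm/spark-warehouse")
        .getOrCreate();
```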
ClassTag variable in broadcast in spark 2.0 ? how to use
In Spark 2.0 there is an additional parameter of type ClassTag in the broadcast method of SparkContext.

What is this variable, and how do I broadcast now?

Here is my existing code with 2.0.0-preview:
Broadcast> b = jsc.broadcast(u.collectAsMap());

What changes need to be made in 2.0 for this?
Broadcast> b = jsc.broadcast(u.collectAsMap(), *??* );

Please help

Rohit
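For reference, the ClassTag overload belongs to the Scala SparkContext; JavaSparkContext.broadcast does not take one, which is why the preview code worked. If you do need to call the Scala method from Java, a ClassTag can be built via ClassTag$.MODULE$. A sketch, assuming `sc` is a SparkContext and `u` a JavaPairRDD (the element types here are illustrative):

```java
import java.util.Map;

import org.apache.spark.SparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

// Build a ClassTag from the runtime class (unchecked with respect to
// generics, but sufficient for broadcast):
ClassTag<Map<Integer, Double>> tag = ClassTag$.MODULE$.apply(Map.class);
Broadcast<Map<Integer, Double>> b = sc.broadcast(u.collectAsMap(), tag);
```

In practice, sticking with JavaSparkContext (as the follow-up in this thread concludes) is the simpler option.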
Re: ClassTag variable in broadcast in spark 2.0 ? how to use
My bad. Please ignore this question. I had accidentally reverted to SparkContext, which was causing the issue.

On Thu, Jul 28, 2016 at 11:36 PM, Rohit Chaddha wrote:
> In Spark 2.0 there is an additional parameter of type ClassTag in the
> broadcast method of SparkContext.
>
> What is this variable, and how do I broadcast now?
>
> Here is my existing code with 2.0.0-preview:
> Broadcast> b = jsc.broadcast(u.collectAsMap());
>
> What changes need to be made in 2.0 for this?
> Broadcast> b = jsc.broadcast(u.collectAsMap(), *??* );
>
> Please help
>
> Rohit
Re: Spark 2.0 -- spark warehouse relative path in absolute URI error
Hello Sean,

I have tried both file:/ and file:///
But it does not work and gives the same error.

-Rohit

On Thu, Jul 28, 2016 at 11:51 PM, Sean Owen wrote:
> IIRC that was fixed, in that this is actually an invalid URI. Use
> file:/C:/... I think.
>
> On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha wrote:
> > I upgraded from 2.0.0-preview to 2.0.0 and I started getting the
> > following error:
> >
> > Caused by: java.net.URISyntaxException: Relative path in absolute URI:
> > file:C:/ibm/spark-warehouse
> >
> > Any ideas how to fix this?
> >
> > -Rohit
Re: Spark 2.0 -- spark warehouse relative path in absolute URI error
I am simply trying to do
session.read().json("file:///C:/data/a.json");

In 2.0.0-preview it was working fine with
sqlContext.read().json("C:/data/a.json");

-Rohit

On Fri, Jul 29, 2016 at 12:03 AM, Sean Owen wrote:
> Hm, file:///C:/... doesn't work? That should certainly be an absolute
> URI with an absolute path. What exactly is your input value for this
> property?
>
> On Thu, Jul 28, 2016 at 11:28 AM, Rohit Chaddha wrote:
> > Hello Sean,
> >
> > I have tried both file:/ and file:///
> > But it does not work and gives the same error.
> >
> > -Rohit
Re: Spark 2.0 -- spark warehouse relative path in absolute URI error
Sean,

I saw some JIRA tickets, and it looks like this is still an open bug (rather than an improvement, as marked in JIRA):

https://issues.apache.org/jira/browse/SPARK-15893
https://issues.apache.org/jira/browse/SPARK-15899

I am experimenting, but do you know of any solution off the top of your head?

On Fri, Jul 29, 2016 at 12:06 AM, Rohit Chaddha wrote:
> I am simply trying to do
> session.read().json("file:///C:/data/a.json");
>
> In 2.0.0-preview it was working fine with
> sqlContext.read().json("C:/data/a.json");
>
> -Rohit
Re: Spark 2.0 -- spark warehouse relative path in absolute URI error
After looking at the comments - I am not sure what the proposed fix is?

On Fri, Jul 29, 2016 at 12:47 AM, Sean Owen wrote:
> Ah, right. This wasn't actually resolved. Yeah, your input on 15899
> would be welcome. See if the proposed fix helps.
>
> On Thu, Jul 28, 2016 at 11:52 AM, Rohit Chaddha wrote:
> > Sean,
> >
> > I saw some JIRA tickets, and it looks like this is still an open bug
> > (rather than an improvement, as marked in JIRA):
> >
> > https://issues.apache.org/jira/browse/SPARK-15893
> > https://issues.apache.org/jira/browse/SPARK-15899
> >
> > I am experimenting, but do you know of any solution off the top of
> > your head?
calling dataset.show on a custom object - displays toString() value as first column and blank for rest
I have a custom object called A and a corresponding Dataset. When I call the datasetA.show() method I get the following:

+---+---+----+------+---+
| id| da|like|values|uid|
+---+---+----+------+---+
|A.toString()...|
|A.toString()...|
|A.toString()...|
|A.toString()...|
|A.toString()...|
|A.toString()...|

That is, A.toString() is called and displayed as the value of the first column, and all the remaining columns are blank.

Any suggestions on what should be done to fix this?

- Rohit
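One common cause (an assumption here, since the message doesn't show how the Dataset was built) is creating the Dataset with a Kryo or Java-serialization encoder, which stores the whole object opaquely rather than as columns. If A is a Java bean (no-arg constructor plus getters/setters), Encoders.bean lets Spark map each property to its own column:

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

// Hypothetical: A is a bean with id, da, like, values, uid properties,
// and session is an existing SparkSession.
Dataset<A> datasetA = session.createDataset(
        Arrays.asList(new A(), new A()),
        Encoders.bean(A.class));

datasetA.show();  // each bean property should now appear as its own column
```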
build error - failing tests - error while building spark 2.0 trunk from github
--- T E S T S ---
Running org.apache.spark.api.java.OptionalSuite
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.052 sec - in org.apache.spark.api.java.OptionalSuite
Running org.apache.spark.JavaAPISuite
Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 23.537 sec <<< FAILURE! - in org.apache.spark.JavaAPISuite
wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.331 sec  <<< FAILURE!
java.lang.AssertionError: expected: but was:
    at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1087)
Running org.apache.spark.JavaJdbcRDDSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.799 sec - in org.apache.spark.JavaJdbcRDDSuite
Running org.apache.spark.launcher.SparkLauncherSuite
Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.04 sec <<< FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite
testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite)  Time elapsed: 0.03 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
    at org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:110)
Running org.apache.spark.memory.TaskMemoryManagerSuite
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec - in org.apache.spark.memory.TaskMemoryManagerSuite
Running org.apache.spark.shuffle.sort.PackedRecordPointerSuite
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec - in org.apache.spark.shuffle.sort.PackedRecordPointerSuite
Running org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.103 sec - in org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
Running org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.199 sec - in org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
Running org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.67 sec - in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.97 sec - in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.583 sec - in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.533 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.606 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
Running org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec - in org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0

Results :

Failed tests:
  JavaAPISuite.wholeTextFiles:1087 expected: but was:
  SparkLauncherSuite.testChildProcLauncher:110 expected:<0> but was:<1>

Tests run: 189, Failures: 2, Errors: 0, Skipped: 0
Calling KmeansModel predict method
The predict method takes a Vector object. I am unable to figure out how to construct this Spark Vector object to get predictions from my model.

Does anyone have some code in Java for this?

Thanks
Rohit
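In the RDD-based API, the vector is built with the org.apache.spark.mllib.linalg.Vectors factory. A minimal sketch (the values and the already-trained `model` variable are assumptions):

```java
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Dense vector: order and length must match the features the model
// was trained on.
Vector point = Vectors.dense(1.0, 0.5, 3.2);
int cluster = model.predict(point);  // index of the nearest cluster center

// Sparse alternative: total size, then indices and values of the non-zeros.
Vector sparsePoint = Vectors.sparse(112, new int[] {0, 7},
        new double[] {1.0, 2.5});
```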
Machine learning question (using spark) - removing redundant factors while doing clustering
I have a data-set where each data-point has 112 factors.

I want to remove the factors which are not relevant, and say reduce to 20 factors out of these 112, and then do clustering of data-points using these 20 factors.

How do I do this, and how do I figure out which of the 20 factors are useful for analysis?

I see SVD and PCA implementations, but I am not sure if these tell you which elements are removed and which remain.

Can someone please help me understand what to do here?

thanks,
-Rohit
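For concreteness, a PCA sketch in the DataFrame-based API (Java). The column names and k are assumptions; note that PCA does not *select* 20 of the 112 original factors - each output component is a linear combination of all 112, which is exactly the distinction the replies below discuss:

```java
import org.apache.spark.ml.feature.PCA;
import org.apache.spark.ml.feature.PCAModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// data is assumed to have a Vector column "features" with 112 entries.
PCAModel pca = new PCA()
        .setInputCol("features")
        .setOutputCol("pcaFeatures")
        .setK(20)                 // keep 20 principal components
        .fit(data);

Dataset<Row> reduced = pca.transform(data);  // adds the "pcaFeatures" column
```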
Re: Machine learning question (using spark) - removing redundant factors while doing clustering
I would rather have fewer features, to make better inferences on the data based on the smaller number of factors.
Any suggestions, Sean?

On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen wrote:
> Yes, that's exactly what PCA is for, as Sivakumaran noted. Do you
> really want to select features, or just obtain a lower-dimensional
> representation of them, with less redundancy?
>
> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane wrote:
> > There must be an algorithmic way to figure out which of these factors
> > contribute the least and remove them from the analysis.
> > I am hoping someone can throw some insight on this.
> >
> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S wrote:
> >> Not an expert here, but the first step would be to devote some time
> >> and identify which of these 112 factors are actually causative. Some
> >> domain knowledge of the data may be required. Then, you can start off
> >> with PCA.
> >>
> >> HTH,
> >>
> >> Regards,
> >>
> >> Sivakumaran S
> >>
> >> On 08-Aug-2016, at 3:01 PM, Tony Lane wrote:
> >>
> >> Great question Rohit. I am in my early days of ML as well and it
> >> would be great if we get some idea on this from other experts in this
> >> group.
> >>
> >> I know we can reduce dimensions by using PCA, but I think that does
> >> not allow us to understand which factors from the original we are
> >> using in the end.
> >>
> >> - Tony L.
> >>
> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha wrote:
> >>> I have a data-set where each data-point has 112 factors.
> >>>
> >>> I want to remove the factors which are not relevant, and say reduce
> >>> to 20 factors out of these 112, and then do clustering of
> >>> data-points using these 20 factors.
> >>>
> >>> How do I do this, and how do I figure out which of the 20 factors
> >>> are useful for analysis?
> >>>
> >>> I see SVD and PCA implementations, but I am not sure if these tell
> >>> you which elements are removed and which remain.
> >>>
> >>> Can someone please help me understand what to do here?
> >>>
> >>> thanks,
> >>> -Rohit
Re: Machine learning question (using spark) - removing redundant factors while doing clustering
@Peyman - do any of the clustering algorithms have "feature importance" or "feature selection" ability? I can't seem to pinpoint one.

On Tue, Aug 9, 2016 at 8:49 AM, Peyman Mohajerian wrote:
> You can try 'feature importances' or 'feature selection', depending on
> what else you want to do with the remaining features. Let's say you are
> trying to do classification; then some of the Spark libraries have a
> model parameter called 'featureImportances' that tells you which
> feature(s) are more dominant in your classification, and you can then
> run your model again with the smaller set of features.
> The two approaches are quite different: what I'm suggesting involves
> training (supervised learning) in the context of a target function,
> while with SVD you are doing unsupervised learning.
>
> On Mon, Aug 8, 2016 at 7:23 PM, Rohit Chaddha wrote:
> > I would rather have fewer features, to make better inferences on the
> > data based on the smaller number of factors.
> > Any suggestions, Sean?
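Peyman's supervised approach might look like the following in Java; everything here (the label/feature column names, the choice of random forest, the `train` dataset) is an assumption for illustration:

```java
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.linalg.Vector;

// Train against some target label, then inspect the per-feature weights.
RandomForestClassificationModel model = new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .fit(train);

// One weight per input factor, summing to 1; keep the top-ranked factors
// and retrain on the smaller set.
Vector importances = model.featureImportances();
```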
Re: Machine learning question (using spark) - removing redundant factors while doing clustering
Hi Sean,

So basically I am trying to cluster a number of elements (a domain object called PItem) based on the quality factors of these items. These elements have 112 quality factors each.

Now the issue is that when I scale the factors using StandardScaler I get a sum of squared errors = 13300. When I don't use scaling, the sum of squared errors = 5.

I was always of the opinion that factors on different scales should be normalized, but I am confused by the results above, and I am wondering what factors should be removed to get a meaningful result (maybe with 5% less accuracy).

Will appreciate any help here.

-Rohit

On Tue, Aug 9, 2016 at 12:55 PM, Sean Owen wrote:
> Fewer features doesn't necessarily mean better predictions, because
> indeed you are subtracting data. It might, because when done well you
> subtract more noise than signal. It is usually done to make data sets
> smaller or more tractable, or to improve explainability.
>
> But you have an unsupervised clustering problem, where talking about
> feature importance doesn't make as much sense. Important to what? There
> is no target variable.
>
> PCA will not 'improve' clustering per se, but can make it faster.
> You may want to specify what you are actually trying to optimize.
>
> On Tue, Aug 9, 2016, 03:23 Rohit Chaddha wrote:
> > I would rather have fewer features, to make better inferences on the
> > data based on the smaller number of factors.
> > Any suggestions, Sean?
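One thing worth noting on the 13300 vs 5 comparison: k-means cost is measured in the units of the feature space, so the SSE of scaled and unscaled runs is not directly comparable. A sketch of scaling followed by clustering in the DataFrame-based API (the column names and k are assumptions):

```java
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.feature.StandardScaler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Standardize each of the 112 factors to zero mean and unit variance.
Dataset<Row> scaled = new StandardScaler()
        .setInputCol("features")
        .setOutputCol("scaledFeatures")
        .setWithMean(true)
        .setWithStd(true)
        .fit(data)
        .transform(data);

KMeansModel model = new KMeans()
        .setFeaturesCol("scaledFeatures")
        .setK(10)                        // k chosen for illustration
        .fit(scaled);

// SSE here is in *scaled* units, so it cannot be compared with the
// SSE of a model fit on the raw features.
double cost = model.computeCost(scaled);
```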