RE: Column Similarity using DIMSUM

2015-03-19 Thread Manish Gupta 8
Hi Reza,

Behavior:

· I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 100. Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522 (http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0) with all workers running on 100% CPU. There is hardly any shuffle read/write happening. And after some time, “ERROR YarnClientClusterScheduler: Lost executor” starts showing (maybe because of the nodes running on 100% CPU).

· For threshold 200+ (tried up to 1000) it gave the error below (the value reported here was different for different thresholds); see the sketch after this list.
Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: Oversampling should be greater than 1: 0.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)
at EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)
at EntitySimilarity$.main(EntitySimilarity.scala:80)
at EntitySimilarity.main(EntitySimilarity.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

· If I get rid of frequently occurring attributes and keep only those 
attributes which occur in fewer than 2% of entities, then the job doesn't get stuck or fail.
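
For reference, the "Oversampling should be greater than 1" failure at large thresholds is consistent with the oversampling parameter being computed (in Spark 1.2) as roughly gamma = 10 * ln(numCols) / threshold and required to exceed 1; that formula is an assumption read from the RowMatrix source, not something stated in this thread. A small standalone sketch of the arithmetic:

object OversamplingCheck {
  // Assumed Spark 1.2 behaviour: gamma = 10 * ln(numCols) / threshold, and
  // columnSimilaritiesDIMSUM requires gamma > 1.
  def gamma(numCols: Long, threshold: Double): Double =
    10.0 * math.log(numCols.toDouble) / threshold

  def main(args: Array[String]): Unit = {
    val n = 56431L // column count reported in this thread
    Seq(100.0, 200.0, 1000.0).foreach { t =>
      println(f"threshold = $t%6.1f -> gamma = ${gamma(n, t)}%.3f")
    }
    // threshold =  100.0 -> gamma ~ 1.09 (accepted)
    // threshold =  200.0 -> gamma ~ 0.55 (fails the requirement)
    // threshold = 1000.0 -> gamma ~ 0.11 (fails the requirement)
  }
}

If that reading is right, any threshold above roughly 10 * ln(56431) ~ 109 would trip the requirement, which matches the failures reported for 200 and above.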

Data & environment:

· RowMatrix of size 43345 X 56431

· In the matrix there are a couple of rows whose value is the same in up to 
50% of the columns (frequently occurring attributes).

· I am running this on one of our Dev clusters running CDH 5.3.0 with 5 
data nodes (each 4-core and 16GB RAM).

My question – Do you think this is a hardware sizing issue, and should we test it 
on larger machines?

Regards,
Manish

From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Wednesday, March 18, 2015 11:20 PM
To: Reza Zadeh
Cc: user@spark.apache.org
Subject: RE: Column Similarity using DIMSUM

Hi Reza,

I have only tried thresholds in the range of 0 to 1. I was not aware that the 
threshold can be set above 1.
Will try and update.

Thank You

- Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Wednesday, March 18, 2015 10:55 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold 
values? Try threshold values of 0.1, 0.5, 1, 20, and higher.
Best,
Reza
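
A minimal sketch of such a threshold sweep (the RowMatrix name mat is assumed; this code is illustrative, not from the thread):

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}

// Try several DIMSUM thresholds on an existing RowMatrix. Higher thresholds let
// DIMSUM sample more aggressively, trading accuracy on low-similarity pairs for
// less computation and shuffle.
def sweepThresholds(mat: RowMatrix): Unit = {
  Seq(0.1, 0.5, 1.0, 20.0).foreach { t =>
    val sims: CoordinateMatrix = mat.columnSimilarities(t)
    println(s"threshold $t -> ${sims.entries.count()} similarity entries")
  }
}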

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 
mgupt...@sapient.com wrote:
Hi,

I am running Column Similarity (All Pairs Similarity using DIMSUM) in Spark on 
a dataset that looks like (Entity, Attribute, Value), after transforming it to a 
row-oriented dense matrix format (one row per Attribute, one column per Entity, 
each cell holding a normalized value between 0 and 1).
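
A rough illustration of that layout (the RDD name attributeRows and this construction are assumptions; the actual transformation code is not shown in the thread):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// One dense row per Attribute, one column per Entity, each cell holding a value
// normalized to the [0, 1] range.
def toRowMatrix(attributeRows: RDD[Array[Double]]): RowMatrix =
  new RowMatrix(attributeRows.map(values => Vectors.dense(values)))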

It runs extremely fast in computing similarities between Entities in most cases, 
but if there is even a single attribute which occurs frequently across the 
entities (say in 30% of entities), the job falls apart. The whole job gets stuck 
and worker nodes start running at 100% CPU without making any progress on the 
job stage. If the dataset is very small (in the range of 1000 Entities X 500 
attributes, some frequently occurring) the job finishes but takes too long 
(sometimes it gives GC errors too).

If none of the attributes is frequently occurring (all < 2%), then the job runs 
lightning fast (even for 100 Entities X 1 attributes) and the results are very 
accurate.

I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having 4 cores 
and 16GB of RAM.

My question is - Is this behavior expected for datasets where some Attributes 
frequently occur?

Thanks,
Manish Gupta





RE: Column Similarity using DIMSUM

2015-03-19 Thread Manish Gupta 8
Thanks Reza. It makes perfect sense.

Regards,
Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Thursday, March 19, 2015 11:58 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
With 56431 columns, the output can be as large as 56431 x 56431 ~= 3bn entries. 
When a single row is dense, that can end up overwhelming a machine. You can push 
that up with more RAM, but note that DIMSUM is meant for tall and skinny matrices: 
it scales linearly (and across the cluster) with the number of rows, but still 
quadratically with the number of columns. I will be updating the documentation to 
make this clear.
Best,
Reza
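
A back-of-the-envelope check of that quadratic growth, for the matrix size reported in this thread (the helper name is illustrative only):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Worst-case number of column pairs DIMSUM may emit: numCols * (numCols - 1) / 2.
// For numCols = 56431 that is roughly 1.6 billion pairs, and a single dense row
// contributes to all of them.
def estimatePairs(mat: RowMatrix): Long = {
  val n = mat.numCols()
  n * (n - 1) / 2
}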

On Thu, Mar 19, 2015 at 3:46 AM, Manish Gupta 8 
mgupt...@sapient.com wrote:
Hi Reza,

Behavior:

• I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 100. Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522 (http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0) with all workers running on 100% CPU. There is hardly any shuffle read/write happening. And after some time, “ERROR YarnClientClusterScheduler: Lost executor” starts showing (maybe because of the nodes running on 100% CPU).

• For threshold 200+ (tried up to 1000) it gave the error below (the value reported here was different for different thresholds).
Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: Oversampling should be greater than 1: 0.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)
at EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)
at EntitySimilarity$.main(EntitySimilarity.scala:80)
at EntitySimilarity.main(EntitySimilarity.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

• If I get rid of frequently occurring attributes and keep only those 
attributes which occur in fewer than 2% of entities, then the job doesn't get stuck or fail.

Data & environment:

• RowMatrix of size 43345 X 56431

• In the matrix there are a couple of rows whose value is the same in up to 
50% of the columns (frequently occurring attributes).

• I am running this on one of our Dev clusters running CDH 5.3.0 with 5 
data nodes (each 4-core and 16GB RAM).

My question – Do you think this is a hardware sizing issue, and should we test it 
on larger machines?

Regards,
Manish

From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Wednesday, March 18, 2015 11:20 PM
To: Reza Zadeh
Cc: user@spark.apache.org
Subject: RE: Column Similarity using DIMSUM

Hi Reza,

I have only tried thresholds in the range of 0 to 1. I was not aware that the 
threshold can be set above 1.
Will try and update.

Thank You

- Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Wednesday, March 18, 2015 10:55 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold 
values? Try threshold values of 0.1, 0.5, 1, 20, and higher.
Best,
Reza

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 
mgupt...@sapient.com wrote:
Hi,

I am running Column Similarity (All Pairs Similarity using DIMSUM) in Spark on 
a dataset that looks like (Entity, Attribute, Value), after transforming it to a 
row-oriented dense matrix format (one row per Attribute, one column per Entity, 
each cell holding a normalized value between 0 and 1).

It runs extremely fast in computing similarities between Entities in most cases, 
but if there is even a single attribute which occurs frequently across the 
entities (say in 30% of entities), the job falls apart. The whole job gets stuck 
and worker nodes start running at 100% CPU without making any progress on the 
job stage. If the dataset is very small (in the range of 1000 Entities X 500 
attributes, some frequently occurring) the job finishes but takes too long 
(sometimes it gives GC errors too).

If none of the attributes is frequently occurring

Re: Column Similarity using DIMSUM

2015-03-19 Thread Reza Zadeh
Hi Manish,
With 56431 columns, the output can be as large as 56431 x 56431 ~= 3bn entries.
When a single row is dense, that can end up overwhelming a machine. You can
push that up with more RAM, but note that DIMSUM is meant for tall and
skinny matrices: it scales linearly (and across the cluster) with the number
of rows, but still quadratically with the number of columns. I will be updating
the documentation to make this clear.
Best,
Reza

On Thu, Mar 19, 2015 at 3:46 AM, Manish Gupta 8 mgupt...@sapient.com
wrote:

  Hi Reza,



 *Behavior*:

 · I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 100. Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522 (http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0) with all workers running on 100% CPU. There is hardly any shuffle read/write happening. And after some time, “ERROR YarnClientClusterScheduler: Lost executor” starts showing (maybe because of the nodes running on 100% CPU).

 · For threshold 200+ (tried up to 1000) it gave the error below (the value reported here was different for different thresholds).

 Exception in thread "main" java.lang.IllegalArgumentException: requirement
 failed: Oversampling should be greater than 1: 0.

 at scala.Predef$.require(Predef.scala:233)

 at
 org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)

 at
 org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)

 at
 EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)

 at EntitySimilarity$.main(EntitySimilarity.scala:80)

 at EntitySimilarity.main(EntitySimilarity.scala)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
 Method)

 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:606)

 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)

 at
 org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)

 at
 org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 · If I get rid of frequently occurring attributes and keep only those
 attributes which occur in fewer than 2% of entities, then the job doesn't
 get stuck or fail.



 *Data & environment*:

 · RowMatrix of size 43345 X 56431

 · In the matrix there are a couple of rows whose value is the same in
 up to 50% of the columns (frequently occurring attributes).

 · I am running this on one of our Dev clusters running CDH 5.3.0
 with 5 data nodes (each 4-core and 16GB RAM).



 My question – Do you think this is a hardware sizing issue, and should we
 test it on larger machines?



 Regards,

 Manish



 *From:* Manish Gupta 8 [mailto:mgupt...@sapient.com]
 *Sent:* Wednesday, March 18, 2015 11:20 PM
 *To:* Reza Zadeh
 *Cc:* user@spark.apache.org
 *Subject:* RE: Column Similarity using DIMSUM



 Hi Reza,



 I have only tried thresholds in the range of 0 to 1. I was not aware
 that the threshold can be set above 1.

 Will try and update.



 Thank You



 - Manish



 *From:* Reza Zadeh [mailto:r...@databricks.com]
 *Sent:* Wednesday, March 18, 2015 10:55 PM
 *To:* Manish Gupta 8
 *Cc:* user@spark.apache.org
 *Subject:* Re: Column Similarity using DIMSUM



 Hi Manish,

 Did you try calling columnSimilarities(threshold) with different threshold
 values? Try threshold values of 0.1, 0.5, 1, 20, and higher.

 Best,

 Reza



 On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 mgupt...@sapient.com
 wrote:

   Hi,



 I am running Column Similarity (All Pairs Similarity using DIMSUM) in
 Spark on a dataset that looks like (Entity, Attribute, Value), after
 transforming it to a row-oriented dense matrix format (one row per
 Attribute, one column per Entity, each cell holding a normalized value
 between 0 and 1).



 It runs extremely fast in computing similarities between Entities in most
 cases, but if there is even a single attribute which occurs frequently
 across the entities (say in 30% of entities), the job falls apart.
 The whole job gets stuck and worker nodes start running at 100% CPU without
 making any progress on the job stage. If the dataset is very small (in the
 range of 1000 Entities X 500 attributes, some frequently occurring) the
 job finishes but takes too long (sometimes it gives GC errors too).



 If none of the attributes is frequently occurring (all < 2%), then the job
 runs lightning fast (even for 100 Entities X 1 attributes)
 and the results are very accurate.



 I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having
 4 cores and 16GB

RE: Column Similarity using DIMSUM

2015-03-18 Thread Manish Gupta 8
Hi Reza,

I have only tried thresholds in the range of 0 to 1. I was not aware that the 
threshold can be set above 1.
Will try and update.

Thank You

- Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Wednesday, March 18, 2015 10:55 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold 
values? Try threshold values of 0.1, 0.5, 1, 20, and higher.
Best,
Reza

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 
mgupt...@sapient.com wrote:
Hi,

I am running Column Similarity (All Pairs Similarity using DIMSUM) in Spark on 
a dataset that looks like (Entity, Attribute, Value), after transforming it to a 
row-oriented dense matrix format (one row per Attribute, one column per Entity, 
each cell holding a normalized value between 0 and 1).

It runs extremely fast in computing similarities between Entities in most cases, 
but if there is even a single attribute which occurs frequently across the 
entities (say in 30% of entities), the job falls apart. The whole job gets stuck 
and worker nodes start running at 100% CPU without making any progress on the 
job stage. If the dataset is very small (in the range of 1000 Entities X 500 
attributes, some frequently occurring) the job finishes but takes too long 
(sometimes it gives GC errors too).

If none of the attributes is frequently occurring (all < 2%), then the job runs 
lightning fast (even for 100 Entities X 1 attributes) and the results are very 
accurate.

I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having 4 cores 
and 16GB of RAM.

My question is - Is this behavior expected for datasets where some Attributes 
frequently occur?

Thanks,
Manish Gupta





Re: Column Similarity using DIMSUM

2015-03-18 Thread Reza Zadeh
Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold
values? Try threshold values of 0.1, 0.5, 1, 20, and higher.
Best,
Reza

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 mgupt...@sapient.com
wrote:

   Hi,



 I am running Column Similarity (All Pairs Similarity using DIMSUM) in
 Spark on a dataset that looks like (Entity, Attribute, Value), after
 transforming it to a row-oriented dense matrix format (one row per
 Attribute, one column per Entity, each cell holding a normalized value
 between 0 and 1).



 It runs extremely fast in computing similarities between Entities in most
 cases, but if there is even a single attribute which occurs frequently
 across the entities (say in 30% of entities), the job falls apart.
 The whole job gets stuck and worker nodes start running at 100% CPU without
 making any progress on the job stage. If the dataset is very small (in the
 range of 1000 Entities X 500 attributes, some frequently occurring) the
 job finishes but takes too long (sometimes it gives GC errors too).



 If none of the attributes is frequently occurring (all < 2%), then the job
 runs lightning fast (even for 100 Entities X 1 attributes)
 and the results are very accurate.



 I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having
 4 cores and 16GB of RAM.



 My question is - *Is this behavior expected for datasets where some
 Attributes frequently occur*?



 Thanks,

 Manish Gupta