Thanks Reza. It makes perfect sense.

Regards,
Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Thursday, March 19, 2015 11:58 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
With 56431 columns, the output can be as large as 56431 x 56431 ~= 3bn entries. When 
even a single row is dense, that can end up overwhelming a machine. You can push that 
limit up with more RAM, but note that DIMSUM is meant for tall and skinny matrices: 
it scales linearly (and across the cluster) with the number of rows, but still 
quadratically with the number of columns. I will be updating the documentation to 
make this clear.
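As a quick back-of-the-envelope check of that worst case (plain arithmetic, nothing Spark-specific):

    val n = 56431L
    val worstCasePairs = n * n   // 3,184,457,761, i.e. the ~3bn figure above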
Best,
Reza

On Thu, Mar 19, 2015 at 3:46 AM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
Hi Reza,

Behavior:

•         I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 
100. Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522 
(http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0) 
with all workers running at 100% CPU. There is hardly any shuffle read/write 
happening. And after some time, “ERROR YarnClientClusterScheduler: Lost 
executor” messages start showing (maybe because the nodes are running at 100% CPU).

•         For thresholds of 200+ (tried up to 1000) it gave an error (here the 
xxxxxxxxxxxxxxxx value was different for different thresholds; see the note after 
this list):
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Oversampling should be greater than 1: 0.xxxxxxxxxxxxxxxxxxxx
                at scala.Predef$.require(Predef.scala:233)
                at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)
                at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)
                at EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)
                at EntitySimilarity$.main(EntitySimilarity.scala:80)
                at EntitySimilarity.main(EntitySimilarity.scala)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:606)
                at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
                at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
                at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

•         If I get rid of the frequently occurring attributes and keep only those 
attributes which occur in at most 2% of entities, then the job doesn’t get stuck or fail.
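
A note on that error, from a quick read of the Spark 1.2 source (this is my reading of 
RowMatrix.scala, not authoritative): columnSimilarities(threshold) appears to derive the 
DIMSUM oversampling factor gamma from the threshold roughly as 10 * ln(numCols) / threshold 
and then requires gamma > 1. A sketch of that check with our numbers:

    val numCols = 56431
    val threshold = 200.0
    val gamma = 10 * math.log(numCols) / threshold            // ~0.55 for threshold = 200
    require(gamma > 1.0, s"Oversampling should be greater than 1: $gamma")   // the check that fails

With numCols = 56431, 10 * ln(56431) is roughly 109, so any threshold above ~109 would push 
gamma below 1 and trigger exactly this “Oversampling should be greater than 1” failure.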

Data & environment:

•         RowMatrix of size 43345 X 56431

•         In the matrix there are a couple of rows whose value is the same in up to 
50% of the columns (frequently occurring attributes).

•         I am running this on one of our Dev clusters running CDH 5.3.0 with 5 
data nodes (each with 4 cores and 16GB RAM).

My question: do you think this is a hardware sizing issue, and should we test it 
on larger machines?

Regards,
Manish

From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Wednesday, March 18, 2015 11:20 PM
To: Reza Zadeh
Cc: user@spark.apache.org
Subject: RE: Column Similarity using DIMSUM

Hi Reza,

I have only tried threshold values in the range of 0 to 1. I was not aware that the 
threshold can be set above 1.
Will try and update.

Thank You

- Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Wednesday, March 18, 2015 10:55 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold 
values? Try threshold values of 0.1, 0.5, 1, 20, and higher.
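For example, a minimal sketch (assuming mat is the RowMatrix you already built; the name is just illustrative):

    val approx = mat.columnSimilarities(0.5)   // CoordinateMatrix of (i, j, similarity), sampled
    val exact  = mat.columnSimilarities()      // brute-force, no sampling, for comparison

Higher thresholds make the sampling more aggressive, so the job gets cheaper at the cost of 
accuracy on the smaller similarities.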
Best,
Reza

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
Hi,

I am running Column Similarity (All Pairs Similarity using DIMSUM) in Spark on 
a dataset that looks like (Entity, Attribute, Value), after transforming it to a 
row-oriented dense matrix format (one line per Attribute, one column per Entity, 
each cell holding a normalized value between 0 and 1).
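Roughly, the transformation looks like this (a simplified sketch, not the actual code; 
triples and entityIndex are illustrative names):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    // triples: RDD[(String, String, Double)] of (entity, attribute, normalizedValue)
    // entityIndex: Map[String, Int] assigning each entity a column position
    def toRowMatrix(triples: RDD[(String, String, Double)],
                    entityIndex: Map[String, Int]): RowMatrix = {
      val numEntities = entityIndex.size
      val rows = triples
        .map { case (entity, attribute, value) => (attribute, (entityIndex(entity), value)) }
        .groupByKey()                                   // one row per attribute
        .map { case (_, cells) =>
          val row = new Array[Double](numEntities)      // dense, zero-filled row
          cells.foreach { case (col, value) => row(col) = value }
          Vectors.dense(row)
        }
      new RowMatrix(rows)
    }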

It runs extremely fast in computing similarities between Entities in most cases, but 
if there is even a single attribute which occurs frequently across the entities (say 
in 30% of entities), the job falls apart. The whole job gets stuck and the worker 
nodes start running at 100% CPU without making any progress on the job stage. If the 
dataset is very small (in the range of 1000 Entities X 500 attributes, some frequently 
occurring), the job finishes but takes too long (sometimes it gives GC errors too).

If none of the attributes occurs frequently (all < 2%), then the job runs lightning 
fast (even for 1000000 Entities X 10000 attributes) and the results are very accurate.

I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having 4 cores and 
16GB of RAM.

My question is: is this behavior expected for datasets where some Attributes occur 
frequently?

Thanks,
Manish Gupta



