Hi Manish,

Did you try calling columnSimilarities(threshold) with different threshold values? Try threshold values of 0.1, 0.5, 1, 20, and higher.

Best,
Reza
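For intuition, here is a minimal pure-Python sketch of the exact all-pairs computation that columnSimilarities(threshold) approximates (the function name `column_similarities` and the toy matrix are illustrative, not Spark's API). DIMSUM avoids this brute-force shuffle by sampling heavy columns, and a higher threshold lets it sample those frequently occurring columns more aggressively:

```python
import math

def column_similarities(rows, threshold=0.0):
    """Brute-force cosine similarity between all pairs of columns,
    keeping only pairs with similarity >= threshold.
    DIMSUM approximates this via sampling; larger thresholds mean
    more aggressive downsampling of heavy (frequent) columns."""
    n = len(rows[0])
    # Column L2 norms.
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n)]
    sims = {}
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(r[i] * r[j] for r in rows)
            if norms[i] > 0 and norms[j] > 0:
                s = dot / (norms[i] * norms[j])
                if s >= threshold:
                    sims[(i, j)] = s
    return sims

# Toy example: three entities (columns) over two attributes (rows).
rows = [
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
print(column_similarities(rows, threshold=0.4))
```

With threshold=0.4 the dissimilar pair (columns 0 and 2, similarity 0) is dropped, while the two overlapping pairs (similarity 1/sqrt(2) ≈ 0.707) are kept.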
On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
> Hi,
>
> I am running Column Similarity (All Pairs Similarity using DIMSUM) in
> Spark on a dataset that looks like (Entity, Attribute, Value) after
> transforming it to a row-oriented dense matrix format (one row per
> Attribute, one column per Entity, each cell holding a normalized value
> between 0 and 1).
>
> It computes similarities between Entities extremely fast in most cases,
> but if there is even a single attribute that occurs frequently across
> the entities (say in 30% of entities), the job falls apart. The whole
> job gets stuck and worker nodes run at 100% CPU without making any
> progress on the job stage. If the dataset is very small (around 1000
> Entities x 500 attributes, some frequently occurring), the job finishes
> but takes too long (sometimes it gives GC errors too).
>
> If none of the attributes occurs frequently (all < 2%), the job runs
> lightning fast (even for 1000000 Entities x 10000 attributes) and the
> results are very accurate.
>
> I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node with
> 4 cores and 16GB of RAM.
>
> My question is - *Is this behavior expected for datasets where some
> Attributes frequently occur?*
>
> Thanks,
> Manish Gupta