Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold
values? Try thresholds of 0.1, 0.5, 1, 20, and higher.
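To see what the threshold buys you, here is a rough pure-Python illustration (not Spark code; the toy matrix and helper names are made up) of the quantity columnSimilarities approximates: pairwise cosine similarities between columns, where a higher threshold leaves fewer pairs above it and so lets DIMSUM prune more work by sampling.

```python
import numpy as np

# Toy row-oriented matrix: rows = attributes, columns = entities,
# matching the layout described below (normalized values in [0, 1]).
A = np.array([
    [1.0, 0.8, 0.0, 0.2],
    [0.0, 0.5, 0.9, 0.1],
    [0.3, 0.0, 0.4, 1.0],
])

norms = np.linalg.norm(A, axis=0)        # column (entity) norms
gram = A.T @ A                           # all pairwise dot products
cosine = gram / np.outer(norms, norms)   # exact cosine similarities

def similar_pairs(threshold):
    """Entity pairs (i < j) whose cosine similarity is >= threshold."""
    n = cosine.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if cosine[i, j] >= threshold]

# Raising the threshold shrinks the candidate set, which is exactly
# the pruning columnSimilarities(threshold) exploits via sampling.
print(len(similar_pairs(0.1)))   # all 6 pairs survive
print(similar_pairs(0.5))        # only the most similar pair survives
```

With a threshold of 0 you get the exact (brute-force) computation; the larger the threshold, the more aggressively DIMSUM samples and the cheaper the job.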
Best,
Reza

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 <mgupt...@sapient.com>
wrote:

>   Hi,
>
>
>
> I am running Column Similarity (All Pairs Similarity using DIMSUM) in
> Spark on a dataset that looks like (Entity, Attribute, Value) after
> transforming the same to a row-oriented dense matrix format (one line per
> Attribute, one column per Entity, each cell with normalized value – between
> 0 and 1).
>
>
>
> It runs extremely fast in computing similarities between Entities in most
> cases, but if there is even a single attribute that occurs frequently
> across the entities (say in 30% of them), the job falls apart. The whole
> job gets stuck, and the worker nodes run at 100% CPU without making any
> progress on the job stage. If the dataset is very small (in the range of
> 1000 Entities x 500 Attributes, some frequently occurring), the job
> finishes but takes too long (and sometimes gives GC errors too).
>
>
>
> If none of the attributes occurs frequently (all < 2%), then the job runs
> lightning fast (even for 1000000 Entities x 10000 Attributes) and the
> results are very accurate.
>
>
>
> I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having
> 4 cores and 16GB of RAM.
>
>
>
> My question is: *Is this behavior expected for datasets where some
> Attributes occur frequently*?
>
>
>
> Thanks,
>
> Manish Gupta
>
>
>
>
>
