Hi Manish,

Did you try calling columnSimilarities(threshold) with different threshold values? Try threshold values of 0.1, 0.5, 1, 20, and higher.

Best,
Reza
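For intuition, here is a minimal pure-Python sketch of the exact all-pairs computation that columnSimilarities(threshold) approximates (the function name `column_similarities` and the toy matrix are illustrative, not Spark's API). DIMSUM avoids this brute-force shuffle by sampling heavy columns, and a higher threshold lets it sample those frequently occurring columns more aggressively:

```python
import math

def column_similarities(rows, threshold=0.0):
    """Brute-force cosine similarity between all pairs of columns,
    keeping only pairs with similarity >= threshold.
    DIMSUM approximates this via sampling; larger thresholds mean
    more aggressive downsampling of heavy (frequent) columns."""
    n = len(rows[0])
    # Column L2 norms.
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n)]
    sims = {}
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(r[i] * r[j] for r in rows)
            if norms[i] > 0 and norms[j] > 0:
                s = dot / (norms[i] * norms[j])
                if s >= threshold:
                    sims[(i, j)] = s
    return sims

# Toy example: three entities (columns) over two attributes (rows).
rows = [
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
print(column_similarities(rows, threshold=0.4))
```

With threshold=0.4 the dissimilar pair (columns 0 and 2, similarity 0) is dropped, while the two overlapping pairs (similarity 1/sqrt(2) ≈ 0.707) are kept.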
On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
> Hi,
>
> I am running Column Similarity (All Pairs Similarity using DIMSUM) in
> Spark on a dataset that looks like (Entity, Attribute, Value) after
> transforming it to a row-oriented dense matrix format (one row per
> Attribute, one column per Entity, each cell holding a normalized value
> between 0 and 1).
>
> It computes similarities between Entities extremely fast in most cases,
> but if there is even a single attribute that occurs frequently across
> the entities (say in 30% of entities), the job falls apart. The whole
> job gets stuck and worker nodes run at 100% CPU without making any
> progress on the job stage. If the dataset is very small (around 1000
> Entities x 500 attributes, some frequently occurring), the job finishes
> but takes too long (sometimes it gives GC errors too).
>
> If none of the attributes occurs frequently (all < 2%), the job runs
> lightning fast (even for 1000000 Entities x 10000 attributes) and the
> results are very accurate.
>
> I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node with
> 4 cores and 16GB of RAM.
>
> My question is - *Is this behavior expected for datasets where some
> Attributes frequently occur?*
>
> Thanks,
> Manish Gupta