Hi, I am running Column Similarity (All Pairs Similarity using DIMSUM) in Spark on a dataset of the form (Entity, Attribute, Value), after transforming it into a row-oriented dense matrix format (one row per Attribute, one column per Entity, each cell holding a normalized value between 0 and 1).
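To make the layout concrete, here is a minimal sketch of that pivot in plain Python (toy triples and names are hypothetical, just to show row-per-Attribute, column-per-Entity):

```python
# Hypothetical toy data: (Entity, Attribute, Value) triples, values already in [0, 1].
triples = [
    ("e1", "a1", 0.5), ("e2", "a1", 1.0),
    ("e1", "a2", 0.2), ("e3", "a2", 0.7),
]

entities = sorted({e for e, _, _ in triples})
attributes = sorted({a for _, a, _ in triples})
col = {e: j for j, e in enumerate(entities)}  # fixed column index per Entity

# One dense row per Attribute, one column per Entity; missing cells stay 0.0.
matrix = {a: [0.0] * len(entities) for a in attributes}
for e, a, v in triples:
    matrix[a][col[e]] = v
```

Each row of this matrix is what goes into the RowMatrix on which column similarities are computed.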
It computes similarities between Entities extremely fast in most cases, but if there is even a single attribute that occurs frequently across the entities (say, in 30% of them), the job falls apart: the whole job gets stuck and the worker nodes run at 100% CPU without making any progress on the stage. If the dataset is very small (around 1000 Entities X 500 Attributes, some frequently occurring), the job finishes but takes too long, and sometimes it throws GC errors. If none of the attributes is frequent (all occur in < 2% of entities), the job runs lightning fast (even for 1000000 Entities X 10000 Attributes) and the results are very accurate.

I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node with 4 cores and 16 GB of RAM.

My question is: is this behavior expected for datasets where some Attributes occur frequently?

Thanks,
Manish Gupta
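To put rough numbers on the skew (a back-of-envelope sketch, not the actual DIMSUM cost model; the helper and figures are mine): without aggressive sampling, a single attribute row with k nonzero entries contributes on the order of C(k, 2) entity-pair terms to the similarity computation, so one frequent attribute can dominate everything else:

```python
from math import comb

def pairs_from_row(n_entities, frequency):
    """Off-diagonal entity pairs touched by one attribute row in a
    brute-force column-similarity pass: C(k, 2), with k = frequency * n."""
    k = int(n_entities * frequency)
    return comb(k, 2)

# A rare attribute (2% of 1,000,000 entities) vs a frequent one (30%):
rare = pairs_from_row(1_000_000, 0.02)      # ~2.0e8 pairs
frequent = pairs_from_row(1_000_000, 0.30)  # ~4.5e10 pairs
```

Under these (hypothetical) numbers the 30% attribute alone generates over 200x the work of a 2% attribute, which would be consistent with the stuck stages and GC pressure described above.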