This might be a question for Xiangrui. Recently I was using
BinaryClassificationMetrics to build an AUC curve for a classifier
over a reasonably large number of points (~12M). The scores were all
probabilities, so they tended to be almost entirely unique.

The computation does some operations by key (effectively one key per
distinct score), and this ran out of memory. You can work around it by
giving the job more than the default amount of memory, but in this case
it seemed of little use to build an AUC curve at such fine-grained
resolution.

I ended up just binning the scores so there were only ~1000 unique
values, and then it was fine.
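
For concreteness, here's a minimal Scala sketch of the kind of binning
I mean; the method name and the scoreAndLabels input are my own, but
BinaryClassificationMetrics and areaUnderROC are the MLlib API:

  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
  import org.apache.spark.rdd.RDD

  // scoreAndLabels: (probability, label) pairs from the classifier.
  // Rounding each probability to 3 decimal places leaves at most ~1001
  // distinct scores, so the by-key operations stay small.
  def binnedAreaUnderROC(scoreAndLabels: RDD[(Double, Double)]): Double = {
    val binned = scoreAndLabels.map { case (score, label) =>
      (math.rint(score * 1000) / 1000, label)
    }
    new BinaryClassificationMetrics(binned).areaUnderROC()
  }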

Does that sound generally useful as some kind of parameter, or am I
missing a trick here?

Sean
