retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Thomas Kwan
Hi there, We are using MLlib 1.1.1 to run logistic regression on a dataset of about 150M rows. The training part usually goes pretty smoothly without any retries. But during the prediction stage and the BinaryClassificationMetrics stage, I am seeing retries caused by fetch failure errors. The

Re: retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Xiangrui Meng
Sean's PR may be relevant to this issue (https://github.com/apache/spark/pull/3702). As a workaround, you can try to truncate the raw scores to 4 digits (e.g., 0.5643215 -> 0.5643) before sending them to BinaryClassificationMetrics. This may not work well if the score distribution is very skewed. See
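
A minimal sketch of the truncation workaround described above, written in plain Python so it runs without a cluster. The helper name truncate_score and the sample (score, label) pairs are hypothetical and not from the thread; in an actual job the same function would presumably be applied in a map over the score/label RDD before it is handed to BinaryClassificationMetrics.

```python
import math

def truncate_score(score, digits=4):
    """Truncate (not round) a score to `digits` decimal places.

    Uses floor, so for negative scores the value moves toward
    negative infinity rather than toward zero.
    """
    factor = 10 ** digits
    return math.floor(score * factor) / factor

# Hypothetical (score, label) pairs standing in for model output.
scored = [
    (0.5643215, 1.0),
    (0.5643987, 0.0),
    (0.5643001, 1.0),
    (0.1200459, 0.0),
]

# Truncate each score while keeping its label.
truncated = [(truncate_score(s), label) for s, label in scored]

# Fewer distinct score values means fewer per-score counters downstream.
distinct_before = len({s for s, _ in scored})
distinct_after = len({s for s, _ in truncated})
print(distinct_before, distinct_after)
```

The point of the design is only to cap the number of distinct score values (at most 10,001 for scores in [0, 1] with 4 digits). As noted in the message, this loses resolution when many scores are packed into a narrow range.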

Re: retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Sean Owen
Yes, though my change is slightly downstream of this point in the processing. The code is still creating a counter for each distinct score value, and then binning. I don't think that would cause a failure; it just might be slow. At the extremes, you might see 'fetch failure' as a symptom of things