Hi there,
We are using mllib 1.1.1, and doing Logistics Regression with a dataset of
about 150M rows.
The training part usually goes pretty smoothly without any retries. But
during the prediction stage and BinaryClassificationMetrics stage, I am
seeing retries with error of fetch failure.
The
Sean's PR may be relevant to this issue
(https://github.com/apache/spark/pull/3702). As a workaround, you can
try to truncate the raw scores to 4 digits (e.g., 0.5643215 - 0.5643)
before sending it to BinaryClassificationMetrics. This may not work
well if he score distribution is very skewed. See
Yes, my change is slightly downstream of this point in the processing
though. The code is still creating a counter for each distinct score
value, and then binning. I don't think that would cause a failure -
just might be slow. At the extremes, you might see 'fetch failure' as
a symptom of things