If an algorithm has a stochastic/random element, then no, it won't
necessarily produce the same result, by design. If you can fix the seed of
the random number generator, you should get the same result. But if the
process is multi-threaded or distributed, even that doesn't guarantee it,
because the RNG could be accessed in a different order. Even if you can
control your own code, it can be hard to control the RNGs in third-party
libraries. And even in a deterministic single-threaded program, Java's
floating-point results are not guaranteed to be the same across platforms
(unless you use strictfp).
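
As a rough sketch of the seeding point, using only java.util.Random: the
class and method names below (SeedDemo, draw, assign) are made up for
illustration and are not Mahout code. The assign method is a
single-threaded stand-in for workers that share one generator and reach it
in some order.

```java
import java.util.Arrays;
import java.util.Random;

public class SeedDemo {
    // Draw n values from a generator created with the given seed.
    static double[] draw(long seed, int n) {
        Random rng = new Random(seed);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.nextDouble();
        }
        return out;
    }

    // Hypothetical stand-in for workers sharing one generator: the worker
    // listed first in 'order' receives the first draw, and so on.
    static double[] assign(long seed, int[] order) {
        Random shared = new Random(seed);
        double[] perWorker = new double[order.length];
        for (int worker : order) {
            perWorker[worker] = shared.nextDouble();
        }
        return perWorker;
    }

    public static void main(String[] args) {
        // Same seed, same sequence of draws.
        System.out.println(Arrays.equals(draw(42L, 5), draw(42L, 5)));
        // Same seed, different access order: the workers' values are swapped.
        System.out.println(Arrays.equals(
                assign(42L, new int[] {0, 1}),
                assign(42L, new int[] {1, 0})));
    }
}
```

The first comparison holds because the seed fully determines the sequence.
The second does not, because which worker gets which draw depends on who
asks first, and that is what a thread scheduler or a cluster can change
between runs.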

ALS definitely has a random starting point, so reproducibility is not
guaranteed even from the top. If you fix the random seed in the context of
this project's unit tests, you *should* get the same result since I think
it manages to use no third-party RNGs and runs a test from a fixed starting
point in 1 thread.

KNN does not have a stochastic element. I think you would get the same
results on one platform, unless I'm missing something.

I don't think exact reproducibility is an issue, certainly not at scale,
where the entire computation is distributed over a complex cluster
environment. Most ML is about guessing at what's not known anyway. As long
as very small differences in the computation make only very small
differences in the outcome, differing FP behavior will make no or
vanishingly small difference.
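
To give a sense of the size of difference I mean, here is a small sketch
(SumOrderDemo and its methods are names I made up for this example). It
assumes the differing behavior comes from the order of additions, which is
one common way a distributed sum can vary between runs, since
floating-point addition is not associative.

```java
public class SumOrderDemo {
    // Add the three values left to right.
    static double leftFirst(double a, double b, double c) {
        return (a + b) + c;
    }

    // Add the same three values, grouping the last two first.
    static double rightFirst(double a, double b, double c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        double x = leftFirst(0.1, 0.2, 0.3);
        double y = rightFirst(0.1, 0.2, 0.3);
        System.out.println(x == y);          // the two groupings disagree
        System.out.println(Math.abs(x - y)); // but only in the last bit
    }
}
```

The two results are not bit-for-bit equal, but they differ by roughly
1e-16, far below anything a recommender's ranking should be sensitive to.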

The only place where I think FP reproducibility matters -- of the sort that
numerical libraries care about -- is in under/overflow issues. But that is
usually solved by moving into log space or something similar. You would
never want to depend on the nth significant digit of a float mattering.
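
A minimal sketch of the log-space idea, with made-up names (LogSpaceDemo,
product, logProduct) and made-up input values: multiplying many small
probabilities underflows to zero, while summing their logarithms stays
comfortably in range.

```java
import java.util.Arrays;

public class LogSpaceDemo {
    // Multiply the probabilities directly; underflows for long inputs.
    static double product(double[] probs) {
        double result = 1.0;
        for (double p : probs) {
            result *= p;
        }
        return result;
    }

    // Sum the logarithms instead; this is the log of the same product.
    static double logProduct(double[] probs) {
        double sum = 0.0;
        for (double p : probs) {
            sum += Math.log(p);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] probs = new double[1000];
        Arrays.fill(probs, 1e-5); // illustrative values only
        System.out.println(product(probs));    // underflows to 0.0
        System.out.println(logProduct(probs)); // finite, about -11512.9
    }
}
```

The true product here is 1e-5000, which no double can represent, whereas
its logarithm is an ordinary number that can still be compared or added to
other log-probabilities.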




On Sun, Mar 17, 2013 at 1:43 PM, Koobas <koo...@gmail.com> wrote:

> I am asking the basic reproducibility question.
> If I run twice on the same dataset, with the same hardware setup, will I
> always get the same results?
> Or is there any chance that on two different runs, the same user will get
> slightly different suggestions?
> I am mostly revolving in the space of numerical libraries, where
> reproducibility is, sort of, a big deal.
> Maybe it's not much of a concern in machine learning.
> I am just curious.
>
>
> On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen <sro...@gmail.com> wrote:
>
> > What's your question? ALS has a random starting point which changes the
> > results a bit. Not sure about KNN though.
> >
> >
>
> > On Sun, Mar 17, 2013 at 3:03 AM, Koobas <koo...@gmail.com> wrote:
> >
> > > Can anybody shed any light on the issue of reproducibility in Mahout,
> > > with and without Hadoop, specifically in the context of kNN and ALS
> > > recommenders?
> > >
> >
>
