Hi,
>> And that's what I'd
>> like to avoid. Current ("local") solutions are usually limited by the
>> network bandwidth, and hadoop offers some relief on that.
>>
>
> How does network bandwidth come into play in a "local" solution?
The data may not fit on one disk and must then be streamed over the
network to the learning algorithm. If it does fit on one disk, the
algorithm becomes disk-bandwidth bound instead.
>> In a way, I want a sequential program scheduled through hadoop. I will
>> lose the parallelism, but I want to keep data locality, scheduling
>> and restart-on-failure.
>
>
> You're still doing things partially parallelized, right? Because your
> input
> data set is large enough to need to be split between machines, and your
> algorithm can work on each chunk independently? Or is this not
> the case?
There is no parallelism to exploit: I'm doing SGD-style learning. Since
parallelizing SGD is still largely an unsolved problem, the learning
itself is strictly sequential. The desire to run it on a hadoop cluster
stems from the fact that data preprocessing and applying the learned
model are a perfect fit for it. It would be neat if the actual learning
could run on the cluster as well, even if only on a single, carefully
chosen node close to the data.
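
To make this concrete, below is a rough sketch of what I have in mind
(not tested): a map-only job whose input format refuses to split the
file, so a single map task streams through all records in order and
runs the SGD loop, while the framework still provides scheduling and
restart-on-failure. The class names, the "label f1 f2 ..." line format,
the feature dimension and the learning rate are made up for
illustration; and if the file spans several HDFS blocks, only the first
block is guaranteed to be local to the task, which is exactly the
bandwidth concern from above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SequentialSgdJob {

  /** Input format that refuses to split the file, so one map task sees all records in order. */
  public static class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
      return false;
    }
  }

  /** Map-only task running the sequential SGD loop; assumes a "label f1 f2 ..." line format. */
  public static class SgdMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private static final int DIM = 100;       // assumed feature dimension
    private static final double ETA = 0.01;   // assumed learning rate
    private final double[] w = new double[DIM];

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
      String[] parts = line.toString().trim().split("\\s+");
      if (parts.length < 2) {
        return;                               // skip blank or malformed lines
      }
      double y = Double.parseDouble(parts[0]);
      double[] x = new double[DIM];
      for (int i = 1; i < parts.length && i <= DIM; i++) {
        x[i - 1] = Double.parseDouble(parts[i]);
      }
      // Linear model with squared loss: w <- w + eta * (y - w.x) * x
      double pred = 0.0;
      for (int i = 0; i < DIM; i++) {
        pred += w[i] * x[i];
      }
      double err = y - pred;
      for (int i = 0; i < DIM; i++) {
        w[i] += ETA * err * x[i];
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      // Emit the learned weights once, after the single pass over the data.
      StringBuilder sb = new StringBuilder();
      for (double wi : w) {
        sb.append(wi).append(' ');
      }
      context.write(NullWritable.get(), new Text(sb.toString().trim()));
    }
  }

  public static void main(String[] args) throws Exception {
    // Depending on the Hadoop version this may have to be "new Job(conf, ...)" instead.
    Job job = Job.getInstance(new Configuration(), "sequential-sgd");
    job.setJarByClass(SequentialSgdJob.class);
    job.setInputFormatClass(WholeFileTextInputFormat.class);
    job.setMapperClass(SgdMapper.class);
    job.setNumReduceTasks(0);                 // map-only: no shuffle, no reducers
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because setNumReduceTasks(0) makes the job map-only, the weights
written in cleanup() go straight to the output directory without a
shuffle; actually pinning the task to a "carefully chosen node" would
still have to come from somewhere else.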
Thanks,
Markus