We run disk-to-disk iterative algorithms in Spark all the time, on datasets
that do not fit in memory, and it works well for us. I usually have to do
some tuning of the number of partitions for a new dataset, but that's about
it in terms of inconvenience.
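
For illustration, here is a minimal sketch of the pattern (the paths, the
partition count, and the step function are all placeholders, not our actual
jobs):

    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeDiskToDisk {
      // Placeholder per-record transform; stands in for one pass of the algorithm.
      def step(line: String): String = line

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterative-disk-to-disk"))
        // The partition count is the main per-dataset knob; 2000 is just an example.
        val numPartitions = 2000
        var current = sc.textFile("hdfs:///data/in", numPartitions)
        for (i <- 1 to 10) {
          // Materialize each iteration back to disk instead of caching in memory.
          current.map(step).saveAsTextFile(s"hdfs:///data/iter-$i")
          current = sc.textFile(s"hdfs:///data/iter-$i", numPartitions)
        }
        sc.stop()
      }
    }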
On May 26, 2016 2:07 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:


Spark can handle this, true, but it is optimized for working on the same
full dataset in memory, given the iterative nature of the underlying machine
learning algorithms. Of course, you can spill over to disk, but that is
something you should avoid.
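
To illustrate the spill-over I mean: a dataset can be persisted so that the
partitions that do not fit in memory go to disk instead of staying resident
(a minimal sketch; sc and the input path are assumed):

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk,
    // so iterations still run when the dataset exceeds aggregate memory.
    val data = sc.textFile("hdfs:///data/in")
      .persist(StorageLevel.MEMORY_AND_DISK)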

That being said, you should have read my final sentence about this: both
systems develop and change.


On 25 May 2016, at 22:14, Reynold Xin <r...@databricks.com> wrote:


On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Spark is more for machine learning, working iteratively over the same
> whole dataset in memory. Additionally it has streaming and graph processing
> capabilities that can be used together.
>

Hi Jörn,

The first part is actually not true. Spark can handle data far greater than
the aggregate memory available on a cluster. The more recent versions
(1.3+) of Spark have external operations for almost all built-in operators,
and while things may not be perfect, those external operators are becoming
more and more robust with each version of Spark.
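
For example, a wide aggregation like the following relies on those external
(spill-to-disk) code paths whenever the shuffled state exceeds memory (a
minimal sketch; the paths and partition count are made up):

    // Word count over a dataset larger than cluster memory; the map-side
    // combine and the reduce-side aggregation both spill to disk as needed.
    val counts = sc.textFile("hdfs:///data/huge")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _, 2000)
    counts.saveAsTextFile("hdfs:///data/counts")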
