Re: Using Spark on Data size larger than Memory size

Aaron Davidson Sat, 31 May 2014 23:11:24 -0700

There is no fundamental issue if you're running on data that is larger than
cluster memory size. Many operations can stream data through, and thus
memory usage is independent of input data size. Certain operations require
an entire *partition* (not dataset) to fit in memory, but there are not
many instances of this left (sorting comes to mind, and this is being
worked on).
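
To make the streaming vs. whole-partition distinction concrete, here is a
rough sketch in plain Python (not Spark code; the function names are made up
for illustration). The map-style step touches one record at a time, while the
sort-style step has to hold its entire partition before it can emit anything.

```python
from typing import Iterable, Iterator, List


def stream_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Map-style step: one record in, one record out.

    Only the current record is held, so memory use does not grow
    with the size of the input.
    """
    for line in lines:
        yield len(line)


def sort_partition(partition: Iterable[int]) -> List[int]:
    """Sort-style step: must materialize the whole partition first."""
    return sorted(partition)


def records(n: int) -> Iterator[str]:
    """Hypothetical input source that generates records lazily."""
    for i in range(n):
        yield "x" * (i % 7)


# Streams through all records without ever building a list of them.
total = sum(stream_lengths(records(1000)))

# Needs every element of this (small) partition in memory at once.
ordered = sort_partition(stream_lengths(records(10)))
```

The second case is why it is the partition size, not the dataset size, that
matters for the operations that cannot stream.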

In general, one problem with Spark today is that you *can* OOM under
certain configurations, and it's possible you'll need to change from the
default configuration if you're doing very memory-intensive jobs.
However, there are very few cases where Spark would simply fail as a matter
of course. For instance, you can always increase the number of
partitions to decrease the size of any given one, or repartition data to
eliminate skew.
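
As a back-of-the-envelope sketch of the partition-count arithmetic (plain
Python; the helper name and the sizes are assumptions for illustration, not
Spark defaults), you can pick a per-partition size you are comfortable holding
in memory and derive how many partitions that implies:

```python
def partitions_needed(dataset_bytes: int, target_partition_bytes: int) -> int:
    """Smallest partition count that keeps the average partition at or
    below the target size. Assumes data is spread evenly (no skew)."""
    if dataset_bytes < 0:
        raise ValueError("dataset_bytes must be non-negative")
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    # Integer ceiling division avoids float rounding on large sizes.
    return max(1, -(-dataset_bytes // target_partition_bytes))


# Hypothetical numbers: a 1 TiB table with a 128 MiB per-partition budget.
one_tib = 1 << 40
budget = 128 << 20
n = partitions_needed(one_tib, budget)
```

The result is the kind of number you would hand to something like
RDD.repartition(n). With skewed keys some partitions will still end up much
larger than the average, so treat this as a lower bound.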

Regarding impact on performance, as Mayur said, there may absolutely be an
impact depending on your jobs. If you're doing a join on a very large
amount of data with few partitions, then we'll have to spill to disk. If
you can't cache your working set of data in memory, you will also see a
performance degradation. Spark enables the use of memory to make things
fast, but if you just don't have enough memory, it won't be terribly fast.

On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi <mayur.rust...@gmail.com>
wrote:

> Clearly there will be an impact on performance, but frankly it depends on
> what you are trying to achieve with the dataset.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga <vibhorba...@gmail.com>
> wrote:
>
>> Some inputs will be really helpful.
>>
>> Thanks,
>> -Vibhor
>>
>>
>> On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga <vibhorba...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am planning to use Spark with HBase, where I generate an RDD by reading
>>> data from an HBase table.
>>>
>>> I want to know whether, when the size of the HBase table grows larger
>>> than the RAM available in the cluster, the application will fail, or
>>> whether there will only be an impact on performance.
>>>
>>> Any thoughts in this direction will be helpful and are welcome.
>>>
>>> Thanks,
>>> -Vibhor
>>>
>>
>>
>>
>> --
>> Vibhor Banga
>> Software Development Engineer
>> Flipkart Internet Pvt. Ltd., Bangalore
>>
>>
>
