Hi Aaron,

When you say that sorting is being worked on, can you elaborate a little more please?
In particular, I want to sort the items within each partition (not globally) without necessarily bringing them all into memory at once.

Thanks,
Roger

On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> There is no fundamental issue if you're running on data that is larger
> than the cluster memory size. Many operations can stream data through, and
> thus memory usage is independent of input data size. Certain operations
> require an entire *partition* (not dataset) to fit in memory, but there are
> not many instances of this left (sorting comes to mind, and this is being
> worked on).
>
> In general, one problem with Spark today is that you *can* OOM under
> certain configurations, and it's possible you'll need to change from the
> default configuration if you're doing very memory-intensive jobs. However,
> there are very few cases where Spark would simply fail as a matter of
> course -- for instance, you can always increase the number of partitions to
> decrease the size of any given one, or repartition data to eliminate skew.
>
> Regarding impact on performance, as Mayur said, there may absolutely be
> an impact depending on your jobs. If you're doing a join on a very large
> amount of data with few partitions, then we'll have to spill to disk. If
> you can't cache your working set of data in memory, you will also see a
> performance degradation. Spark enables the use of memory to make things
> fast, but if you just don't have enough memory, it won't be terribly fast.
>
>
> On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi <mayur.rust...@gmail.com>
> wrote:
>
>> Clearly there will be an impact on performance, but frankly it depends
>> on what you are trying to achieve with the dataset.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>> On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga <vibhorba...@gmail.com>
>> wrote:
>>
>>> Some inputs will be really helpful.
>>>
>>> Thanks,
>>> -Vibhor
>>>
>>>
>>> On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga <vibhorba...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am planning to use Spark with HBase, where I generate an RDD by
>>>> reading data from an HBase table.
>>>>
>>>> I want to know whether, in the case when the HBase table grows larger
>>>> than the RAM available in the cluster, the application will fail, or
>>>> whether there will only be an impact on performance.
>>>>
>>>> Any thoughts in this direction will be helpful and are welcome.
>>>>
>>>> Thanks,
>>>> -Vibhor
>>>
>>>
>>> --
>>> Vibhor Banga
>>> Software Development Engineer
>>> Flipkart Internet Pvt. Ltd., Bangalore
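
P.S. To make my question concrete, here is a rough Scala sketch of the two approaches I can see (the sample data, partition count, and app name are just placeholders, and option (b) assumes the repartitionAndSortWithinPartitions API, which I don't believe is available in the release I'm running yet):

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object PartitionSortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sort-sketch"))

    // Placeholder pair RDD; in practice this might come from an HBase scan.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 97, i))

    // (a) Naive per-partition sort: mapPartitions pulls the whole partition
    // into memory before sorting it -- exactly what I'd like to avoid when a
    // partition is large.
    val sortedInMemory = pairs.mapPartitions(
      iter => iter.toArray.sortBy(_._1).iterator,
      preservesPartitioning = true)

    // (b) Sort by key as part of the shuffle, so the sort can spill to disk
    // rather than holding a whole partition in memory. I understand a later
    // Spark release exposes this as repartitionAndSortWithinPartitions.
    val sortedViaShuffle = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(8))

    sortedInMemory.count()
    sortedViaShuffle.count()
    sc.stop()
  }
}

Option (a) is the in-memory per-partition sort I'm trying to avoid; I'm hoping the work you mentioned enables something like (b), where the sorting happens during the shuffle and doesn't require the whole partition to fit in memory. Is that roughly what you mean?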