There is no fundamental issue if you're running on data that is larger than cluster memory size. Many operations can stream data through, and thus memory usage is independent of input data size. Certain operations require an entire *partition* (not dataset) to fit in memory, but there are not many instances of this left (sorting comes to mind, and this is being worked on).
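To make the partition-versus-dataset distinction concrete, here is a rough back-of-the-envelope sketch in plain Python (not Spark code). It estimates how many partitions you would need so that a single partition stays under some memory budget. The helper name `min_partitions`, the 2 TiB table size, and the 256 MiB budget are all made-up illustrations, and the arithmetic assumes the data splits evenly, which skewed data will not do.

```python
def min_partitions(dataset_bytes: int, per_partition_budget_bytes: int) -> int:
    """Smallest partition count that keeps an evenly split partition within budget.

    Hypothetical helper for illustration only; it is not part of any Spark API
    and it ignores skew and serialization/object overhead.
    """
    if dataset_bytes < 0:
        raise ValueError("dataset_bytes must be non-negative")
    if per_partition_budget_bytes <= 0:
        raise ValueError("per_partition_budget_bytes must be positive")
    # Integer ceiling division, with a floor of one partition.
    needed = (dataset_bytes + per_partition_budget_bytes - 1) // per_partition_budget_bytes
    return max(1, needed)


MIB = 1024 ** 2
TIB = 1024 ** 4

# Assumed numbers: a 2 TiB table and a 256 MiB budget per partition.
table_bytes = 2 * TIB
budget_bytes = 256 * MIB

print(min_partitions(table_bytes, budget_bytes))  # 8192
```

The point is only that the per-partition size, not the total dataset size, is what has to fit; a count like this could serve as a starting value for the number of partitions you ask Spark for, to be adjusted after looking at the real partition sizes.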
In general, one problem with Spark today is that you *can* OOM under certain configurations, and you may need to change the default configuration if you're running very memory-intensive jobs. However, there are very few cases where Spark would simply fail as a matter of course. For instance, you can always increase the number of partitions to decrease the size of any given one, or repartition the data to eliminate skew.

Regarding impact on performance, as Mayur said, there may absolutely be an impact depending on your jobs. If you're doing a join on a very large amount of data with few partitions, then we'll have to spill to disk. If you can't cache your working set of data in memory, you will also see a performance degradation. Spark enables the use of memory to make things fast, but if you just don't have enough memory, it won't be terribly fast.

On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> Clearly there will be an impact on performance, but frankly it depends on
> what you are trying to achieve with the dataset.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
> On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga <vibhorba...@gmail.com>
> wrote:
>
>> Some inputs will be really helpful.
>>
>> Thanks,
>> -Vibhor
>>
>> On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga <vibhorba...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am planning to use Spark with HBase, where I generate an RDD by
>>> reading data from an HBase table.
>>>
>>> I want to know: when the size of the HBase table grows larger than the
>>> RAM available in the cluster, will the application fail, or will there
>>> be an impact on performance?
>>>
>>> Any thoughts in this direction will be helpful and are welcome.
>>>
>>> Thanks,
>>> -Vibhor
>>
>> --
>> Vibhor Banga
>> Software Development Engineer
>> Flipkart Internet Pvt. Ltd., Bangalore