Hi Aaron,

When you say that sorting is being worked on, can you elaborate a little more please?
In particular, I want to sort the items within each partition (not globally) without necessarily bringing them all into memory at once.

Thanks,
Roger

On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> There is no fundamental issue if you're running on data that is larger
> than the cluster memory size. Many operations can stream data through, and
> thus memory usage is independent of input data size. Certain operations
> require an entire *partition* (not dataset) to fit in memory, but there are
> not many instances of this left (sorting comes to mind, and this is being
> worked on).
>
> In general, one problem with Spark today is that you *can* OOM under
> certain configurations, and it's possible you'll need to change from the
> default configuration if you're doing very memory-intensive jobs. However,
> there are very few cases where Spark would simply fail as a matter of
> course -- for instance, you can always increase the number of partitions to
> decrease the size of any given one, or repartition data to eliminate skew.
>
> Regarding impact on performance, as Mayur said, there may absolutely be
> an impact depending on your jobs. If you're doing a join on a very large
> amount of data with few partitions, then we'll have to spill to disk. If
> you can't cache your working set of data in memory, you will also see a
> performance degradation. Spark enables the use of memory to make things
> fast, but if you just don't have enough memory, it won't be terribly fast.
>
>
> On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi <mayur.rust...@gmail.com>
> wrote:
>
>> Clearly there will be an impact on performance, but frankly it depends
>> on what you are trying to achieve with the dataset.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>> On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga <vibhorba...@gmail.com>
>> wrote:
>>
>>> Some inputs will be really helpful.
>>>
>>> Thanks,
>>> -Vibhor
>>>
>>>
>>> On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga <vibhorba...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am planning to use Spark with HBase, where I generate an RDD by
>>>> reading data from an HBase table.
>>>>
>>>> I want to know whether, in the case when the HBase table grows larger
>>>> than the RAM available in the cluster, the application will fail, or
>>>> whether there will only be an impact on performance.
>>>>
>>>> Any thoughts in this direction will be helpful and are welcome.
>>>>
>>>> Thanks,
>>>> -Vibhor
>>>
>>>
>>> --
>>> Vibhor Banga
>>> Software Development Engineer
>>> Flipkart Internet Pvt. Ltd., Bangalore
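
P.S. To make my question concrete, here is a rough Scala sketch of the two approaches I can see (the sample data, partition count, and app name are just placeholders, and option (b) assumes the repartitionAndSortWithinPartitions API, which I don't believe is available in the release I'm running yet):

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object PartitionSortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sort-sketch"))

    // Placeholder pair RDD; in practice this might come from an HBase scan.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 97, i))

    // (a) Naive per-partition sort: mapPartitions pulls the whole partition
    // into memory before sorting it -- exactly what I'd like to avoid when a
    // partition is large.
    val sortedInMemory = pairs.mapPartitions(
      iter => iter.toArray.sortBy(_._1).iterator,
      preservesPartitioning = true)

    // (b) Sort by key as part of the shuffle, so the sort can spill to disk
    // rather than holding a whole partition in memory. I understand a later
    // Spark release exposes this as repartitionAndSortWithinPartitions.
    val sortedViaShuffle = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(8))

    sortedInMemory.count()
    sortedViaShuffle.count()
    sc.stop()
  }
}

Option (a) is the in-memory per-partition sort I'm trying to avoid; I'm hoping the work you mentioned enables something like (b), where the sorting happens during the shuffle and doesn't require the whole partition to fit in memory. Is that roughly what you mean?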