Preferred RDD Size

2014-05-15 Thread Sai Prasanna
Hi, Is there any lower bound on the size of an RDD needed to make good use of Spark's in-memory framework? Say creating an RDD for a very small data set of some 64 MB is not as efficient as for one of some 256 MB; then the application could be tuned accordingly. So is there a soft lower bound related to the Hadoop block size?

Re: Preferred RDD Size

2014-05-12 Thread Andrew Ash
At the minimum, to get decent parallelization you'd want to have some data on every machine. If you're reading from HDFS, then the smallest you'd want is one HDFS block per server in your cluster. Note that Spark will work at smaller sizes, but in order to make use of all your machines when your data set is small, you need to make sure it is split into enough partitions to put some work on each of them.
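
A minimal sketch of that idea, assuming a hypothetical 4-worker cluster and an illustrative HDFS path (both are placeholders, not from the thread): read the file, check how many partitions it produced, and repartition if there are fewer partitions than machines so every worker gets some data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SmallRddSizing {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("small-rdd-sizing")
    val sc = new SparkContext(conf)

    // Hypothetical values for illustration only.
    val numWorkers = 4
    val lines = sc.textFile("hdfs:///data/small-input") // roughly one partition per HDFS block

    // A 64 MB file with a 64 MB block size yields a single partition, so only one
    // machine would do work. Repartition to spread the data across the cluster.
    val spread =
      if (lines.partitions.length < numWorkers) lines.repartition(numWorkers)
      else lines

    println(s"partitions: ${spread.partitions.length}")
    println(s"line count: ${spread.count()}")

    sc.stop()
  }
}
```

The repartition adds a shuffle, which is cheap for a small data set and is usually worth it to keep all machines busy in the later stages.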