Hi,
Is there any lower bound on the size of an RDD for optimal use of Spark's
in-memory framework?
Say creating an RDD over a very small data set of about 64 MB is not as
efficient as one over about 256 MB; if so, the application could be tuned accordingly.
So is there a soft lower bound related to the Hadoop block size?
At a minimum, to get decent parallelization, you'd want to have some data
on every machine. If you're reading from HDFS, then the smallest you'd
want is one HDFS block per server in your cluster.
Note that Spark will work at smaller sizes, but in order to make use of all
your machines when your program runs, you'll want at least one partition
of data per machine.
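
To make that concrete, here's a minimal sketch (not from the original exchange; the HDFS path, app name, and cluster size are made up) showing how you could ask Spark for a minimum number of partitions when reading a small file, so every machine gets some work:

import org.apache.spark.{SparkConf, SparkContext}

object SmallRddSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("small-rdd-sketch")
    val sc = new SparkContext(conf)

    // Hypothetical cluster size; in practice you'd match this to the
    // number of machines (or cores) you want to keep busy.
    val numMachines = 8

    // textFile accepts a minimum number of partitions, so even a single
    // 64 MB file can be split across the cluster instead of landing on
    // one or two machines. The path below is only for illustration.
    val lines = sc.textFile("hdfs:///path/to/small-input.txt", numMachines)

    // An already-created RDD can also be spread out explicitly.
    val spread = lines.repartition(numMachines)

    println(s"partitions: ${spread.partitions.length}")
    sc.stop()
  }
}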