On Wed, May 7, 2014 at 4:00 AM, Han JU <ju.han.fe...@gmail.com> wrote:
> But in my experience, when reading directly from s3n, Spark creates only
> 1 input partition per file, regardless of the file size. This may lead to
> some performance problems if you have big files.

You can (and perhaps should) always call repartition() on the RDD explicitly to increase your level of parallelism to match the number of cores in your cluster. It's pretty quick, and it will speed up all subsequent operations.
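
For example, here's a minimal sketch in the Spark shell (Scala); the s3n bucket and file name are hypothetical, and sc.defaultParallelism is just one reasonable target -- you can also pass an explicit partition count:

    val lines = sc.textFile("s3n://my-bucket/big-file.txt")       // often just 1 partition for a single file
    val repartitioned = lines.repartition(sc.defaultParallelism)  // shuffle the data into roughly one partition per core
    println(repartitioned.partitions.length)                      // verify the new partition count

Note that repartition() does trigger a shuffle, but for a single big file read from S3 that one-time cost is usually well worth it for the parallelism you gain downstream.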