Hi,

When running two experiments with the same application, we see a 50% performance difference between using HDFS and files on local disk, both via the textFile/saveAsTextFile calls. Almost all of the performance loss is in Stage 1.
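For reference, the 4x task-count difference described below follows directly from the split sizes. A quick back-of-the-envelope sketch (assuming each task reads exactly one block/split-sized chunk):

```scala
// Task counts for a 500 GB input at the two observed split sizes.
val totalMB    = 500L * 1024          // 500 GB expressed in MB
val hdfsSplit  = 128L                 // HDFS block size, MB
val localSplit = 32L                  // split size observed for local files, MB

val hdfsTasks  = totalMB / hdfsSplit  // 4000 tasks
val localTasks = totalMB / localSplit // 16000 tasks, i.e. 4x as many
```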
Input (in Stage 0): The file is read in using val input = sc.textFile(inputFile). The total input size is 500 GB. The files on disk are partitioned into 128 MB files, and HDFS is configured with a 128 MB block size. When looking at the number of tasks, we see 4x more tasks in the local-disk run. We have seen this before, and it seems to be because Spark breaks the local files up into 32 MB splits. This does not happen with HDFS.

Output (in Stage 1): The file is written using saveAsTextFile(outputFile). The total output size is 500 GB. Because we use a custom partitioner, we always have 9025 tasks in this stage. This is the stage where we see most of the performance loss.

Questions:
* What is the cause of the performance loss? Possible answers: because of the smaller split size (128 MB vs 32 MB) each write is less efficient (less data being transferred at once), or because of the split size we need to open 4x as many files, leading to a performance loss.
* How can we solve this? (We would prefer not to use HDFS.)
* Bonus question: should I use a different API to get better performance?

Thanks for any responses!

Tom Hubregtsen

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/50-performance-decrease-when-using-local-file-vs-hdfs-tp23987.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
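If the extra tasks from the 32 MB splits turn out to be the cause, one way to keep local files while getting HDFS-sized splits is to raise Hadoop's minimum split size before calling textFile. A minimal sketch, assuming Spark's default Hadoop-based TextInputFormat honours the standard split-size key on your Hadoop version (the 128 MB value and the inputFile/outputFile names come from the setup above; verify the resulting task count in the UI):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("local-fs-split-size"))

// Hadoop's FileInputFormat computes roughly
//   splitSize = max(minSize, min(maxSize, blockSize)).
// The local filesystem reports a small block size (hence the 32 MB
// splits), so raising minSize forces 128 MB splits, matching HDFS.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024)

val input = sc.textFile(inputFile)  // expect ~4000 tasks instead of ~16000
input.saveAsTextFile(outputFile)
```

On older Hadoop versions the equivalent (deprecated) key is mapred.min.split.size. Alternatively, coalescing before the write (input.coalesce(4000)) reduces the number of output tasks without reshuffling, though with your custom partitioner the Stage 1 task count is fixed at 9025 regardless.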