> The reason why I am asking this kind of question is that reading a CSV
> file on Spark scales up linearly as the data size increases, but reading
> the ORC format on Spark-SQL stays the same as the data size increases,
> as shown in <figure 2>. ... Does this come from (just a property of
> reading the ORC format) or (creating the table for the input and loading
> the input into the table), or both?
ORC readers are more efficient than text readers, but an ORC reader cannot split below a 64MB chunk, while a text reader can split all the way down to one line per task. So it's possible the CSV reads are producing many, many more splits and always running the query across the full cluster. Splitting indiscriminately is not always faster, since each task carries a fixed overhead unrelated to the data size (like plan deserialization in Kryo).

For ORC, 59 tasks can run in the same time as 193 tasks, as long as there's capacity to run all 193 in a single pass (say, 200 executors). Until you run out of capacity, a distributed system *has* to show sub-linear scaling - and it will show flat scaling up to a particular point, because of Amdahl's law.

Cheers,
Gopal
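
P.S. One quick way to check whether the split behaviour explains the
difference is to compare the partition counts Spark creates for each
format. A minimal spark-shell sketch (the input paths are placeholders,
not from this thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("split-check").getOrCreate()

    // CSV/text input: splittable at line granularity, so the partition
    // count grows roughly with the input size.
    val csv = spark.read.option("header", "true").csv("/tmp/input.csv")
    println(s"CSV partitions: ${csv.rdd.getNumPartitions}")

    // ORC input: splits are bounded below by the chunk size, so the
    // same bytes produce far fewer, larger partitions.
    val orc = spark.read.orc("/tmp/input.orc")
    println(s"ORC partitions: ${orc.rdd.getNumPartitions}")

If the CSV partition count keeps climbing with input size while the ORC
count barely moves, that points at the reader split behaviour rather
than at table creation or loading.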
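
The flat-scaling point can also be made concrete: under Amdahl's law,
once the parallel term shrinks below the serial fraction, extra tasks
barely move the wall-clock time. Illustrative only - p = 0.95 here is
an assumed parallel fraction, not a measured one:

    // Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n),
    // where p = parallel fraction of the job, n = concurrent tasks.
    def speedup(p: Double, n: Int): Double = 1.0 / ((1.0 - p) + p / n)

    // With an assumed p = 0.95, the 59-way and 193-way speedups land
    // within about 20% of each other - the curve has already gone flat.
    for (n <- Seq(1, 8, 59, 193))
      println(f"n = $n%3d  ->  speedup ${speedup(0.95, n)}%5.1fx")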