> The reason I am asking this question is that the time to read a CSV file
> on Spark increases linearly as the data size grows, but the time to read
> ORC format on Spark-SQL stays the same as the data size increases, as
> shown in <figure 2>.
...
> Is the cause (a) just a property of reading the ORC format, (b) creating
> the table for the input and loading the input into it, or (c) both?

ORC readers are more efficient than text readers, but ORC readers cannot
split finer than a 64MB chunk, while text readers can split all the way
down to 1 line per task.
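If you want to verify this on your own data, comparing the number of input
partitions Spark creates for each format makes the difference visible. A
minimal sketch, assuming a Spark 2.x spark-shell and hypothetical input
paths:

    // Hypothetical paths - substitute your own datasets.
    val csvDF = spark.read.option("header", "true").csv("/data/input.csv")
    val orcDF = spark.read.orc("/data/input.orc")

    // Roughly one input split per partition, one task per partition.
    println(s"CSV partitions: ${csvDF.rdd.getNumPartitions}")
    println(s"ORC partitions: ${orcDF.rdd.getNumPartitions}")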

So, it's possible the CSV readers are producing many more splits and always
running the query across the full cluster. Splitting indiscriminately is
not always faster, since each task carries some fixed overhead unrelated to
the data size (like plan deserialization in Kryo).
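If the CSV side is over-splitting, the splits can usually be coarsened with
configuration. A sketch, assuming the Spark 2.x file-based data sources
(for older Hadoop-based input formats the equivalent knob would be
mapreduce.input.fileinputformat.split.minsize):

    // Pack up to 256MB of input into each partition instead of the
    // 128MB default, roughly halving the number of CSV read tasks.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)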

For ORC - 59 tasks can run in the same time as 193 tasks, as long as
there's capacity to run all 193 in a single pass (like 200 executors).
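The arithmetic behind that is just scheduling waves: once tasks are roughly
uniform, wall-clock time tracks the number of waves, not the raw task
count. A toy sketch (the slot count is a made-up number):

    // Wall time ~= waves * per-task time for uniform tasks.
    def waves(tasks: Int, slots: Int): Int =
      (tasks + slots - 1) / slots        // ceil(tasks / slots)

    val slots = 200                      // hypothetical cluster capacity
    println(waves(59, slots))            // 1 wave
    println(waves(193, slots))           // 1 wave -> same wall time
    println(waves(401, slots))           // 3 waves -> scaling reappears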

Until you run out of capacity, a distributed system *has* to show
sub-linear scaling - and it will show flat scaling up to a particular
point, because of Amdahl's law.
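For reference, the textbook form of Amdahl's law (not specific to this
thread) says that with a parallel fraction p and n slots, speedup is
1 / ((1 - p) + p / n) - so once the parallel part fits in a single wave,
the serial part dominates and runtime looks flat. A toy calculation with
an assumed 5% serial fraction (planning, scheduling):

    // Textbook Amdahl's law: the serial fraction caps the speedup.
    def speedup(p: Double, n: Int): Double = 1.0 / ((1.0 - p) + p / n)

    println(speedup(0.95, 59))   // ~15.1x
    println(speedup(0.95, 193))  // ~18.2x
    println(speedup(0.95, 200))  // ~18.3x - more slots barely help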

Cheers,
Gopal

