Hi,

Your point on:

"ORC readers are more efficient than reading text, but ORC readers cannot
split beyond a 64MB chunk, while text readers can split down to 1 line per
task."

I thought you could set the stripe size below the default 64MB, for example
16MB by setting 'orc.stripe.size'='16777216'.
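
Something along these lines (a sketch - the table name and columns are made
up; 'orc.stripe.size' is the standard ORC table property, in bytes):

-- 16MB stripes instead of the 64MB default
CREATE TABLE sales_orc (
  id     BIGINT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.stripe.size'='16777216');

Smaller stripes should give the reader more split points, at some cost in
compression and scan efficiency.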

Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

-----Original Message-----
From: Gopal Vijayaraghavan [mailto:go...@hortonworks.com] On Behalf Of Gopal Vijayaraghavan
Sent: 10 February 2016 21:43
To: user@hive.apache.org
Subject: Re: reading ORC format on Spark-SQL

> The reason why I am asking this kind of question is that reading a csv
> file on Spark increases linearly as the data size grows, but reading ORC
> format on Spark-SQL stays the same as the data size increases in
> <figure 2>.

...

> Is this caused by (just a property of reading ORC format) or (creating
> the table for input and loading the input into the table), or both?

ORC readers are more efficient than reading text, but ORC readers cannot
split beyond a 64MB chunk, while text readers can split down to 1 line per
task.
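
A minimal sketch of the knobs involved on the Hive side (these are the
standard MapReduce/Hive property names, the values purely illustrative;
Spark's own ORC path has its own settings):

-- text inputs: allow splits down to ~16MB
SET mapreduce.input.fileinputformat.split.maxsize=16777216;
-- ORC: read file footers and generate splits per stripe
SET hive.exec.orc.split.strategy=ETL;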

So, it's possible the CSV readers are producing many, many more splits and
running the query using the full cluster always - splitting indiscriminately
is not always faster, as each task has some fixed overhead unrelated to the
data size (like plan deserialization in Kryo).

For ORC - 59 tasks can run in the same time as 193 tasks, as long as
there's capacity to run 193 in a single pass (like 200 executors).
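
That's just wave arithmetic (the 200-slot figure is the example above):

  waves = ceil(tasks / executor slots)
  ceil(59 / 200)  = 1 wave
  ceil(193 / 200) = 1 wave  -> same wall-clock time
  ceil(250 / 200) = 2 waves -> roughly double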

Until you run out of capacity, a distributed system *has* to show
sub-linear scaling - and will show flat scaling up to a particular point
because of Amdahl's law.
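
In formula form (p = parallelisable fraction, N = parallel slots):

  S(N) = 1 / ((1 - p) + p / N)

As N grows, the speedup flattens towards 1 / (1 - p), so past a point extra
splits stop showing up in the wall-clock time.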

Cheers,
Gopal
