Thanks for your reply!

So, according to you, because of ORC's inherent format it cannot be split
below the default chunk size, since it is not composed of lines the way
CSV is.

Until you run out of capacity, a distributed system *has* to show sub-linear
scaling - and will show flat scaling up to a particular point because of
Amdahl's law.

This sentence is a bit confusing to me. So the time to read a CSV file on
Spark increases linearly as the data grows because it employs the full
cluster, which means it runs out of capacity?
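
To verify this, I suppose I could compare how many input splits each format
produces, something like the following in spark-shell (just a sketch; the
paths are placeholders, and I assume a Spark 1.x sqlContext with the
spark-csv package available):

  // load the same dataset in both formats (placeholder paths)
  val csv = sqlContext.read.format("com.databricks.spark.csv").load("/path/to/data.csv")
  val orc = sqlContext.read.format("orc").load("/path/to/data.orc")

  // the partition count approximates the task count of the scan stage
  println(csv.rdd.partitions.length + " vs " + orc.rdd.partitions.length)

Would that show whether the CSV job really occupies the whole cluster?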

On the other hand, does the time to read the ORC format show flat scaling
because it has not exceeded capacity yet?
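
Just as a sanity check on my reading of Amdahl's law (the numbers below are
my own, only for illustration): with parallel fraction p on n slots, the
speedup is

  S(n) = 1 / ((1 - p) + p / n)

so with, say, p = 0.95, S(59) is roughly 15.1 while S(193) is roughly 18.2.
More than 3x the tasks buys only about 20% more speedup, which would already
look nearly flat on a plot. Is that the right way to read it?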

But my guess is that loading the CSV file is not that big a job anyway.

Could you correct me?
Thanks in advance.

Best,
Phil

On Wed, Feb 10, 2016 at 10:51 PM, Mich Talebzadeh <m...@peridale.co.uk>
wrote:

> Hi,
>
> Your point on
>
> *"ORC readers are more efficient than reading text, but ORC readers cannot
> split beyond a 64Mb chunk, while text readers can split down to 1 line per
> task."*
>
> I thought you could decide on stripe sizes smaller than the default 64MB,
> for example 16MB with the setting 'orc.stripe.size'='16777216'.
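>
> For instance, something like this (a minimal sketch from spark-shell;
> the table and column names are placeholders, and sqlContext is assumed
> to be a HiveContext):
>
>   sqlContext.sql("""CREATE TABLE orc_16mb (id INT, payload STRING)
>     STORED AS ORC
>     TBLPROPERTIES ('orc.stripe.size'='16777216')""")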
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only; if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free; therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
> -----Original Message-----
> From: Gopal Vijayaraghavan [mailto:go...@hortonworks.com] On Behalf Of
> Gopal Vijayaraghavan
> Sent: 10 February 2016 21:43
> To: user@hive.apache.org
> Subject: Re: reading ORC format on Spark-SQL
>
> > The reason why I am asking this kind of question is that reading a csv
> > file on Spark increases linearly as the data size increases a bit, but
> > reading the ORC format on Spark-SQL stays the same as the data size
> > increases in <figure 2>.
> ...
> > Is this caused by (just the property of reading the ORC format), or
> > (creating the table for input and loading the input into the table), or
> > both?
>
> ORC readers are more efficient than reading text, but ORC readers cannot
> split beyond a 64Mb chunk, while text readers can split down to 1 line per
> task.
>
> So, it's possible the CSV readers are producing many, many more divisions
> and running the query using the full cluster always - splitting
> indiscriminately is not always faster, as each task has some fixed overhead
> unrelated to the data size (like plan deserialization in Kryo).
>
> For ORC - 59 tasks can run in the same time as 193 tasks, as long as
> there's capacity to run 193 in a single pass (like 200 executors).
>
> Until you run out of capacity, a distributed system *has* to show
> sub-linear scaling - and will show flat scaling up to a particular point
> because of Amdahl's law.
>
> Cheers,
> Gopal



-- 

==========================================================

*Hae Joon Lee*


Now, in Germany,

M.S. Candidate, Interested in Distributed System, Iterative Processing

Dept. of Computer Science, Informatik in German, TUB

Technical University of Berlin


In Korea,

M.S. Candidate, Computer Architecture Laboratory

Dept. of Computer Science, KAIST


Rm# 4414 CS Dept. KAIST

373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)


Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea

==========================================================
