Re: Format dillema

Gopal Vijayaraghavan Thu, 22 Jun 2017 16:30:47 -0700

> I kept hearing about vectorization, but later found out it was going to work 
> if I used ORC.


Yes, it's a tautology - if you cared about performance, you'd use ORC, because 
ORC is the fastest format.

And doing performance work to support folks who don't quite care about it is 
not exactly "see a need, fill a need".

> Literally years have come and gone and we are talking like 3.x is going to 
> vectorize text.

Literally years have gone by since the feature came into Hive. Though it might 
have crept up on you - if Vectorization had been enabled by default, it 
would've been immediately obvious.

HIVE-9937 is so old that I'd say the first line towards Text vectorization 
landed in Q1 2015.

In the current master, you can get a huge boost out of it - if you want, you 
can run BI over 100 TB of text.

https://www.slideshare.net/Hadoop_Summit/llap-building-cloudfirst-bi/27

> … where some not negligible part of the features ONLY work with ORC.

You've got it backwards - ORC was designed to support those features.

Parquet could be following ORC closely, but at least the Java implementation 
hasn't.

Cheers,
Gopal
