Awesome. Thanks for the update!
On Tue, Jun 20, 2017 at 10:04 AM, Owen O'Malley
wrote:
> The natives are very restless. I'm actively working on getting Hive 2.2
> released. I'm running through qfile tests now and I hope to have it in the
> next couple weeks. It will be quickly followed up by Hi
"Hive 3.x branch has text vectorization and LLAP cache support for it, so
hopefully the only relevant concern about Text will be the storage costs
due to poor compression (& the lack of updates)."
I kept hearing about vectorization, but later found out it was going to
work if I used ORC. Literall
> 1) both do the same thing.
The start of this thread is the exact opposite - trying to suggest ORC is
better for storage & wanting to use it.
> As it relates to the columnar formats, it is a silly arms race.
I'm not sure "silly" is the operative word - we've lost a lot of fragmentation
of the c
"Hive and LLAP do support Parquet precisely because the developers want to
be able to process everyone's data."
Yes. But there are a number of optimizations on the Hive ORC side that we
know are not implemented in the Parquet support, which is why I made my
statement. Impala (Parquet=yes, ORC=no) Hiv
On Tue, Jun 20, 2017 at 10:12 AM, Edward Capriolo
wrote:
> It is whack that two optimized row columnar formats exist and each
> respective project (hive/impala) has good support for one and lame/no
> support for the other.
>
We have two similar formats because they were designed at roughly the
It is whack that two optimized row columnar formats exist and each
respective project (hive/impala) has good support for one and lame/no
support for the other.
Impala is now an Apache project. Also 'whack' and 'lame' are technical
terms often used by the people in the real world that have to use
The natives are very restless. I'm actively working on getting Hive 2.2
released. I'm running through qfile tests now and I hope to have it in the
next couple weeks. It will be quickly followed up by Hive 2.3, which will
be more aggressive with features, but less stable.
.. Owen
On Mon, Jun 19, 2
You should also try LLAP. With ORC or text, it will cache the hot columns
and partitions in memory. I can't seem to find the slides yet, but the
Comcast team had good results with LLAP:
https://dataworkssummit.com/san-jose-2017/sessions/hadoop-query-performance-smackdown/
https://twitter.com/thej
Another option would be to try Facebook's Presto: https://prestodb.io/
Like Impala, Presto is designed for fast interactive querying over Hive
tables, but it is also capable of querying data from many other SQL sources
(MySQL, PostgreSQL, Kafka, Cassandra, ...
https://prestodb.io/docs/current/conne