Re: any hive release imminent?

2017-06-20 Thread Stephen Sprague
awesome. thanks for the update! On Tue, Jun 20, 2017 at 10:04 AM, Owen O'Malley wrote: > The natives are very restless. I'm actively working on getting Hive 2.2 > released. I'm running through qfile tests now and I hope to have it in the > next couple weeks. It will be quickly followed up by Hi

Re: Format dillema

2017-06-20 Thread Edward Capriolo
"Hive 3.x branch has text vectorization and LLAP cache support for it, so hopefully the only relevant concern about Text will be the storage costs due to poor compression (& the lack of updates)." I kept hearing about vectorization, but later found out it was going to work if I used ORC. Litterall

Re: Format dillema

2017-06-20 Thread Gopal Vijayaraghavan
> 1) both do the same thing. The start of this thread is the exact opposite - trying to suggest ORC is better for storage & wanting to use it. > As it relates to the columnar formats, it is a silly arms race. I'm not sure "silly" is the operative word - we've lost a lot of fragmentation of the c

Re: Format dillema

2017-06-20 Thread Edward Capriolo
"Hive and LLAP do support Parquet precisely because the developers want to be able to process everyone's data." Yes. But there are a number of optimizations on the Hive ORC side that we know are not implemented in the Parquet support, which is why I made my statement. Impala (Parq=yes, orc=no) Hiv

Re: Format dillema

2017-06-20 Thread Owen O'Malley
On Tue, Jun 20, 2017 at 10:12 AM, Edward Capriolo wrote: > It is whack that two optimized row columnar formats exist and each > respective project (hive/impala) has good support for one and lame/no > support for the other. We have two similar formats because they were designed at roughly the

Re: Format dillema

2017-06-20 Thread Edward Capriolo
It is whack that two optimized row columnar formats exist and each respective project (hive/impala) has good support for one and lame/no support for the other. Impala is now an Apache project. Also 'whack' and 'lame' are technical terms often used by the people in the real world that have to use

Re: any hive release imminent?

2017-06-20 Thread Owen O'Malley
The natives are very restless. I'm actively working on getting Hive 2.2 released. I'm running through qfile tests now and I hope to have it in the next couple weeks. It will be quickly followed up by Hive 2.3, which will be more aggressive with features, but less stable. .. Owen On Mon, Jun 19, 2

Re: Format dillema

2017-06-20 Thread Owen O'Malley
You should also try LLAP. With ORC or text, it will cache the hot columns and partitions in memory. I can't seem to find the slides yet, but the Comcast team had good results with LLAP: https://dataworkssummit.com/san-jose-2017/sessions/hadoop-query-performance-smackdown/ https://twitter.com/thej

Re: Format dillema

2017-06-20 Thread Furcy Pin
Another option would be to try Facebook's Presto: https://prestodb.io/ Like Impala, Presto is designed for fast interactive querying over Hive tables, but it is also capable of querying data from many other sources (MySQL, PostgreSQL, Kafka, Cassandra, ... https://prestodb.io/docs/current/conne