Re: Hive/TEZ/Parquet

2016-12-15 Thread Gopal Vijayaraghavan
 
> Actually, we don't have that many partitions - there are a lot of gaps in
> both days and time events as well.

Your partition description sounded a lot like one of the FAQs from Mithun's 
talks, which is why I asked.

http://www.slideshare.net/Hadoop_Summit/hive-at-yahoo-letters-from-the-trenches/24

> But I would like to understand what you mean by "time spent might partly be
> query planning with million partitions"? I presume this is in producing the
> physical plan? -- does it spend time allocating a group of partition
> directories to each map task?

Yes, the physical planner is significant overhead, since all map-tasks get a 
list of all partitions and match each read against that list (partition schema 
can evolve, this is closer to O(n^2)) & the split-generation is bottlenecked by 
the total # of files involved in the operation (a slow O(n) operation is still 
slow).

The overhead of maintaining partitions is fairly high & if the basic file 
statistics are missing, the entire query planning will try to do something 
like a du -sh on each partition.

Also, if you have >100k partitions, disabling the metadata optimizer 
(hive.optimize.metadataonly=false) would be a good thing, since that codepath 
is single threaded, while a compute-heavy full-table scan is much faster due to 
parallelism - it might burn more CPU, but it would come back in less than an 
hour.
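As a sketch, that session-level override would look like this (the table name
"events" is a stand-in for illustration):

```sql
-- Disable the single-threaded metadata-only optimization for this session,
-- forcing a parallel full-table scan instead (hypothetical table name).
SET hive.optimize.metadataonly=false;
SELECT COUNT(*) FROM events;
```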

You might want to opt for daily partitions & also run the stats gathering ops 
with "analyze table  compute statistics partialscan;" & "analyze table 
 compute statistics for columns" to speed up further queries.
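With a hypothetical table name filled in, the stats-gathering commands would
look roughly like:

```sql
-- Gather basic table/partition statistics via a partial scan
ANALYZE TABLE events COMPUTE STATISTICS PARTIALSCAN;
-- Gather column-level statistics for the optimizer
ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS;
```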

At least in my experience with ORC + valid stats, a query like "select count(*) 
from table" should take <500 milliseconds.

Cheers,
Gopal





Re: [ANNOUNCE] New Hive Committer - Rajesh Balamohan

2016-12-15 Thread Vaibhav Gumashta
Congrats Rajesh!

‹Vaibhav

On 12/15/16, 3:55 AM, "Peter Vary"  wrote:

>Congratulations Rajesh!
>
>> On Dec 15, 2016, at 6:40 AM, Rui Li  wrote:
>> 
>> Congratulations :)
>> 
>> On Thu, Dec 15, 2016 at 6:50 AM, Gunther Hagleitner <
>> ghagleit...@hortonworks.com> wrote:
>> 
>>> Congrats Rajesh!
>>> 
>>> From: Jimmy Xiang 
>>> Sent: Wednesday, December 14, 2016 11:38 AM
>>> To: user@hive.apache.org
>>> Cc: d...@hive.apache.org; rbalamo...@apache.org
>>> Subject: Re: [ANNOUNCE] New Hive Committer - Rajesh Balamohan
>>> 
>>> Congrats, Rajesh!!
>>> 
>>> On Wed, Dec 14, 2016 at 11:32 AM, Sergey Shelukhin
>>>  wrote:
 Congratulations!
 
 From: Chao Sun 
 Reply-To: "user@hive.apache.org" 
 Date: Wednesday, December 14, 2016 at 10:52
 To: "d...@hive.apache.org" 
 Cc: "user@hive.apache.org" , "
>>> rbalamo...@apache.org"
 
 Subject: Re: [ANNOUNCE] New Hive Committer - Rajesh Balamohan
 
 Congrats Rajesh!
 
 On Wed, Dec 14, 2016 at 9:26 AM, Vihang Karajgaonkar <
>>> vih...@cloudera.com>
 wrote:
> 
> Congrats Rajesh!
> 
> On Wed, Dec 14, 2016 at 1:54 AM, Jesus Camacho Rodriguez <
> jcamachorodrig...@hortonworks.com> wrote:
> 
>> Congrats Rajesh, well deserved! :)
>> 
>> --
>> Jesús
>> 
>> 
>> 
>> 
>> On 12/14/16, 8:41 AM, "Lefty Leverenz" 
>>> wrote:
>> 
>>> Congratulations Rajesh!
>>> 
>>> -- Lefty
>>> 
>>> 
>>> On Tue, Dec 13, 2016 at 11:58 PM, Rajesh Balamohan
>>> >> 
>>> wrote:
>>> 
 Thanks a lot for providing this opportunity and to all for their
>> messages.
 :)
 
 ~Rajesh.B
 
 On Wed, Dec 14, 2016 at 11:33 AM, Dharmesh Kakadia
 >> 
 wrote:
 
> Congrats Rajesh !
> 
> Thanks,
> Dharmesh
> 
> On Tue, Dec 13, 2016 at 7:37 PM, Vikram Dixit K <
>> vikram.di...@gmail.com>
> wrote:
> 
>> Congrats Rajesh! :)
>> 
>> On Tue, Dec 13, 2016 at 9:36 PM, Pengcheng Xiong
>> 
>> wrote:
>> 
>>> Congrats Rajesh! :)
>>> 
>>> On Tue, Dec 13, 2016 at 6:51 PM, Prasanth Jayachandran <
>>> prasan...@apache.org
 wrote:
>>> 
 The Apache Hive PMC has voted to make Rajesh Balamohan a
>> committer on
>>> the
 Apache Hive Project. Please join me in congratulating Rajesh.
 
 Congratulations Rajesh!
 
 Thanks
 Prasanth
>>> 
>> 
>> 
>> 
>> --
>> Nothing better than when appreciated for hard work.
>> -Mark
>> 
> 
> 
 
>> 
 
 
>>> 
>>> 
>> 
>> 
>> -- 
>> Best regards!
>> Rui Li
>> Cell: (+86) 13564950210
>
>



Re: Hive/TEZ/Parquet

2016-12-15 Thread VJ Anand
Thanks Gopal for the feedback.

Actually, we don't have that many partitions - there are a lot of gaps in both
days and time events as well. But I would like to understand what you mean
by "time spent might partly be query planning with million partitions". I
presume this is in producing the physical plan -- does it spend time
allocating a group of partition directories to each map task? Can you
elaborate, or point me to any material to better understand this?

Thanks
VJ

On Thu, Dec 15, 2016 at 1:03 PM, Gopal Vijayaraghavan 
wrote:

> > The partition is by year/month/day/hour/minute. I have two directories -
> over two years, and the total number of records is 50Million.
>
> That's a million partitions with 50 rows in each of them?
>
> > I am seeing it takes more than 1hr to complete. Any thoughts, on what
> could be the issue or approach that can be taken to improve the performance?
>
> Looks like you have over-partitioned your data massively - the 1 hour
> might be partly query planning with million partitions and the rest might
> be file-count related overheads.
>
> At least in case of ORC, I recommend that the partitions contain at least
> 1 Gb of data & that if you really need to query down to finer levels, to
> use bloom filters (PARQUET-41 is not fixed yet, so YMMV) + sorted ordering.
>
> http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/4
>
> Cheers,
> Gopal
>
>
>


-- 
*VJ Anand*
*Founder *
*Sankia*
vjan...@sankia.com
925-640-1340
www.sankia.com

*Confidentiality Notice*: This e-mail message, including any attachments,
is for the sole use of the intended recipient(s) and may contain
confidential and privileged information. Any unauthorized review, use,
disclosure or distribution is prohibited. If you are not the intended
recipient, please contact the sender by reply e-mail and destroy all copies
of the original message


Re: Hive/TEZ/Parquet

2016-12-15 Thread Gopal Vijayaraghavan
> The partition is by year/month/day/hour/minute. I have two directories - over
> two years, and the total number of records is 50 million.

That's a million partitions with 50 rows in each of them?

> I am seeing it take more than 1 hr to complete. Any thoughts on what could
> be the issue, or on approaches that could improve the performance?

Looks like you have over-partitioned your data massively - the 1 hour might be 
partly query planning with a million partitions, and the rest might be 
file-count related overheads.

At least in the case of ORC, I recommend that partitions contain at least 1 GB 
of data & that, if you really need to query down to finer levels, you use bloom 
filters (PARQUET-41 is not fixed yet, so YMMV) + sorted ordering.

http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/4
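As a sketch of the ORC-side suggestion (table and column names are
hypothetical; the bloom-filter property applies to ORC, since the Parquet
equivalent was still pending per PARQUET-41):

```sql
-- Daily partitions, ORC storage, bloom filters on the fine-grained
-- lookup column, and sorted insertion so stripes can be skipped.
CREATE TABLE events (
  event_time TIMESTAMP,
  event_id   STRING,
  payload    STRING
)
PARTITIONED BY (ds STRING)
STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns"="event_id");

SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE events PARTITION (ds)
SELECT event_time, event_id, payload, ds
FROM staging_events
SORT BY event_id;  -- sorted ordering within each writer's output
```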

Cheers,
Gopal




Re: Hive/TEZ/Parquet

2016-12-15 Thread VJ Anand
I have a partitioned external Hive table, stored in Parquet files. The
partition is by year/month/day/hour/minute. I have two directories - over
two years, and the total number of records is 50 million. My cluster
configuration is 5 nodes, each with 8 cores and 64GB of RAM - a total of 40
cores and 300GB. I am running Hive using Tez as the engine. I have the
per-container setting at 4GB and vcores at 1. Additionally, I set the Tez min
input split to 36MB and the max input split to the same value, 36MB.
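For reference, the split-size knobs described above are typically set along
these lines (a sketch only; exact property names depend on the Hive/Tez
version in use):

```sql
-- Tez grouped-split sizes in bytes (36 MB = 37748736 bytes)
SET tez.grouping.min-size=37748736;
SET tez.grouping.max-size=37748736;
```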

When I submit the query SELECT COUNT(*) FROM table, I see it allocates 43 map
tasks and 1 reducer task.

I am seeing it take more than 1 hr to complete. Any thoughts on what could be
the issue, or on approaches that could improve the performance?

Thanks
VJ


Re: Column names in ORC file

2016-12-15 Thread Owen O'Malley
Yes, it was fixed in HIVE-4243.

.. Owen

On Thu, Dec 15, 2016 at 10:21 AM, Elliot West  wrote:

> Possibly related to HIVE-4243 which was fixed in Hive 2.0.0:
> https://issues.apache.org/jira/browse/HIVE-4243
>
>
> On Thu, 15 Dec 2016 at 18:06, Daniel Haviv wrote:
>
>> Hi,
>> When I'm generating ORC files using spark the column names are written
>> into the ORC file but when generated using Hive I get the following column
>> names:
>>
>> _col107, _col33, _col23, _col102
>>
>>
>> Is it possible to somehow configure hive to properly store the column names 
>> like Spark?
>>
>>
>> Thank you,
>>
>> Daniel
>>
>>
>>
>>


Re: Column names in ORC file

2016-12-15 Thread Elliot West
Possibly related to HIVE-4243 which was fixed in Hive 2.0.0:
https://issues.apache.org/jira/browse/HIVE-4243


On Thu, 15 Dec 2016 at 18:06, Daniel Haviv 
wrote:

> Hi,
> When I'm generating ORC files using spark the column names are written
> into the ORC file but when generated using Hive I get the following column
> names:
>
> _col107, _col33, _col23, _col102
>
>
> Is it possible to somehow configure hive to properly store the column names 
> like Spark?
>
>
> Thank you,
>
> Daniel
>
>
>
>


Column names in ORC file

2016-12-15 Thread Daniel Haviv
Hi,
When I'm generating ORC files using Spark, the column names are written into
the ORC file, but when they are generated using Hive I get the following
column names:

_col107, _col33, _col23, _col102


Is it possible to somehow configure Hive to properly store the column
names, like Spark does?


Thank you,

Daniel


Re: Maintaining big and complex Hive queries

2016-12-15 Thread Elliot West
I notice that HPL/SQL is not mentioned on the page I referenced; however, I
expect that is another approach that you could use to modularise:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
http://www.hplsql.org/doc

On 15 December 2016 at 17:17, Elliot West  wrote:

> Some options are covered here, although there is no definitive guidance as
> far as I know:
>
> https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL#UnitTestingHiveSQL-Modularisation
>
> On 15 December 2016 at 17:08, Saumitra Shahapure <
> saumitra.offic...@gmail.com> wrote:
>
>> Hello,
>>
>> We are running and maintaining a rather big and complex Hive SELECT query
>> right now. It's basically a single SELECT query which performs a JOIN of
>> about ten other SELECT query outputs.
>>
>> The simplest way to refactor it that we can think of is to break this query
>> down into multiple views and then join the views. A similar possibility is
>> to create intermediate tables.
>>
>> However, creating multiple DDLs in order to maintain a single DML is not
>> very smooth. We would end up polluting the metadata database by creating
>> views / intermediate tables which are used in just this one ETL.
>>
>> What are the other efficient ways to maintain complex SQL queries written
>> in Hive? Are there better ways to break a Hive query into multiple modules?
>>
>> -- Saumitra S. Shahapure
>>
>
>


Re: Maintaining big and complex Hive queries

2016-12-15 Thread Elliot West
Some options are covered here, although there is no definitive guidance as
far as I know:

https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL#UnitTestingHiveSQL-Modularisation

On 15 December 2016 at 17:08, Saumitra Shahapure <
saumitra.offic...@gmail.com> wrote:

> Hello,
>
> We are running and maintaining a rather big and complex Hive SELECT query
> right now. It's basically a single SELECT query which performs a JOIN of
> about ten other SELECT query outputs.
>
> The simplest way to refactor it that we can think of is to break this query
> down into multiple views and then join the views. A similar possibility is
> to create intermediate tables.
>
> However, creating multiple DDLs in order to maintain a single DML is not
> very smooth. We would end up polluting the metadata database by creating
> views / intermediate tables which are used in just this one ETL.
>
> What are the other efficient ways to maintain complex SQL queries written
> in Hive? Are there better ways to break a Hive query into multiple modules?
>
> -- Saumitra S. Shahapure
>


Maintaining big and complex Hive queries

2016-12-15 Thread Saumitra Shahapure
Hello,

We are running and maintaining a rather big and complex Hive SELECT query
right now. It's basically a single SELECT query which performs a JOIN of
about ten other SELECT query outputs.

The simplest way to refactor it that we can think of is to break this query
down into multiple views and then join the views. A similar possibility is
to create intermediate tables.
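As a rough sketch of the view-based refactoring (all table, view, and column
names here are hypothetical stand-ins):

```sql
-- Break each sub-SELECT into a named view...
CREATE VIEW daily_orders_v AS
SELECT customer_id, SUM(total) AS daily_total
FROM orders
GROUP BY customer_id;

CREATE VIEW active_customers_v AS
SELECT customer_id, name
FROM customers
WHERE status = 'active';

-- ...then the top-level query joins the views instead of inlining them.
SELECT c.name, o.daily_total
FROM active_customers_v c
JOIN daily_orders_v o ON c.customer_id = o.customer_id;
```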

However, creating multiple DDLs in order to maintain a single DML is not
very smooth. We would end up polluting the metadata database by creating
views / intermediate tables which are used in just this one ETL.

What are the other efficient ways to maintain complex SQL queries written
in Hive? Are there better ways to break a Hive query into multiple modules?

-- Saumitra S. Shahapure


Re: [ANNOUNCE] New Hive Committer - Rajesh Balamohan

2016-12-15 Thread Peter Vary
Congratulations Rajesh!

> On Dec 15, 2016, at 6:40 AM, Rui Li  wrote:
> 
> Congratulations :)
> 
> On Thu, Dec 15, 2016 at 6:50 AM, Gunther Hagleitner <
> ghagleit...@hortonworks.com> wrote:
> 
>> Congrats Rajesh!
>> 
>> From: Jimmy Xiang 
>> Sent: Wednesday, December 14, 2016 11:38 AM
>> To: user@hive.apache.org
>> Cc: d...@hive.apache.org; rbalamo...@apache.org
>> Subject: Re: [ANNOUNCE] New Hive Committer - Rajesh Balamohan
>> 
>> Congrats, Rajesh!!
>> 
>> On Wed, Dec 14, 2016 at 11:32 AM, Sergey Shelukhin
>>  wrote:
>>> Congratulations!
>>> 
>>> From: Chao Sun 
>>> Reply-To: "user@hive.apache.org" 
>>> Date: Wednesday, December 14, 2016 at 10:52
>>> To: "d...@hive.apache.org" 
>>> Cc: "user@hive.apache.org" , "
>> rbalamo...@apache.org"
>>> 
>>> Subject: Re: [ANNOUNCE] New Hive Committer - Rajesh Balamohan
>>> 
>>> Congrats Rajesh!
>>> 
>>> On Wed, Dec 14, 2016 at 9:26 AM, Vihang Karajgaonkar <
>> vih...@cloudera.com>
>>> wrote:
 
 Congrats Rajesh!
 
 On Wed, Dec 14, 2016 at 1:54 AM, Jesus Camacho Rodriguez <
 jcamachorodrig...@hortonworks.com> wrote:
 
> Congrats Rajesh, well deserved! :)
> 
> --
> Jesús
> 
> 
> 
> 
> On 12/14/16, 8:41 AM, "Lefty Leverenz" 
>> wrote:
> 
>> Congratulations Rajesh!
>> 
>> -- Lefty
>> 
>> 
>> On Tue, Dec 13, 2016 at 11:58 PM, Rajesh Balamohan
>> > 
>> wrote:
>> 
>>> Thanks a lot for providing this opportunity and to all for their
> messages.
>>> :)
>>> 
>>> ~Rajesh.B
>>> 
>>> On Wed, Dec 14, 2016 at 11:33 AM, Dharmesh Kakadia
>>> > 
>>> wrote:
>>> 
 Congrats Rajesh !
 
 Thanks,
 Dharmesh
 
 On Tue, Dec 13, 2016 at 7:37 PM, Vikram Dixit K <
> vikram.di...@gmail.com>
 wrote:
 
> Congrats Rajesh! :)
> 
> On Tue, Dec 13, 2016 at 9:36 PM, Pengcheng Xiong
> 
> wrote:
> 
>> Congrats Rajesh! :)
>> 
>> On Tue, Dec 13, 2016 at 6:51 PM, Prasanth Jayachandran <
>> prasan...@apache.org
>>> wrote:
>> 
>>> The Apache Hive PMC has voted to make Rajesh Balamohan a
> committer on
>> the
>>> Apache Hive Project. Please join me in congratulating Rajesh.
>>> 
>>> Congratulations Rajesh!
>>> 
>>> Thanks
>>> Prasanth
>> 
> 
> 
> 
> --
> Nothing better than when appreciated for hard work.
> -Mark
> 
 
 
>>> 
> 
>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> Best regards!
> Rui Li
> Cell: (+86) 13564950210