Re: hive.root.logger influencing query plan?? so it's not so

2016-08-31 Thread Vladimir Kroz
One of the causes could be that on long-running queries your terminal
session gets disconnected and the client process terminates (which makes it
look as if the query hangs).

When debug messages are on, they keep the terminal session alive and hence
allow your query to complete.

I'm not sure if this is your case, but based on the synopsis you provided it
could be one of the reasons.
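
If it is, one quick way to rule that out is to run the client detached from
the terminal, e.g. something along these lines (just a sketch; screen or
tmux would work equally well):

  # run the hive client detached from the terminal so a dropped SSH session
  # cannot kill it; output still goes to files, as in your script below
  nohup hive -e "$sql" > stdout.log 2> stderr.log < /dev/null &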

Cheers,
Vlad

---
From: Stephen Sprague 
To: "user@hive.apache.org" 
Cc:
Date: Tue, 30 Aug 2016 20:28:50 -0700
Subject: hive.root.logger influencing query plan?? so it's not so
Hi guys,
I've banged my head on this one all day and I need to surrender. I have a
query that hangs (never returns). However, when I turn on logging to DEBUG
level it works. I'm stumped. I include here the query, the two query plans
(the only difference between them being the log level) and a traceback on
the query that hangs.

* the query

#!/bin/bash

sql="
explain SELECT *
FROM   raw_logs.fact_www_access_datekey_by_hour
WHERE  date_key = 20160828  -- partition key
and lower(http_string) = 'foo'  -- always evaluates to false
limit 1 -- if I remove limit 1 it works; I expect an empty result set.
;
"

#hive -hiveconf hive.root.logger=ERROR,console -e "$sql" >stdout.bad 2>stderr.bad   # hangs
hive -hiveconf hive.root.logger=DEBUG,console -e "$sql" >stdout.good 2>stderr.good  # works



* explain plan for the one that hangs (ERROR,console)

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
Fetch Operator
  limit: 1
  Processor Tree:
TableScan
  alias: fact_www_access_datekey_by_hour
  Statistics: Num rows: 144355924 Data size: 158755813572 Basic
stats: COMPLETE Column stats: NONE
  Filter Operator
predicate: (lower(http_string) = 'foo') (type: boolean)
Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
Select Operator
  expressions: payload_year (type: int), payload_month (type:
int), payload_day (type: int), payload_time (type: string),
payload_gmt_offset (type: string), perf_time_micros (type: int),
http_string (type: string), http_type (type: string), http_version (type:
string), http_rc (type: int), http_size (type: int), referrer (type:
string), user_agent (type: string), bot_flag (type: string), ip_list (type:
array), scalar_tags (type: map), url_components
(type: array), user_id (type: string), lt_session_id (type:
string), session_id (type: string), log_name (type: string), log_lineno
(type: int), log_line (type: string), web_server (type: string),
location_info (type: map), dp_main_url_type (type: string),
dp_inventory_type (type: string), dp_main_url (type:
map>), dp_tag_url (type: map>),
dp_referrer_url (type: map>), hour (type: int),
20160828 (type: int)
  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
_col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15,
_col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24,
_col25, _col26, _col27, _col28, _col29, _col30, _col31
  Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
  Limit
Number of rows: 1
Statistics: Num rows: 1 Data size: 1099 Basic stats:
COMPLETE Column stats: NONE
ListSink

* query plan for the one that works (DEBUG,console)

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: fact_www_access_datekey_by_hour
Statistics: Num rows: 144355924 Data size: 158755813572 Basic
stats: COMPLETE Column stats: NONE
Filter Operator
  predicate: (lower(http_string) = 'foo') (type: boolean)
  Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: payload_year (type: int), payload_month (type:
int), payload_day (type: int), payload_time (type: string),
payload_gmt_offset (type: string), perf_time_micros (type: int),
http_string (type: string), http_type (type: string), http_version (type:
string), http_rc (type: int), http_size (type: int), referrer (type:
string), user_agent (type: string), bot_flag (type: string), ip_list (type:
array), scalar_tags (type: map), url_components
(type: array), user_id (type: string), lt_session_id (type:
string), session_id (type: string), log_name (type: string), log_lineno
(type: int), log_line (type: string), web_server (type: string),
location_info (type: map), dp_main_url_type (type: string),
dp_inventory_type (type: string), dp_main_url (type:
map>), dp_tag_url (type: map>),
dp_referrer_url (type: map>), hour (type: int),
20160828 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4,
_col5,

Re: Java Heap Memory OOM when using ORCNewInputFormat in MR

2016-08-31 Thread Hank baker
No, I am not using dynamic partitioning.
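
(For reference, if we were able to touch the job config, the buffer-size
setting discussed below could be passed per invocation roughly like this.
This is only a sketch: the jar, driver class and paths are made up, and it
assumes the driver uses ToolRunner/GenericOptionsParser so that -D
properties reach the job Configuration the ORC writer reads its defaults
from.)

  # lower the per-column ORC stream buffer from the 256k default to 32k
  # for this job only (all names below are illustrative)
  hadoop jar csv-to-orc.jar com.example.CsvToOrcDriver \
    -D hive.exec.orc.default.buffer.size=32768 \
    /data/input/csv /data/output/orc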

On Wed, Aug 31, 2016 at 4:00 AM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> In Hive 1.2.1 the automatic estimation of buffer size happens only if the
> column count is >1000.
> You need https://issues.apache.org/jira/browse/HIVE-11807 for automatic
> estimation by default, or a Hive >= 2.0 release.
>
> Are you using dynamic partitioning in hive?
>
> Thanks
> Prasanth
>
>
> On Aug 30, 2016, at 3:22 PM, Hank baker  wrote:
>
> Hi Prasanth,
>
> Thanks for your quick reply. As I mentioned in the previous mail, this was
> the same stack trace in about 60 failed reducers. I am using Hive 1.2.1,
> not sure which newer version you are referring to.
>
> But exactly as you pointed out, when I tried to reproduce this issue on my
> local setup by simply writing a large number of columns, the stack trace
> did vary.
>
> Also, from the WriterImpl code, it appears that the stripes have already
> been flushed before the metadata is written. I may be mistaken, please
> correct me if I'm wrong. This is one of the reasons I believe that this is
> more than just a simple memory issue related to columns.
>
>
> On Wed, Aug 31, 2016 at 3:42 AM, Prasanth Jayachandran <
> pjayachand...@hortonworks.com> wrote:
>
>> Under memory pressure, the stack trace of the OOM can be different
>> depending on who is requesting more memory when the memory is already
>> full. That is the reason you are seeing the OOM in writeMetadata (it may
>> happen in other places as well). When dealing with thousands of columns
>> it's better to set hive.exec.orc.default.buffer.size to a lower value
>> until you can avoid the OOM. Depending on the version of Hive you are
>> using, this may be set automatically for you. In older Hive versions, if
>> the number of columns is >1000, the buffer size will be chosen
>> automatically. In newer versions, this limit is removed and the ORC
>> writer will figure out the optimal buffer size based on the stripe size,
>> available memory and the number of columns.
>>
>> Thanks
>> Prasanth
>>
>>
>> On Aug 30, 2016, at 3:04 PM, Hank baker  wrote:
>>
>> Hi all,
>>
>> I'm trying to run a map-reduce job to convert CSV data into ORC using
>> OrcNewOutputFormat (a reduce phase is required to satisfy some
>> partitioning logic), but I'm getting an OOM error in the reduce phase
>> (during merge, to be exact) with the stack trace attached below, for one
>> particular table which has about 800 columns. The error seems common
>> across all reducers (minimum reducer input records is about 20, max. is
>> about 100 mil). I am trying to figure out the exact cause of the error,
>> since I have used the same job to convert tables with 100-1 columns
>> without any memory or config changes.
>>
>> What concerns me in the stack trace is this line:
>>
>>  at 
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.writeMetadata(WriterImpl.java:2327)
>>
>> Why is it going OOM while trying to write metadata?
>>
>> I originally believed this was simply due to the number of open buffers
>> (as mentioned in
>> http://mail-archives.apache.org/mod_mbox/hive-dev/201410.mbox/%3c543d5eb6.2000...@apache.org%3E).
>> So I wrote a bit of code to reproduce the error on my local setup by
>> creating an instance of OrcRecordWriter and writing a large number of
>> columns. I did get a similar heap space error; however, it was going OOM
>> while trying to flush the stripes, with this in the stack trace:
>>
>> at 
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:2133)
>>
>> This issue on the dev environment got resolved by setting
>>
>> hive.exec.orc.default.buffer.size=32k
>>
>> Will the same setting work for the original error?
>>
>> For different reasons I cannot change the reducer memory or lower the
>> buffer size even at a job level. For now, I am just trying to understand
>> the source of this error. Can anyone please help?
>>
>> Original OOM stacktrace:
>>
>> FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : 
>> java.lang.OutOfMemoryError: Java heap space
>>  at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
>>  at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
>>  at 
>> org.apache.hadoop.hive.ql.io.orc.OutStream.getNewInputBuffer(OutStream.java:107)
>>  at org.apache.hadoop.hive.ql.io.orc.OutStream.write(OutStream.java:140)
>>  at 
>> com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
>>  at 
>> com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
>>  at 
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.writeMetadata(WriterImpl.java:2327)
>>  at 
>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2426)
>>  at 
>> org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat$OrcRecordWriter.close(OrcNewOutputFormat.java:67)
>>  at 
>> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
>>  at 
>> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
>>  at org.apache.had

Quota for rogue ad-hoc queries

2016-08-31 Thread ravi teja
Hi Community,

Many users run ad-hoc Hive queries on our platform. Some rogue queries
managed to fill up the HDFS space, causing mainstream queries to fail.

We want to limit the data generated by these ad-hoc queries. We are aware
of the strict-mode parameter which limits the data being scanned, but it is
of less help as a huge number of user tables aren't partitioned.

Is there a way we can limit the data generated by Hive per query, like a
Hive parameter for setting HDFS quotas on the job-level scratch directory,
or any other approach? What's the general approach to guardrail such
multi-tenant cases?
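
(To make the scratch-directory idea concrete, what we were imagining is
something like the sketch below, assuming the default hive.exec.scratchdir
layout with per-user subdirectories under /tmp/hive; the path and the 500g
figure are illustrative. A per-query guardrail would still be nicer.)

  # cap how much intermediate/scratch data a single user's queries can write
  hdfs dfsadmin -setSpaceQuota 500g /tmp/hive/adhoc_user

  # check the quota and current usage
  hdfs dfs -count -q -h /tmp/hive/adhoc_user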

Thanks in advance,
Ravi


Re: hive.root.logger influencing query plan?? so it's not so

2016-08-31 Thread Stephen Sprague
hi guys,

Vlad: good suggestion, however in my case it's a 5-second query (when it
works).

Gopal: thanks for the explanation of the effect logging can have on the
execution path. Somewhat counter-intuitive, I must say, and as you can
imagine a tad more challenging to debug - when debugging influences the
observation. :)

That said, adding:

   set hive.fetch.task.conversion=none;

forced the query plans to be the same, and that made the difference. It
worked.

Now I need to research that setting and see what the implications are. I'm
sure it doesn't come for free.
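
FWIW, here is roughly how I'm thinking of scoping it rather than baking it
into hive-site.xml; just a sketch, assuming Hive 2.1.0 where both
properties exist:

  # option 1: disable fetch-task conversion just for this run (what worked here)
  hive -hiveconf hive.fetch.task.conversion=none -e "$sql"

  # option 2 (my assumption, based on the documented threshold behaviour):
  # keep the optimization but cap it by input size, so big partitions like
  # this one fall back to a real job while small lookups stay fetch tasks
  hive -hiveconf hive.fetch.task.conversion=more \
       -hiveconf hive.fetch.task.conversion.threshold=1073741824 \
       -e "$sql"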

And as always, a big thanks! Let me know if I can provide any details. The
partition size is >1G, btw, and this is using Hive 2.1.0.

Cheers,
Stephen.
PS: here's the HDFS info - the table is bucketed by 24, FWIW.


$ hdfs dfs -du -h
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/00_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/01_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/02_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/03_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/04_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/05_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/06_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/07_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/08_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/09_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/10_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/11_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/12_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/13_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/14_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/15_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/16_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/17_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/18_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/19_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/20_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/21_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/22_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/23_0




On Wed, Aug 31, 2016 at 12:19 AM, Vladimir Kroz 
wrote:

> One of the causes could be that on long-running queries your terminal
> session gets disconnected and the client process terminates (which makes
> it look as if the query hangs).
>
> When debug messages are on, they keep the terminal session alive and hence
> allow your query to complete.
>
> I'm not sure if this is your case, but based on the synopsis you provided
> it could be one of the reasons.
>
> Cheers,
> Vlad
>
> ---
> From: Stephen Sprague 
> To: "user@hive.apache.org" 
> Cc:
> Date: Tue, 30 Aug 2016 20:28:50 -0700
> Subject: hive.root.logger influencing query plan?? so it's not so
> Hi guys,
> I've banged my head on this one all day and I need to surrender. I have a
> query that hangs (never returns). However, when I turn on logging to DEBUG
> level it works. I'm stumped. I include here the query, the two query plans
> (the only difference between them being the log level) and a traceback on
> the query that hangs.
>
> * the query
>
> #!/bin/bash
>
> sql="
> explain SELECT *
> FROM   raw_logs.fact_www_access_datekey_by_hour
> WHERE  date_key = 20160828  -- partition key
> and lower(http_string) = 'foo'  -- always evaluates to false
> limit 1 -- if I remove limit 1 it works; I expect an empty result set.
> ;
> "
>
> #hive -hiveconf hive.root.logger=ERROR,console -e "$sql" >stdout.bad 2>stderr.bad   # hangs
> hive -hiveconf hive.root.logger=DEBUG,console -e "$sql" >stdout.good 2>stderr.good  # works
>
>
>
> * explain plan for the one that hangs (ERROR,console)
>
> STAGE D

Re: Quota for rogue ad-hoc queries

2016-08-31 Thread Mich Talebzadeh
Try this:

hive.limit.optimize.fetch.max

   - Default Value: 50000
   - Added In: Hive 0.8.0

Maximum number of rows allowed for a smaller subset of data for simple
LIMIT, if it is a fetch query. Insert queries are not restricted by this
limit.
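
For example (a sketch; note that, as far as I recall, this only applies to
simple fetch-style LIMIT queries, and hive.limit.optimize.enable needs to
be on for it to take effect):

  hive -hiveconf hive.limit.optimize.enable=true \
       -hiveconf hive.limit.optimize.fetch.max=10000 \
       -e "SELECT * FROM some_table LIMIT 100;"   # table name illustrative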


HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 August 2016 at 13:42, ravi teja  wrote:

> Hi Community,
>
> Many users run ad-hoc Hive queries on our platform. Some rogue queries
> managed to fill up the HDFS space, causing mainstream queries to fail.
>
> We want to limit the data generated by these ad-hoc queries. We are aware
> of the strict-mode parameter which limits the data being scanned, but it
> is of less help as a huge number of user tables aren't partitioned.
>
> Is there a way we can limit the data generated by Hive per query, like a
> Hive parameter for setting HDFS quotas on the job-level scratch directory,
> or any other approach? What's the general approach to guardrail such
> multi-tenant cases?
>
> Thanks in advance,
> Ravi
>


Re: Quota for rogue ad-hoc queries

2016-08-31 Thread ravi teja
Thanks Mich,

Unfortunately we have many insert queries.
Are there any other ways?

Thanks,
Ravi

On Wed, Aug 31, 2016 at 9:45 PM, Mich Talebzadeh 
wrote:

> Try this:
>
> hive.limit.optimize.fetch.max
>
>    - Default Value: 50000
>- Added In: Hive 0.8.0
>
> Maximum number of rows allowed for a smaller subset of data for simple
> LIMIT, if it is a fetch query. Insert queries are not restricted by this
> limit.
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 31 August 2016 at 13:42, ravi teja  wrote:
>
>> Hi Community,
>>
>> Many users run ad-hoc Hive queries on our platform. Some rogue queries
>> managed to fill up the HDFS space, causing mainstream queries to fail.
>>
>> We want to limit the data generated by these ad-hoc queries. We are aware
>> of the strict-mode parameter which limits the data being scanned, but it
>> is of less help as a huge number of user tables aren't partitioned.
>>
>> Is there a way we can limit the data generated by Hive per query, like a
>> Hive parameter for setting HDFS quotas on the job-level scratch
>> directory, or any other approach? What's the general approach to
>> guardrail such multi-tenant cases?
>>
>> Thanks in advance,
>> Ravi
>>
>
>