Re: Tez GC issues perhaps? not sure.

2016-12-14 Thread Stephen Sprague
ah.

2016-12-14 14:05:07,855 [WARN] [AMShutdownThread] |ats.ATSHistoryLoggingService|: ATSService being stopped, eventQueueBacklog=14820, maxTimeLeftToFlush=-1, waitForever=true
2016-12-14 14:05:37,877 [ERROR] [AMShutdownThread] |impl.TimelineClientImpl|: Failed to get the response from the timelin…

Re: Tez GC issues perhaps? not sure.

2016-12-14 Thread Gopal Vijayaraghavan
> looking at the stderr of that one container hanging around we have this below.

Look in the syslog for a log line which starts with "ATSService being stopped, eventQueueBacklog=…, waitForever=true".

Cheers,
Gopal

Re: Tez GC issues perhaps? not sure.

2016-12-14 Thread Stephen Sprague
first pass:

1. changing yarn.timeline-service.ttl-enable to false didn't seem to work. i restarted the TLS and HS2 and RM, and the query still stuck around.

2. figured i'd try using RollingLevelDbTimelineStore but got class not found, so i'll dig around for that later today.

current settings f…
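For reference, a minimal sketch of what the store switch looks like in yarn-site.xml, assuming a Hadoop release that actually ships the rolling store (2.8.0+); note the class name is RollingLevelDBTimelineStore with a capital "DB", so a typo there or an older Hadoop version would both surface as class not found:

<!-- yarn-site.xml (sketch): point the timeline server at the rolling
     LevelDB store; requires Hadoop 2.8.0+ and a timeline server restart. -->
<property>
  <name>yarn.timeline-service.store-class</name>
  <value>org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore</value>
</property>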

Re: Tez GC issues perhaps? not sure.

2016-12-14 Thread Stephen Sprague
Thanks Gopal. I'll set the ttl flag to false and see what gives.

Cheers,
Stephen

On Tue, Dec 13, 2016 at 10:48 PM, Gopal Vijayaraghavan wrote:
> > yarn.timeline-service.ttl-enable=true
>
> Let us validate that this is due to the TTL GC kicking in and disable the
> TTL flag & leave it running f…

Re: Tez GC issues perhaps? not sure.

2016-12-13 Thread Gopal Vijayaraghavan
> yarn.timeline-service.ttl-enable=true

Let us validate that this is due to the TTL GC kicking in and disable the TTL flag & leave it running for a day. Better to also verify the Tez logs of sessions hanging around waiting for the ATS to collect events (look for the last _post log file in the AM…
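For anyone following along, a sketch of the TTL knobs involved in this test; the property names are the standard yarn-site.xml timeline settings, and the values shown for the eviction settings are the stock defaults:

<!-- yarn-site.xml (sketch): disable timeline TTL eviction for the test. -->
<property>
  <name>yarn.timeline-service.ttl-enable</name>
  <value>false</value>
</property>
<!-- For reference, the defaults that govern eviction when TTL is on: -->
<property>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value> <!-- retain entities for 7 days -->
</property>
<property>
  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
  <value>300000</value> <!-- run the eviction scan every 5 minutes -->
</property>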

Re: Tez GC issues perhaps? not sure.

2016-12-13 Thread Stephen Sprague
aha. i sense we're getting closer. here are my settings for yarn.timeline-service.*:

yarn.timeline-service.address=${yarn.timeline-service.hostname}:10200
yarn.timeline-service.client.max-retries=30
yarn.timeline-service.client.retry-interval-ms=1000
yarn.timeline-service.enabled=true
yarn.timelin…

Re: Tez GC issues perhaps? not sure.

2016-12-13 Thread Gopal Vijayaraghavan
> well we are seeing these sessions sitting around for over an hour

This could be one of the causes for this issue - a stuck ATS. Tez won't kill a session till all the ATS info has been submitted out of the process. RollingLevelDbTimelineStore & EntityGroupFSTimelineStore were written to fix t…
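For context, a rough sketch of what adopting the v1.5 store mentioned above involves on both the YARN and Tez sides; the property and class names are the standard ones, but the HDFS paths are placeholders, not anything from this thread:

<!-- yarn-site.xml (sketch): switch the timeline server to the ATS v1.5
     entity-group store, which spools entities to HDFS. -->
<property>
  <name>yarn.timeline-service.version</name>
  <value>1.5</value>
</property>
<property>
  <name>yarn.timeline-service.store-class</name>
  <value>org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore</value>
</property>
<property>
  <name>yarn.timeline-service.entity-group-fs-store.active-dir</name>
  <value>/ats/active</value> <!-- placeholder path -->
</property>
<property>
  <name>yarn.timeline-service.entity-group-fs-store.done-dir</name>
  <value>/ats/done</value> <!-- placeholder path -->
</property>

<!-- tez-site.xml (sketch): have Tez publish history via the v1.5 client. -->
<property>
  <name>tez.history.logging.service.class</name>
  <value>org.apache.tez.dag.history.logging.ats.ATSV15HistoryLoggingService</value>
</property>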

Re: Tez GC issues perhaps? not sure.

2016-12-13 Thread Harish JP
AFAIK, HS2 uses a pool of AMs and submits queries to any free AM. There should be configs which control the number of free AMs, timeout, and so on for the pool used by HS2.

On 14-Dec-2016, at 7:54 AM, Stephen Sprague <sprag...@gmail.com> wrote:
> interesting. thank you. pretty sure they ar…
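The pool Harish describes is controlled from hive-site.xml; a sketch of the usual knobs, with illustrative values rather than recommendations:

<!-- hive-site.xml (sketch): HS2's pool of pre-launched Tez AMs. -->
<property>
  <name>hive.server2.tez.initialize.default.sessions</name>
  <value>true</value> <!-- start the AM pool when HS2 starts -->
</property>
<property>
  <name>hive.server2.tez.default.queues</name>
  <value>default</value> <!-- YARN queue(s) the pooled AMs run in -->
</property>
<property>
  <name>hive.server2.tez.sessions.per.default.queue</name>
  <value>2</value> <!-- number of pooled AMs per queue -->
</property>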

Re: Tez GC issues perhaps? not sure.

2016-12-13 Thread Stephen Sprague
i didn't mean to hit send just yet. well, we are seeing these sessions sitting around for over an hour - yet i don't see that config set, so perhaps the default 5 minutes might not be in play in my case. settings i do see are:

set hive.cli.tez.session.async=true
set …

Re: Tez GC issues perhaps? not sure.

2016-12-13 Thread Stephen Sprague
interesting. thank you. pretty sure they are being submitted through the HS2 service.

On Tue, Dec 13, 2016 at 5:21 PM, Harish JP wrote:
> Hi Stephen,
>
> How are you starting these jobs, beeline, hive-cli, ...? It looks like
> they are being started in session mode, which means the AM wait…

Re: Tez GC issues perhaps? not sure.

2016-12-13 Thread Harish JP
Hi Stephen,

How are you starting these jobs, beeline, hive-cli, ...? It looks like they are being started in session mode, which means the AM waits for 5 minutes (default value) for a new DAG/query to be submitted; if it does not receive a query, it will time out and shut down. The config for thi…
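The config name is truncated above; it is most likely tez.session.am.dag.submit.timeout.secs (an assumption here, though its 300-second default matches the 5 minutes described). A sketch:

<!-- tez-site.xml (sketch): how long an idle session AM waits for the next
     DAG before shutting down. 300 is already the default; shown only to
     illustrate where the 5-minute behaviour comes from. -->
<property>
  <name>tez.session.am.dag.submit.timeout.secs</name>
  <value>300</value>
</property>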