[jira] [Commented] (YARN-9826) Blocked threads at EntityGroupFSTimelineStore#getCachedStore

2020-08-04 Thread Shen Yinjie (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170597#comment-17170597
 ] 

Shen Yinjie commented on YARN-9826:
---

Is there any progress on this issue? :)

> Blocked threads at EntityGroupFSTimelineStore#getCachedStore
> 
>
> Key: YARN-9826
> URL: https://issues.apache.org/jira/browse/YARN-9826
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 2.7.3
>Reporter: Harunobu Daikoku
>Priority: Minor
>
> We have observed this several times on our production cluster: hundreds 
> of TimelineServer threads are blocked at the following synchronized block in 
> EntityGroupFSTimelineStore#getCachedStore when our HDFS NameNode is under 
> high load.
> {code:java}
> synchronized (this.cachedLogs) {
>   // Note that the content in the cache log storage may be stale.
>   cacheItem = this.cachedLogs.get(groupId);
>   if (cacheItem == null) {
>     LOG.debug("Set up new cache item for id {}", groupId);
>     cacheItem = new EntityCacheItem(groupId, getConfig());
>     AppLogs appLogs = getAndSetAppLogs(groupId.getApplicationId());
>     if (appLogs != null) {
>       LOG.debug("Set applogs {} for group id {}", appLogs, groupId);
>       cacheItem.setAppLogs(appLogs);
>       this.cachedLogs.put(groupId, cacheItem);
>     } else {
>       LOG.warn("AppLogs for groupId {} is set to null!", groupId);
>     }
>   }
> }
> {code}
> While holding the lock, one thread performs multiple fs operations 
> (fs.exists) inside getAndSetAppLogs, which can block every other thread 
> waiting on the lock when, for instance, the NameNode RPC queue is full.
> One possible solution is to move getAndSetAppLogs outside the synchronized 
> block.
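A minimal sketch of the proposed change, under stated assumptions: the class OutsideLockCache and the slowLoader function are hypothetical stand-ins (slowLoader plays the role of getAndSetAppLogs and its fs.exists calls), and a ConcurrentHashMap replaces the synchronized map. The slow load runs with no lock held; only the publish step is atomic.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch, not the actual EntityGroupFSTimelineStore code.
class OutsideLockCache<K, V> {
  private final Map<K, V> cache = new ConcurrentHashMap<>();
  // Stand-in for getAndSetAppLogs: slow, may hit the NameNode, may return null.
  private final Function<K, V> slowLoader;

  OutsideLockCache(Function<K, V> slowLoader) {
    this.slowLoader = slowLoader;
  }

  V get(K key) {
    V cached = cache.get(key);
    if (cached != null) {
      return cached; // fast path: no lock held, no fs operation
    }
    // Slow path runs without any lock, so other lookups are not blocked.
    V loaded = slowLoader.apply(key);
    if (loaded == null) {
      return null; // like the "AppLogs ... is set to null" branch: cache nothing
    }
    // Atomic publish: if another thread got there first, use its value.
    V winner = cache.putIfAbsent(key, loaded);
    return winner != null ? winner : loaded;
  }
}
{code}

With this shape a cache hit never waits on the NameNode. The trade-off is that several threads missing the same key at the same moment may each run the loader; putIfAbsent keeps the first result and the others are discarded.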



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9826) Blocked threads at EntityGroupFSTimelineStore#getCachedStore

2019-09-19 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934068#comment-16934068
 ] 

Akira Ajisaka commented on YARN-9826:
-

bq. I don't see any side effects with that change.

There are no side effects, but AppLogs may be created more than once (with duplicate fs operations) for the same application.
I think we can use a separate lock object to avoid the duplicate operations, as follows:

{code}
  private final Object fsOpLock = new Object();
(snip)
    // Note that the content in the cache log storage may be stale.
    cacheItem = this.cachedLogs.get(groupId);
    // If the cache already exists, we don't need to hold any locks.
    if (cacheItem == null) {
      // Use lock to serialize fs operations
      synchronized (fsOpLock) {
        // Recheck cache to avoid duplicate fs operations
        cacheItem = this.cachedLogs.get(groupId);
        if (cacheItem == null) {
          LOG.debug("Set up new cache item for id {}", groupId);
          cacheItem = new EntityCacheItem(groupId, getConfig());
          AppLogs appLogs = getAndSetAppLogs(groupId.getApplicationId());
          if (appLogs != null) {
            LOG.debug("Set applogs {} for group id {}", appLogs, groupId);
            cacheItem.setAppLogs(appLogs);
            this.cachedLogs.put(groupId, cacheItem);
          } else {
            LOG.warn("AppLogs for groupId {} is set to null!", groupId);
          }
        }
      }
    }
{code}
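The double-checked pattern above can be exercised in isolation. This is a self-contained sketch, assuming a plain String key and Object value in place of the real group id and EntityCacheItem; slowLoad and the loads counter are hypothetical, added only to count how often the expensive path runs. Several threads missing the same key at once should trigger exactly one load.

{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of double-checked locking with a dedicated lock object.
class DoubleCheckedCache {
  // Like cachedLogs: each individual call is synchronized on the map itself.
  private final Map<String, Object> cachedLogs =
      Collections.synchronizedMap(new HashMap<>());
  // Separate lock that serializes only the slow, cache-miss path.
  private final Object fsOpLock = new Object();
  // Counts how many times the expensive load actually ran.
  final AtomicInteger loads = new AtomicInteger();

  Object get(String groupId) {
    Object item = cachedLogs.get(groupId); // hits never touch fsOpLock
    if (item == null) {
      synchronized (fsOpLock) {
        item = cachedLogs.get(groupId); // recheck: another thread may have loaded it
        if (item == null) {
          item = slowLoad(groupId);
          cachedLogs.put(groupId, item);
        }
      }
    }
    return item;
  }

  private Object slowLoad(String groupId) {
    loads.incrementAndGet();
    try {
      Thread.sleep(50); // simulates fs.exists round trips to a busy NameNode
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return new Object();
  }

  /** Fires {@code threads} concurrent lookups of one key and returns the load count. */
  static int concurrentLoads(int threads) {
    DoubleCheckedCache cache = new DoubleCheckedCache();
    CountDownLatch start = new CountDownLatch(1);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        try {
          start.await(); // release all threads together to maximize contention
          cache.get("app_1_group");
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }
    start.countDown();
    pool.shutdown();
    try {
      pool.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return cache.loads.get();
  }
}
{code}

As in the proposal, fsOpLock is a single shared lock, so cache misses for different group ids still queue behind each other; only cache hits skip locking entirely.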




[jira] [Commented] (YARN-9826) Blocked threads at EntityGroupFSTimelineStore#getCachedStore

2019-09-17 Thread Harunobu Daikoku (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931218#comment-16931218
 ] 

Harunobu Daikoku commented on YARN-9826:


[~Prabhu Joseph] Although multiple threads could indeed execute 
getAndSetAppLogs with the same app id at the same time, I don't see any side 
effects with that change.

AFAIK, the following is the only piece of code inside getAndSetAppLogs that 
could modify state:
{code:java}
  if (appState != AppState.UNKNOWN) {
    LOG.debug("Create and try to add new appLogs to appIdLogMap for {}",
        applicationId);
    appLogs = createAndPutAppLogsIfAbsent(
        applicationId, appDirPath, appState);
  }
{code}

Apparently createAndPutAppLogsIfAbsent atomically updates appIdLogMap with 
ConcurrentMap#putIfAbsent(), so this doesn't have to be synchronized on 
this.cachedLogs.
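The property relied on here can be checked in isolation: when every thread creates a candidate and publishes it through ConcurrentMap#putIfAbsent, all callers end up holding the same instance. The sketch assumes the real method follows that create-then-putIfAbsent shape; PutIfAbsentRegistry and createAndPutIfAbsent are hypothetical analogues, not the Hadoop code.

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical analogue of an appIdLogMap updated via putIfAbsent.
class PutIfAbsentRegistry {
  private final ConcurrentMap<String, Object> appIdLogMap = new ConcurrentHashMap<>();

  /** Creates a candidate and publishes it only if no value exists yet. */
  Object createAndPutIfAbsent(String appId) {
    Object candidate = new Object(); // may be thrown away if another thread wins
    Object existing = appIdLogMap.putIfAbsent(appId, candidate);
    return existing != null ? existing : candidate;
  }

  /** Returns how many distinct instances callers observed; 1 means they all agree. */
  static int distinctInstancesSeen(int threads) {
    PutIfAbsentRegistry registry = new PutIfAbsentRegistry();
    // Object uses identity equals/hashCode, so this set counts distinct instances.
    Set<Object> seen = ConcurrentHashMap.newKeySet();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> seen.add(registry.createAndPutIfAbsent("application_1_0001")));
    }
    pool.shutdown();
    try {
      pool.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return seen.size();
  }
}
{code}

Each racing thread still builds its own candidate (and, in getAndSetAppLogs, may still issue its own fs.exists calls before reaching this point), so concurrent callers cost extra work but do not end up with divergent state.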




[jira] [Commented] (YARN-9826) Blocked threads at EntityGroupFSTimelineStore#getCachedStore

2019-09-14 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929749#comment-16929749
 ] 

Prabhu Joseph commented on YARN-9826:
-

[~hdaikoku] If getAndSetAppLogs is moved outside the synchronized block, 
multiple threads may execute it concurrently for the same applicationId.




[jira] [Commented] (YARN-9826) Blocked threads at EntityGroupFSTimelineStore#getCachedStore

2019-09-14 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929744#comment-16929744
 ] 

Prabhu Joseph commented on YARN-9826:
-

[~hdaikoku] cachedLogs is a Collections.synchronizedMap. Is the synchronized 
block required when accessing the map?

cc [~tarunparimi].
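For reference on this question: Collections.synchronizedMap makes each individual call atomic, but a get followed by a put is a compound action, so two threads can both see null and both create an entry. Holding the map's own monitor (which is what synchronized (this.cachedLogs) does, since the wrapper synchronizes on itself) makes the pair atomic; it also means every other call on the map waits while that lock is held. A small hypothetical demo, counting creations with and without the outer lock:

{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo of check-then-act on a Collections.synchronizedMap.
class SynchronizedMapCompound {
  /** Returns how many threads decided the key was missing and created an entry. */
  static int creations(int threads, boolean holdMapLock) {
    Map<String, Object> cachedLogs = Collections.synchronizedMap(new HashMap<>());
    AtomicInteger created = new AtomicInteger();
    Runnable checkThenAct = () -> {
      if (cachedLogs.get("group_1") == null) { // get() alone is atomic...
        created.incrementAndGet();
        Thread.yield(); // widen the window between get and put
        cachedLogs.put("group_1", new Object()); // ...but get-then-put is not
      }
    };
    CountDownLatch start = new CountDownLatch(1);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        try {
          start.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        if (holdMapLock) {
          // Same monitor the wrapper uses internally, so get+put act as one unit.
          synchronized (cachedLogs) {
            checkThenAct.run();
          }
        } else {
          checkThenAct.run();
        }
      });
    }
    start.countDown();
    pool.shutdown();
    try {
      pool.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return created.get();
  }
}
{code}

With the outer lock the count is always 1; without it the count may exceed 1 and varies from run to run.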
