[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=448478&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-448478 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 19/Jun/20 13:36 Start Date: 19/Jun/20 13:36 Worklog Time Spent: 10m Work Description: belugabehr merged pull request #807: URL: https://github.com/apache/hive/pull/807 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 448478) Time Spent: 2h 40m (was: 2.5h) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 2h 40m > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=448475&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-448475 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 19/Jun/20 13:31 Start Date: 19/Jun/20 13:31 Worklog Time Spent: 10m Work Description: belugabehr commented on pull request #807: URL: https://github.com/apache/hive/pull/807#issuecomment-646637122 I am going to merge to master. We can deal with logging in another task. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 448475) Time Spent: 2.5h (was: 2h 20m) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=445016&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-445016 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 12/Jun/20 16:19 Start Date: 12/Jun/20 16:19 Worklog Time Spent: 10m Work Description: belugabehr commented on pull request #807: URL: https://github.com/apache/hive/pull/807#issuecomment-643361096 Closed & Re-opened to kick of full test suite. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 445016) Time Spent: 2h 20m (was: 2h 10m) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 2h 20m > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=445014&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-445014 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 12/Jun/20 16:18 Start Date: 12/Jun/20 16:18 Worklog Time Spent: 10m Work Description: belugabehr closed pull request #807: URL: https://github.com/apache/hive/pull/807 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 445014) Time Spent: 2h (was: 1h 50m) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 2h > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=445015&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-445015 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 12/Jun/20 16:18 Start Date: 12/Jun/20 16:18 Worklog Time Spent: 10m Work Description: xhumanoid opened a new pull request #807: URL: https://github.com/apache/hive/pull/807 Remove synchronization on cache retrieve have big performance impact will be great to have in branch-3 also This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 445015) Time Spent: 2h 10m (was: 2h) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 2h 10m > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=441282&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441282 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 04/Jun/20 13:50 Start Date: 04/Jun/20 13:50 Worklog Time Spent: 10m Work Description: belugabehr commented on a change in pull request #807: URL: https://github.com/apache/hive/pull/807#discussion_r435273094 ## File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java ## @@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws AvroSerdeException { * Retrieve (or create if it doesn't exist) the correct Instance for this * SeedObject using 'seenSchemas' to resolve circular references */ - public synchronized Instance retrieve(SeedObject hv, - Set seenSchemas) throws AvroSerdeException { + public Instance retrieve(SeedObject hv, Set seenSchemas) +throws AvroSerdeException { if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString()); if(cache.containsKey(hv)) { Review comment: OK. That makes sense. Thanks for running that down. If you have a schema composed of several sub-schemas, then storing the root schema in the cache will trigger the creation of all the sub-schemas and those will also be inserted into the cache while the the initial insert still has the lock. With that said, I don't think it's worth changing the current construct. Unless there is some sort of demonstrable performance value here, I think it's confusing to use a `ConcurrentHashMap` but not be able to use the `computeX()` routine. If you would like, you can clean up the logging, shrink the synchronized block (instead of the entire method), and make some simplifications like... ``` Instance instance = cache.get(hv); if (o == null) { instance = makeInstance(hv, seenSchemas); cache.put(hv, instance); } return instance; ``` In that way, the synchronized lock will be held for shorter intervals. Oh, and obviously need JavaDoc here to explain the reentrant piece. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 441282) Time Spent: 1h 50m (was: 1h 40m) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 1h 50m > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=441014&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441014 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 03/Jun/20 22:45 Start Date: 03/Jun/20 22:45 Worklog Time Spent: 10m Work Description: mythrocks commented on a change in pull request #807: URL: https://github.com/apache/hive/pull/807#discussion_r434898293 ## File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java ## @@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws AvroSerdeException { * Retrieve (or create if it doesn't exist) the correct Instance for this * SeedObject using 'seenSchemas' to resolve circular references */ - public synchronized Instance retrieve(SeedObject hv, - Set seenSchemas) throws AvroSerdeException { + public Instance retrieve(SeedObject hv, Set seenSchemas) +throws AvroSerdeException { if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString()); if(cache.containsKey(hv)) { Review comment: Ok, I think I have it. [`InstanceCache::retrieve()`](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java#L66) modifies the cache here: ```java Instance instance = makeInstance(hv, seenSchemas); cache.put(hv, instance); return instance; ``` We would like to use `computeIfAbsent()` instead: ```java cache.computeIfAbsent(hv, ()->makeInstance(hv, seenSchemas)); ``` If `makeInstance()` ends up modifying `cache`, (e.g. by calling `retrieve()`), that would cause undefined behaviour in`computeIfAbsent()`. [`SchemaToTypeInfo`](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/serde/src/java/org/apache/hadoop/hive/serde2/avro/SchemaToTypeInfo.java#L116) implements `InstanceCache::makeInstance()` as follows: ```java static InstanceCache typeInfoCache = new InstanceCache() { @Override protected TypeInfo makeInstance(Schema s, Set seenSchemas) throws AvroSerdeException { return generateTypeInfoWorker(s, seenSchemas); // < HERE! } }; ``` Please note that [`SchemaToTypeInfo::generateTypeInfo()`](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/serde/src/java/org/apache/hadoop/hive/serde2/avro/SchemaToTypeInfo.java#L186) happens to call `InstanceCache::retrieve()`: ```java public static TypeInfo generateTypeInfo(Schema schema, Set seenSchemas) throws AvroSerdeException { // ... return typeInfoCache.retrieve(schema, seenSchemas); } ``` Hence, the recursive call. :/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 441014) Time Spent: 1h 40m (was: 1.5h) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 1h 40m > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=441006&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441006 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 03/Jun/20 22:25 Start Date: 03/Jun/20 22:25 Worklog Time Spent: 10m Work Description: mythrocks commented on a change in pull request #807: URL: https://github.com/apache/hive/pull/807#discussion_r434891408 ## File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java ## @@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws AvroSerdeException { * Retrieve (or create if it doesn't exist) the correct Instance for this * SeedObject using 'seenSchemas' to resolve circular references */ - public synchronized Instance retrieve(SeedObject hv, - Set seenSchemas) throws AvroSerdeException { + public Instance retrieve(SeedObject hv, Set seenSchemas) +throws AvroSerdeException { if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString()); if(cache.containsKey(hv)) { Review comment: > I poked around a bit, nothing is obviously reentrant. Sorry for the delay. I remember this being convoluted... Let me see if I can recall this. Just a moment. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 441006) Time Spent: 1.5h (was: 1h 20m) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=440911&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-440911 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 03/Jun/20 18:13 Start Date: 03/Jun/20 18:13 Worklog Time Spent: 10m Work Description: belugabehr commented on a change in pull request #807: URL: https://github.com/apache/hive/pull/807#discussion_r434761769 ## File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java ## @@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws AvroSerdeException { * Retrieve (or create if it doesn't exist) the correct Instance for this * SeedObject using 'seenSchemas' to resolve circular references */ - public synchronized Instance retrieve(SeedObject hv, - Set seenSchemas) throws AvroSerdeException { + public Instance retrieve(SeedObject hv, Set seenSchemas) +throws AvroSerdeException { if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString()); if(cache.containsKey(hv)) { Review comment: I poked around a bit, nothing is obviously reentrant. Do you know where this is happening? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 440911) Time Spent: 1h 20m (was: 1h 10m) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 1h 20m > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=440899&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-440899 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 03/Jun/20 17:35 Start Date: 03/Jun/20 17:35 Worklog Time Spent: 10m Work Description: mythrocks commented on a change in pull request #807: URL: https://github.com/apache/hive/pull/807#discussion_r434739477 ## File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java ## @@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws AvroSerdeException { * Retrieve (or create if it doesn't exist) the correct Instance for this * SeedObject using 'seenSchemas' to resolve circular references */ - public synchronized Instance retrieve(SeedObject hv, - Set seenSchemas) throws AvroSerdeException { + public Instance retrieve(SeedObject hv, Set seenSchemas) +throws AvroSerdeException { if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString()); if(cache.containsKey(hv)) { Review comment: I had left [a comment about this on HIVE-19261](https://issues.apache.org/jira/browse/HIVE-19261?focusedCommentId=16929610&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16929610). I think `makeInstance()` makes reentrant calls into `InstanceCache::retrieve()`. `computeIfAbsent()` expects that `cache` not be modified in the lambda. Calling calling `makeInstance()` from `computeIfAbsent()` might lead to a hang. I ran into this when testing an older version of this patch, some months ago. It's possible that the reentrant call no longer occurs. I'd check this before using `computeIfAbsent()`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 440899) Time Spent: 1h 10m (was: 1h) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=440851&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-440851 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 03/Jun/20 16:01 Start Date: 03/Jun/20 16:01 Worklog Time Spent: 10m Work Description: belugabehr commented on a change in pull request #807: URL: https://github.com/apache/hive/pull/807#discussion_r434678396 ## File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java ## @@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws AvroSerdeException { * Retrieve (or create if it doesn't exist) the correct Instance for this * SeedObject using 'seenSchemas' to resolve circular references */ - public synchronized Instance retrieve(SeedObject hv, - Set seenSchemas) throws AvroSerdeException { + public Instance retrieve(SeedObject hv, Set seenSchemas) +throws AvroSerdeException { if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString()); Review comment: Please change this (and the rest) to: ` LOG.debug("Checking for hv: {}", hv); ` http://www.slf4j.org/faq.html#logging_performance ## File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java ## @@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws AvroSerdeException { * Retrieve (or create if it doesn't exist) the correct Instance for this * SeedObject using 'seenSchemas' to resolve circular references */ - public synchronized Instance retrieve(SeedObject hv, - Set seenSchemas) throws AvroSerdeException { + public Instance retrieve(SeedObject hv, Set seenSchemas) +throws AvroSerdeException { if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString()); if(cache.containsKey(hv)) { Review comment: Please leverage `Map#computeIfAbsent` here This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 440851) Time Spent: 50m (was: 40m) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 50m > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=440852&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-440852 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 03/Jun/20 16:01 Start Date: 03/Jun/20 16:01 Worklog Time Spent: 10m Work Description: belugabehr commented on pull request #807: URL: https://github.com/apache/hive/pull/807#issuecomment-638292070 Hive project is testing out using GitHub now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 440852) Time Spent: 1h (was: 50m) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Alexey Diomin >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 1h > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=326191&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326191 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 10/Oct/19 08:59 Start Date: 10/Oct/19 08:59 Worklog Time Spent: 10m Work Description: dlavati commented on pull request #807: HIVE-19261: Avro SerDe's InstanceCache should not be synchronized on retrieve URL: https://github.com/apache/hive/pull/807#discussion_r333403943 ## File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java ## @@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws AvroSerdeException { * Retrieve (or create if it doesn't exist) the correct Instance for this * SeedObject using 'seenSchemas' to resolve circular references */ - public synchronized Instance retrieve(SeedObject hv, - Set seenSchemas) throws AvroSerdeException { + public Instance retrieve(SeedObject hv, Set seenSchemas) +throws AvroSerdeException { if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString()); if(cache.containsKey(hv)) { if(LOG.isDebugEnabled()) LOG.debug("Returning cache result."); return cache.get(hv); +} else { + if(LOG.isDebugEnabled()) LOG.debug("Creating new instance and storing in cache"); Review comment: @xhumanoid thanks for the contribution! I don't have committer rights to approve your patch, but LGTM with a few comments: - a small nitpick, but AFAIK, `isDebugEnabled` doesn't provide benefits here, only if you'd append/toString an expensive operation's result (yes, this is also ignored in a lot of places 😄) - to approve (+1) your changes, you'll have to become the assignee of the jira ticket that's currently under Fangshi Li (if that's fine with him) and change it's state with `Submit a patch` with an attached patch file (the previously attached file didn't get checked, as the issue state didn't change) - once all tests pass for it (might require a few reattachments/reruns), then someone will be able to approve it (also see https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-CreatingaPatch) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 326191) Time Spent: 40m (was: 0.5h) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Fangshi Li >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 40m > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=326190&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326190 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 10/Oct/19 08:57 Start Date: 10/Oct/19 08:57 Worklog Time Spent: 10m Work Description: dlavati commented on pull request #807: HIVE-19261: Avro SerDe's InstanceCache should not be synchronized on retrieve URL: https://github.com/apache/hive/pull/807#discussion_r333403943 ## File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java ## @@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws AvroSerdeException { * Retrieve (or create if it doesn't exist) the correct Instance for this * SeedObject using 'seenSchemas' to resolve circular references */ - public synchronized Instance retrieve(SeedObject hv, - Set seenSchemas) throws AvroSerdeException { + public Instance retrieve(SeedObject hv, Set seenSchemas) +throws AvroSerdeException { if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString()); if(cache.containsKey(hv)) { if(LOG.isDebugEnabled()) LOG.debug("Returning cache result."); return cache.get(hv); +} else { + if(LOG.isDebugEnabled()) LOG.debug("Creating new instance and storing in cache"); Review comment: @xhumanoid thanks for the contribution! I don't have committer rights to approve your patch, but LGTM with a few comments: - a small nitpick, but AFAIK, `isDebugEnabled` doesn't provide benefits here, only if you'd append/toString an expensive operation's result - to approve (+1) your changes, you'll have to become the assignee of the jira ticket that's currently under Fangshi Li (if that's fine with him) and change it's state with `Submit a patch` with an attached patch file (the previously attached file didn't get checked, as the issue state didn't change) - once all tests pass for it (might require a few reattachments/reruns), then someone will be able to approve it (also see https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-CreatingaPatch) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 326190) Time Spent: 0.5h (was: 20m) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Fangshi Li >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=326158&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326158 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 10/Oct/19 07:50 Start Date: 10/Oct/19 07:50 Worklog Time Spent: 10m Work Description: xhumanoid commented on issue #807: HIVE-19261: Avro SerDe's InstanceCache should not be synchronized on retrieve URL: https://github.com/apache/hive/pull/807#issuecomment-540443922 @kgyrtkirk @dlavati could you review? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 326158) Time Spent: 20m (was: 10m) > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Fangshi Li >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 20m > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve
[ https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=325789&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325789 ] ASF GitHub Bot logged work on HIVE-19261: - Author: ASF GitHub Bot Created on: 09/Oct/19 16:04 Start Date: 09/Oct/19 16:04 Worklog Time Spent: 10m Work Description: xhumanoid commented on pull request #807: HIVE-19261: Avro SerDe's InstanceCache should not be synchronized on retrieve URL: https://github.com/apache/hive/pull/807 Remove synchronization on cache retrieve have big performance impact will be great to have in branch-3 also This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 325789) Remaining Estimate: 0h Time Spent: 10m > Avro SerDe's InstanceCache should not be synchronized on retrieve > - > > Key: HIVE-19261 > URL: https://issues.apache.org/jira/browse/HIVE-19261 > Project: Hive > Issue Type: Improvement >Reporter: Fangshi Li >Assignee: Fangshi Li >Priority: Major > Labels: pull-request-available > Attachments: HIVE-19261.1.patch > > Time Spent: 10m > Remaining Estimate: 0h > > In HIVE-16175, upstream made a patch to fix the thread safety issue in > AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache > synchronized. While it should make InstanceCache thread-safe, making retrieve > synchronized for the cache can be expensive in highly concurrent environment > like Spark, as multiple threads need to be synchronized on entering the > entire retrieve method. > We are proposing another way to fix this thread safety issue by making the > underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use > atomic computeIfAbsent in the retrieve method to avoid synchronizing the > entire method. > While computeIfAbsent is only available on java 8 and java 7 is still > supported in Hive, > we use a pattern to simulate the behavior of computeIfAbsent. In the future, > we should move to computeIfAbsent when Hive requires java 8. -- This message was sent by Atlassian Jira (v8.3.4#803005)