[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=448478&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-448478
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 19/Jun/20 13:36
Start Date: 19/Jun/20 13:36
Worklog Time Spent: 10m 
  Work Description: belugabehr merged pull request #807:
URL: https://github.com/apache/hive/pull/807


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 448478)
Time Spent: 2h 40m  (was: 2.5h)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=448475&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-448475
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 19/Jun/20 13:31
Start Date: 19/Jun/20 13:31
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on pull request #807:
URL: https://github.com/apache/hive/pull/807#issuecomment-646637122


   I am going to merge to master.  We can deal with logging in another task.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 448475)
Time Spent: 2.5h  (was: 2h 20m)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=445016&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-445016
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 12/Jun/20 16:19
Start Date: 12/Jun/20 16:19
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on pull request #807:
URL: https://github.com/apache/hive/pull/807#issuecomment-643361096


   Closed & Re-opened to kick of full test suite.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 445016)
Time Spent: 2h 20m  (was: 2h 10m)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=445014&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-445014
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 12/Jun/20 16:18
Start Date: 12/Jun/20 16:18
Worklog Time Spent: 10m 
  Work Description: belugabehr closed pull request #807:
URL: https://github.com/apache/hive/pull/807


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 445014)
Time Spent: 2h  (was: 1h 50m)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=445015&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-445015
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 12/Jun/20 16:18
Start Date: 12/Jun/20 16:18
Worklog Time Spent: 10m 
  Work Description: xhumanoid opened a new pull request #807:
URL: https://github.com/apache/hive/pull/807


   Remove synchronization on cache retrieve have big performance impact
   will be great to have in branch-3 also



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 445015)
Time Spent: 2h 10m  (was: 2h)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=441282&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441282
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 04/Jun/20 13:50
Start Date: 04/Jun/20 13:50
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on a change in pull request #807:
URL: https://github.com/apache/hive/pull/807#discussion_r435273094



##
File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java
##
@@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws 
AvroSerdeException {
* Retrieve (or create if it doesn't exist) the correct Instance for this
* SeedObject using 'seenSchemas' to resolve circular references
*/
-  public synchronized Instance retrieve(SeedObject hv,
-  Set seenSchemas) throws AvroSerdeException {
+  public Instance retrieve(SeedObject hv, Set seenSchemas)
+throws AvroSerdeException {
 if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString());
 
 if(cache.containsKey(hv)) {

Review comment:
   OK.  That makes sense.  Thanks for running that down.
   
   If you have a schema composed of several sub-schemas, then storing the root 
schema in the cache will trigger the creation of all the sub-schemas and those 
will also be inserted into the cache while the the initial insert still has the 
lock.
   
   With that said, I don't think it's worth changing the current construct.  
Unless there is some sort of demonstrable performance value here, I think it's 
confusing to use a `ConcurrentHashMap` but not be able to use the `computeX()` 
routine.   If you would like, you can clean up the logging, shrink the 
synchronized block (instead of the entire method), and make some 
simplifications like...
   
   ```
   Instance instance = cache.get(hv);
   if (o == null) {
   instance = makeInstance(hv, seenSchemas);
   cache.put(hv, instance);
   } 
   return instance;
   ```
   
   In that way, the synchronized lock will be held for shorter intervals.  Oh, 
and obviously need JavaDoc here to explain the reentrant piece.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 441282)
Time Spent: 1h 50m  (was: 1h 40m)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=441014&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441014
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 03/Jun/20 22:45
Start Date: 03/Jun/20 22:45
Worklog Time Spent: 10m 
  Work Description: mythrocks commented on a change in pull request #807:
URL: https://github.com/apache/hive/pull/807#discussion_r434898293



##
File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java
##
@@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws 
AvroSerdeException {
* Retrieve (or create if it doesn't exist) the correct Instance for this
* SeedObject using 'seenSchemas' to resolve circular references
*/
-  public synchronized Instance retrieve(SeedObject hv,
-  Set seenSchemas) throws AvroSerdeException {
+  public Instance retrieve(SeedObject hv, Set seenSchemas)
+throws AvroSerdeException {
 if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString());
 
 if(cache.containsKey(hv)) {

Review comment:
   Ok, I think I have it.
   
   
[`InstanceCache::retrieve()`](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java#L66)
 modifies the cache here:
   ```java
   Instance instance = makeInstance(hv, seenSchemas);
   cache.put(hv, instance);
   return instance;
   ```
   We would like to use `computeIfAbsent()` instead:
   ```java
   cache.computeIfAbsent(hv, ()->makeInstance(hv, seenSchemas));
   ```
   If `makeInstance()` ends up modifying `cache`, (e.g. by calling 
`retrieve()`), that would cause undefined behaviour in`computeIfAbsent()`.
   
   
[`SchemaToTypeInfo`](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/serde/src/java/org/apache/hadoop/hive/serde2/avro/SchemaToTypeInfo.java#L116)
 implements `InstanceCache::makeInstance()` as follows:
   ```java
 static InstanceCache typeInfoCache = new 
InstanceCache() {
 @Override
 protected TypeInfo makeInstance(Schema s,
 Set seenSchemas)
 throws AvroSerdeException {
   return generateTypeInfoWorker(s, 
seenSchemas); // < HERE!
 }
   };
   ```
   
   Please note that 
[`SchemaToTypeInfo::generateTypeInfo()`](https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/serde/src/java/org/apache/hadoop/hive/serde2/avro/SchemaToTypeInfo.java#L186)
 happens to call `InstanceCache::retrieve()`:
   ```java
 public static TypeInfo generateTypeInfo(Schema schema,
 Set seenSchemas) throws AvroSerdeException {
   // ...
   return typeInfoCache.retrieve(schema, seenSchemas);
 }
   ```
   Hence, the recursive call. :/





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 441014)
Time Spent: 1h 40m  (was: 1.5h)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to 

[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=441006&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-441006
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 03/Jun/20 22:25
Start Date: 03/Jun/20 22:25
Worklog Time Spent: 10m 
  Work Description: mythrocks commented on a change in pull request #807:
URL: https://github.com/apache/hive/pull/807#discussion_r434891408



##
File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java
##
@@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws 
AvroSerdeException {
* Retrieve (or create if it doesn't exist) the correct Instance for this
* SeedObject using 'seenSchemas' to resolve circular references
*/
-  public synchronized Instance retrieve(SeedObject hv,
-  Set seenSchemas) throws AvroSerdeException {
+  public Instance retrieve(SeedObject hv, Set seenSchemas)
+throws AvroSerdeException {
 if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString());
 
 if(cache.containsKey(hv)) {

Review comment:
   > I poked around a bit, nothing is obviously reentrant.
   
   Sorry for the delay. I remember this being convoluted... Let me see if I can 
recall this. Just a moment.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 441006)
Time Spent: 1.5h  (was: 1h 20m)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=440911&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-440911
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 03/Jun/20 18:13
Start Date: 03/Jun/20 18:13
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on a change in pull request #807:
URL: https://github.com/apache/hive/pull/807#discussion_r434761769



##
File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java
##
@@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws 
AvroSerdeException {
* Retrieve (or create if it doesn't exist) the correct Instance for this
* SeedObject using 'seenSchemas' to resolve circular references
*/
-  public synchronized Instance retrieve(SeedObject hv,
-  Set seenSchemas) throws AvroSerdeException {
+  public Instance retrieve(SeedObject hv, Set seenSchemas)
+throws AvroSerdeException {
 if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString());
 
 if(cache.containsKey(hv)) {

Review comment:
   I poked around a bit, nothing is obviously reentrant.  Do you know where 
this is happening?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 440911)
Time Spent: 1h 20m  (was: 1h 10m)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=440899&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-440899
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 03/Jun/20 17:35
Start Date: 03/Jun/20 17:35
Worklog Time Spent: 10m 
  Work Description: mythrocks commented on a change in pull request #807:
URL: https://github.com/apache/hive/pull/807#discussion_r434739477



##
File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java
##
@@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws 
AvroSerdeException {
* Retrieve (or create if it doesn't exist) the correct Instance for this
* SeedObject using 'seenSchemas' to resolve circular references
*/
-  public synchronized Instance retrieve(SeedObject hv,
-  Set seenSchemas) throws AvroSerdeException {
+  public Instance retrieve(SeedObject hv, Set seenSchemas)
+throws AvroSerdeException {
 if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString());
 
 if(cache.containsKey(hv)) {

Review comment:
   I had left [a comment about this on 
HIVE-19261](https://issues.apache.org/jira/browse/HIVE-19261?focusedCommentId=16929610&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16929610).
   
   I think `makeInstance()` makes reentrant calls into 
`InstanceCache::retrieve()`. 
   `computeIfAbsent()` expects that `cache` not be modified in the lambda. 
   Calling calling `makeInstance()` from `computeIfAbsent()` might lead to a 
hang. I ran into this when testing an older version of this patch, some months 
ago.
   
   It's possible that the reentrant call no longer occurs. I'd check this 
before using `computeIfAbsent()`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 440899)
Time Spent: 1h 10m  (was: 1h)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=440851&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-440851
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 03/Jun/20 16:01
Start Date: 03/Jun/20 16:01
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on a change in pull request #807:
URL: https://github.com/apache/hive/pull/807#discussion_r434678396



##
File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java
##
@@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws 
AvroSerdeException {
* Retrieve (or create if it doesn't exist) the correct Instance for this
* SeedObject using 'seenSchemas' to resolve circular references
*/
-  public synchronized Instance retrieve(SeedObject hv,
-  Set seenSchemas) throws AvroSerdeException {
+  public Instance retrieve(SeedObject hv, Set seenSchemas)
+throws AvroSerdeException {
 if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString());

Review comment:
   Please change this (and the rest) to:
   `
   LOG.debug("Checking for hv: {}", hv);
   `
   http://www.slf4j.org/faq.html#logging_performance

##
File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java
##
@@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws 
AvroSerdeException {
* Retrieve (or create if it doesn't exist) the correct Instance for this
* SeedObject using 'seenSchemas' to resolve circular references
*/
-  public synchronized Instance retrieve(SeedObject hv,
-  Set seenSchemas) throws AvroSerdeException {
+  public Instance retrieve(SeedObject hv, Set seenSchemas)
+throws AvroSerdeException {
 if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString());
 
 if(cache.containsKey(hv)) {

Review comment:
   Please leverage `Map#computeIfAbsent` here





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 440851)
Time Spent: 50m  (was: 40m)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2020-06-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=440852&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-440852
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 03/Jun/20 16:01
Start Date: 03/Jun/20 16:01
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on pull request #807:
URL: https://github.com/apache/hive/pull/807#issuecomment-638292070


   Hive project is testing out using GitHub now.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 440852)
Time Spent: 1h  (was: 50m)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Alexey Diomin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2019-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=326191&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326191
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 10/Oct/19 08:59
Start Date: 10/Oct/19 08:59
Worklog Time Spent: 10m 
  Work Description: dlavati commented on pull request #807: HIVE-19261: 
Avro SerDe's InstanceCache should not be synchronized on retrieve
URL: https://github.com/apache/hive/pull/807#discussion_r333403943
 
 

 ##
 File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java
 ##
 @@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws 
AvroSerdeException {
* Retrieve (or create if it doesn't exist) the correct Instance for this
* SeedObject using 'seenSchemas' to resolve circular references
*/
-  public synchronized Instance retrieve(SeedObject hv,
-  Set seenSchemas) throws AvroSerdeException {
+  public Instance retrieve(SeedObject hv, Set seenSchemas)
+throws AvroSerdeException {
 if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString());
 
 if(cache.containsKey(hv)) {
   if(LOG.isDebugEnabled()) LOG.debug("Returning cache result.");
   return cache.get(hv);
+} else {
+  if(LOG.isDebugEnabled()) LOG.debug("Creating new instance and storing in 
cache");
 
 Review comment:
   @xhumanoid thanks for the contribution! I don't have committer rights to 
approve your patch, but LGTM with a few comments:
   - a small nitpick, but AFAIK, `isDebugEnabled` doesn't provide benefits 
here, only if you'd append/toString an expensive operation's result (yes, this 
is also ignored in a lot of places 😄)
   - to approve (+1) your changes, you'll have to become the assignee of the 
jira ticket that's currently under Fangshi Li (if that's fine with him) and 
change it's state with `Submit a patch` with an attached patch file (the 
previously attached file didn't get checked, as the issue state didn't change)
   - once all tests pass for it (might require a few reattachments/reruns), 
then someone will be able to approve it (also see 
https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-CreatingaPatch)
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 326191)
Time Spent: 40m  (was: 0.5h)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Fangshi Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2019-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=326190&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326190
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 10/Oct/19 08:57
Start Date: 10/Oct/19 08:57
Worklog Time Spent: 10m 
  Work Description: dlavati commented on pull request #807: HIVE-19261: 
Avro SerDe's InstanceCache should not be synchronized on retrieve
URL: https://github.com/apache/hive/pull/807#discussion_r333403943
 
 

 ##
 File path: serde/src/java/org/apache/hadoop/hive/serde2/avro/InstanceCache.java
 ##
 @@ -51,20 +51,19 @@ public Instance retrieve(SeedObject hv) throws 
AvroSerdeException {
* Retrieve (or create if it doesn't exist) the correct Instance for this
* SeedObject using 'seenSchemas' to resolve circular references
*/
-  public synchronized Instance retrieve(SeedObject hv,
-  Set seenSchemas) throws AvroSerdeException {
+  public Instance retrieve(SeedObject hv, Set seenSchemas)
+throws AvroSerdeException {
 if(LOG.isDebugEnabled()) LOG.debug("Checking for hv: " + hv.toString());
 
 if(cache.containsKey(hv)) {
   if(LOG.isDebugEnabled()) LOG.debug("Returning cache result.");
   return cache.get(hv);
+} else {
+  if(LOG.isDebugEnabled()) LOG.debug("Creating new instance and storing in 
cache");
 
 Review comment:
   @xhumanoid thanks for the contribution! I don't have committer rights to 
approve your patch, but LGTM with a few comments:
   - a small nitpick, but AFAIK, `isDebugEnabled` doesn't provide benefits 
here, only if you'd append/toString an expensive operation's result
   - to approve (+1) your changes, you'll have to become the assignee of the 
jira ticket that's currently under Fangshi Li (if that's fine with him) and 
change it's state with `Submit a patch` with an attached patch file (the 
previously attached file didn't get checked, as the issue state didn't change)
   - once all tests pass for it (might require a few reattachments/reruns), 
then someone will be able to approve it (also see 
https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-CreatingaPatch)
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 326190)
Time Spent: 0.5h  (was: 20m)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Fangshi Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2019-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=326158&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326158
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 10/Oct/19 07:50
Start Date: 10/Oct/19 07:50
Worklog Time Spent: 10m 
  Work Description: xhumanoid commented on issue #807: HIVE-19261: Avro 
SerDe's InstanceCache should not be synchronized on retrieve
URL: https://github.com/apache/hive/pull/807#issuecomment-540443922
 
 
   @kgyrtkirk @dlavati could you review?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 326158)
Time Spent: 20m  (was: 10m)

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Fangshi Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-19261) Avro SerDe's InstanceCache should not be synchronized on retrieve

2019-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-19261?focusedWorklogId=325789&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325789
 ]

ASF GitHub Bot logged work on HIVE-19261:
-

Author: ASF GitHub Bot
Created on: 09/Oct/19 16:04
Start Date: 09/Oct/19 16:04
Worklog Time Spent: 10m 
  Work Description: xhumanoid commented on pull request #807: HIVE-19261: 
Avro SerDe's InstanceCache should not be synchronized on retrieve
URL: https://github.com/apache/hive/pull/807
 
 
   Remove synchronization on cache retrieve have big performance impact
   will be great to have in branch-3 also
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 325789)
Remaining Estimate: 0h
Time Spent: 10m

> Avro SerDe's InstanceCache should not be synchronized on retrieve
> -
>
> Key: HIVE-19261
> URL: https://issues.apache.org/jira/browse/HIVE-19261
> Project: Hive
>  Issue Type: Improvement
>Reporter: Fangshi Li
>Assignee: Fangshi Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-19261.1.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In HIVE-16175, upstream made a patch to fix the thread safety issue in 
> AvroSerDe's InstanceCache. This fix made the retrieve method in InstanceCache 
> synchronized. While it should make InstanceCache thread-safe, making retrieve 
> synchronized for the cache can be expensive in highly concurrent environment 
> like Spark, as multiple threads need to be synchronized on entering the 
> entire retrieve method.
> We are proposing another way to fix this thread safety issue by making the 
> underlying map of InstanceCache as ConcurrentHashMap. Ideally, we can use 
> atomic computeIfAbsent in the retrieve method to avoid synchronizing the 
> entire method.
> While computeIfAbsent is only available on java 8 and java 7 is still 
> supported in Hive,
> we use a pattern to simulate the behavior of computeIfAbsent. In the future, 
> we should move to computeIfAbsent when Hive requires java 8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)