[ 
https://issues.apache.org/jira/browse/THRIFT-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365025#comment-17365025
 ] 

Divye Kapoor commented on THRIFT-5430:
--------------------------------------

I wanted to put this in as a comment within the code, but it's too long and 
might pollute the code needlessly, leaving it here for anyone who will be 
interested in following up from the code references which will point to this 
ticket.
{code:java}
// Note: Do not use synchronized on this method declaration - it leads to a 
deadlock.
// Similarly, do not trigger sClass.newInstance() while holding a lock on 
structMap,
// it will lead to the same deadlock.
// See: https://issues.apache.org/jira/browse/THRIFT-5430 for details.{code}
{code:java}
public static Map<? extends TFieldIdEnum, FieldMetaData> getStructMetaDataMap(
    Class<? extends TBase> sClass) {
//
// Please keep in mind that the call to sClass.newInstance() below is meant to 
trigger the
// class static initializer process in native JVM code on the CURRENT thread.
// The process is described here:
// https://docs.oracle.com/javase/specs/jls/se8/html/jls-12.html#jls-12.4.2
//
// If another thread is also triggering the SAME static initializer code in 
native JVM
// through a different code path (eg. via new ThriftObject()), the two threads 
will compete
// on which one of them will actually execute the static initializer.
// This decision is taken in JVM native code. (Example: as part of the
// instanceKlass::initialize_impl method below)
// 
https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/f0eac5a3aff7d413e2ef0532d7464a58d7a15237/hotspot/src/share/vm/oops/instanceKlass.cpp#L833
//
// If the CURRENT thread wins and executes the static initializer, there is no 
problem.
// The static initializer of the Thrift Struct will call
// code similar to this:
//
//  static {
//    Map<_Fields, FieldMetaData> tmpMap = new EnumMap<_Fields, 
FieldMetaData>(_Fields
//    .class);
//    Set<FieldValueMetaData> tmpSet = new HashSet<FieldValueMetaData>();
//    tmpMap.put(_Fields.EVENT_TYPE, new FieldMetaData("eventType", 
TFieldRequirementType
//    .OPTIONAL,
//      new EnumMetaData(TType.ENUM, EventType.class)));
//    tmpMap.put(_Fields.TIMESTAMP, new FieldMetaData("timestamp", 
TFieldRequirementType
//    .OPTIONAL,
//      new FieldValueMetaData(TType.I64)));
//
//    metaDataMap = Collections.unmodifiableMap(tmpMap);
//    binaryFieldValueMetaDatas = Collections.unmodifiableSet(tmpSet);
//    FieldMetaData.addStructMetaDataMap(IndexDataEvent.class, metaDataMap);  
// Re-entrant call
//  }
//
// The synchronized lock for FieldMetaData.class will be granted (because this 
thread is
// allowed to lock a lock it already holds). Here's the Java Spec on this issue:
// https://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html#jls-17.1
//
// "Each object in Java is associated with a monitor, which a thread can lock 
or unlock.
// Only one thread at a time may hold a lock on a monitor. Any other threads 
attempting to
// lock that monitor are blocked until they can obtain a lock on that monitor. 
A thread t
// may lock a particular monitor multiple times; each unlock reverses the 
effect of
// one lock operation."
//
// If the OTHER thread wins and executes the static initalizer though, the 
CURRENT thread
// is STUCK in native code in Step 2 of the class initialization process:
// "If the Class object for C indicates that initialization is in progress for 
C by some
// other thread, then release LC and block the current thread until informed 
that the
// in-progress initialization has completed, at which time repeat this step."
//
// The OTHER thread is also blocked in Step 9 because it won't ever finish the 
static
// initialization call (the re-entrant call tries to acquire a lock that is 
held by the
// CURRENT thread). Therefore, we have a deadlock.
//
// How to avoid the deadlock: Don't hold any locks when calling sClass
// .newInstance() because of the re-entrant call to this class from the static 
initializer. 
  if (!structMap.containsKey(sClass)){ // Load class if it hasn't been loaded
    try{
      sClass.newInstance();
    } catch (InstantiationException e){
      throw new RuntimeException("InstantiationException for TBase class: " + 
sClass.getName() + ", message: " + e.getMessage());
    } catch (IllegalAccessException e){
      throw new RuntimeException("IllegalAccessException for TBase class: " + 
sClass.getName() + ", message: " + e.getMessage());
    }
  }
  return structMap.get(sClass);
}{code}

> FieldMetaData synchronized method can trigger deadlock during static class 
> initialization in JVM native code
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-5430
>                 URL: https://issues.apache.org/jira/browse/THRIFT-5430
>             Project: Thrift
>          Issue Type: Bug
>          Components: Java - Library
>    Affects Versions: 0.6, 0.6.1, 0.7, 0.8, 0.9, 0.9.1, 0.9.2, 0.9.3, 0.9.3.1, 
> 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 0.14.1
>         Environment: AdoptJDK 1.8 (Java 8) ; Linux, AWS machines. 
> Deadlock is environment independent and occurs in conformance with the JRE 8 
> spec.
>  
>            Reporter: Divye Kapoor
>            Priority: Major
>
> We (Pinterest) hit the following deadlock during JVM classloading of Thrift 
> classes. The root cause was triggering sClass.newInstance() while holding the 
> synchronized lock on FieldMetaData.class::getStructMetaDataMap(..)
> Here's the stacktraces of interest: 
>  Thread 1:
> {code:java}
> Stack Trace is:
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)  // 
> stuck here for > 30 mins
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at java.lang.Class.newInstance(Class.java:442) // creating an object via 
> reflection.
> at 
> org.apache.thrift.meta_data.FieldMetaData.getStructMetaDataMap(FieldMetaData.java:61)-
>  locked java.lang.Class@346242bb // Lock held on FieldMetaData.class
> at 
> com.pinterest.commons.thrift.ThriftStructDescriptor.<init>(ThriftStructDescriptor.java:56)
> at 
> com.pinterest.commons.thrift.ThriftStructDescriptor.getDescriptor(ThriftStructDescriptor.java:38)
>  {code}
> Thread 2:
> {code:java}
> Stack Trace is:
> at 
> org.apache.thrift.meta_data.FieldMetaData.addStructMetaDataMap(FieldMetaData.java:49)
> - blocked on java.lang.Class@346242bb  // Lock held by Thread 1.
> at 
> com.pinterest.schemas.search.query.NavboostScore.<clinit>(NavboostScore.java:146)
> at 
> com.pinterest.schemas.search.query.NavboostScores.read(NavboostScores.java:334)
> at 
> com.pinterest.schemas.search.query.SupplementaryTerm.read(SupplementaryTerm.java:1209)
> at 
> com.pinterest.schemas.search.query.SupplementaryTerms.read(SupplementaryTerms.java:335)
> at com.pinterest.schemas.search.query.QueryJoin.read(QueryJoin.java:6514)
> at 
> com.pinterest.pinpin.thrift.SerializedResourceContainer.read(SerializedResourceContainer.java:2867)
> at org.apache.thrift.TDeserializer.deserialize(TDeserializer.java:69) // 
> General Thrift deserialize. {code}
> Here's the code of interest:
> [https://github.com/apache/thrift/blob/65fb49bb41f852375b278c9057d52c9472f0cb3a/lib/java/src/org/apache/thrift/meta_data/FieldMetaData.java#L61]
>  
> Thread 1 has the following lock acquisition order:
> {noformat}
> FieldMetaData.class::getStructMetaDataMap 
> -> lock FieldMetaData.class (LOCK 1)
>   -> sClass.newInstance() 
>   -> load sClass 
>     -> JVM init_lock (native) lock and release. (LOCK 2)
>       -> waiting for Thread 2 to complete static initialization and notify 
> thread 1.{noformat}
> Thread 2 has the following lock acquisition order:
> {noformat}
> sClass.newInstance() (from new ThriftObject() / deserialization) 
>   ->  load sClass 
>   -> static initialization 
>   -> try locking FieldMetaData.class when calling 
> FieldMetaData.addStructMetaDataMap (WAIT on LOCK 1 which is never released)
>      ---> does not ever trigger the next step: 
>        ---> Notify Thread 1 that static initialization of the class has 
> completed (releasing LOCK 2 on Thread 1 which will eventually release LOCK 1 
> by returning from native code).{noformat}
> Internally, this was a fairly detailed investigation. Would be great if we 
> agree on a Fix approach. This deadlock affects all versions of Thrift since 
> 0.9 which introduced the synchronized keyword. Versions prior to 0.9 are 
> simply thread unsafe (because there's no lock on FieldMetaData's internal 
> HashMap and the two static method calls can race). I didn't check versions 
> below 0.5 but probably this extends all the way back. 
> This is not an issue in fbThrift, which correctly uses a ConcurrentHashMap 
> and does not take a lock on FieldMetaData. See this code here:
> [https://github.com/facebook/fbthrift/blob/c0f8ffb345d847df512ae9c404a58379fcd51cb9/thrift/lib/java/src/main/java/com/facebook/thrift/meta_data/FieldMetaData.java]
>  vs the buggy code here:
> [https://github.com/apache/thrift/blob/v0.14.1/lib/java/src/org/apache/thrift/meta_data/FieldMetaData.java]
>  
> Another alternative approach is to use 
> {code:java}
> synchronized (structMap} {
>  ... 
> } {code}
> instead of the synchronized keyword on the methods. 
> Please confirm which approach is preferred and we can send a PR on Github. 
>  
> Here's the original deadlock hypothesis as outlined by Jiahuan Liu:
> The hypothesis of the deadlock:
>  ThriftStructDescriptor.get is used in many places for deserialization and it 
> calls FieldMetaData.getStructMetaDataMap. inside this method, it tries to 
> load the thrift class if the thrift has not been loaded yet. thrift classes 
> call FieldMetaData.addStructMetaDataMap in a static block. so the deadlock 
> may happen when:
>  * thread A is trying to deserialize -> getStructMetaDataMap -> class loading
>  * if thread B is already trying to load the same thrift, it requires 
> addStructMetaDataMap to complete but this method is blocked by thread A 
> getStructMetaDataMap
>  * the class loading in thread A will be blocked by thread B because class 
> loading is also synchronized by java. see this section in java spec:
>  If the Class object for C indicates that initialization is in progress for C 
> by some other thread, then release LC and block the current thread until 
> informed that the in-progress initialization has completed, at which time 
> repeat this step.
> Here's the classloading spec in java 8:
>  [https://docs.oracle.com/javase/specs/jls/se8/html/jls-12.html#jls-12.4.2] 
> Thread 1 is stuck in step 2 of the initialization (waiting for Thread 2 to 
> notify in Step 10).
>  Thread 2 is stuck in step 9 of the initialization and never makes it to step 
> 10. 
>  Precondition: this is the first object creation for that Thrift object and 
> at-least 2 threads are racing to create the object, one of which holds a lock 
> on FieldMetaData.class. 
> Please reach out if this is unclear. This hypothesis was validated by reading 
> through the native JVM code on the class load -> static_initialization path. 
> Here's the native JVM code if you're interested: 
> [InstanceKlass::initialize_impl|https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/f0eac5a3aff7d413e2ef0532d7464a58d7a15237/hotspot/src/share/vm/oops/instanceKlass.cpp#L833]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to