[ 
https://issues.apache.org/jira/browse/THRIFT-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366915#comment-17366915
 ] 

Divye Kapoor commented on THRIFT-5430:
--------------------------------------

CI builds seem to be failing with an error (potentially unrelated):
{noformat}
34: ((FLET SB-FASL::LOAD-STREAM :IN LOAD) #<SB-SYS:FD-STREAM for "file 
/thrift/src/test/cl/make-test-server.lisp" {100154E873}> NIL)1836035: (LOAD 
#<SB-SYS:FD-STREAM for "file /thrift/src/test/cl/make-test-server.lisp" 
{100154E873}> :VERBOSE NIL :PRINT NIL :IF-DOES-NOT-EXIST T :EXTERNAL-FORMAT 
:DEFAULT)1836136: ((FLET SB-IMPL::LOAD-SCRIPT :IN SB-IMPL::PROCESS-SCRIPT) 
#<SB-SYS:FD-STREAM for "file /thrift/src/test/cl/make-test-server.lisp" 
{100154E873}>)1836237: ((FLET SB-UNIX::BODY :IN 
SB-IMPL::PROCESS-SCRIPT))1836338: ((FLET "WITHOUT-INTERRUPTS-BODY-2" :IN 
SB-IMPL::PROCESS-SCRIPT))1836439: (SB-IMPL::PROCESS-SCRIPT 
"make-test-server.lisp")1836540: (SB-IMPL::TOPLEVEL-INIT)1836641: ((FLET 
SB-UNIX::BODY :IN SAVE-LISP-AND-DIE))1836742: ((FLET 
"WITHOUT-INTERRUPTS-BODY-7" :IN SAVE-LISP-AND-DIE))1836843: ((LABELS 
SB-IMPL::RESTART-LISP :IN SAVE-LISP-AND-DIE))1836918370unhandled condition in 
--disable-debugger mode, quitting18371Makefile:626: recipe for target 
'TestServer' failed18372make[2]: *** [TestServer] Error 1
unhandled condition in --disable-debugger mode, quitting

Makefile:626: recipe for target 'TestServer' failed
make[2]: *** [TestServer] Error make[2]: Leaving directory '/thrift/src/test/cl'
Makefile:919: recipe for target 'precross-cl' failed
make[1]: *** [precross-cl] Error 
make[1]: Leaving directory '/thrift/src/test'
Makefile:1122: recipe for target 'precross-test' failed
make: *** [precross-test] Error 
make: *** Waiting for unfinished jobs.... {noformat}
Can someone confirm that this is a CI related infra issue and not a PR issue? I 
found this (related) issue online: 
[https://bugzilla.redhat.com/show_bug.cgi?id=214568] 

> FieldMetaData synchronized method can trigger deadlock during static class 
> initialization in JVM native code
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-5430
>                 URL: https://issues.apache.org/jira/browse/THRIFT-5430
>             Project: Thrift
>          Issue Type: Bug
>          Components: Java - Library
>    Affects Versions: 0.6, 0.6.1, 0.7, 0.8, 0.9, 0.9.1, 0.9.2, 0.9.3, 0.9.3.1, 
> 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 0.14.1
>         Environment: AdoptJDK 1.8 (Java 8) ; Linux, AWS machines. 
> Deadlock is environment independent and occurs in conformance with the JRE 8 
> spec.
>  
>            Reporter: Divye Kapoor
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We (Pinterest) hit the following deadlock during JVM classloading of Thrift 
> classes. The root cause was triggering sClass.newInstance() while holding the 
> synchronized lock on FieldMetaData.class::getStructMetaDataMap(..)
> Here's the stacktraces of interest: 
>  Thread 1:
> {code:java}
> Stack Trace is:
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)  // 
> stuck here for > 30 mins
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at java.lang.Class.newInstance(Class.java:442) // creating an object via 
> reflection.
> at 
> org.apache.thrift.meta_data.FieldMetaData.getStructMetaDataMap(FieldMetaData.java:61)-
>  locked java.lang.Class@346242bb // Lock held on FieldMetaData.class
> at 
> com.pinterest.commons.thrift.ThriftStructDescriptor.<init>(ThriftStructDescriptor.java:56)
> at 
> com.pinterest.commons.thrift.ThriftStructDescriptor.getDescriptor(ThriftStructDescriptor.java:38)
>  {code}
> Thread 2:
> {code:java}
> Stack Trace is:
> at 
> org.apache.thrift.meta_data.FieldMetaData.addStructMetaDataMap(FieldMetaData.java:49)
> - blocked on java.lang.Class@346242bb  // Lock held by Thread 1.
> at 
> com.pinterest.schemas.search.query.NavboostScore.<clinit>(NavboostScore.java:146)
> at 
> com.pinterest.schemas.search.query.NavboostScores.read(NavboostScores.java:334)
> at 
> com.pinterest.schemas.search.query.SupplementaryTerm.read(SupplementaryTerm.java:1209)
> at 
> com.pinterest.schemas.search.query.SupplementaryTerms.read(SupplementaryTerms.java:335)
> at com.pinterest.schemas.search.query.QueryJoin.read(QueryJoin.java:6514)
> at 
> com.pinterest.pinpin.thrift.SerializedResourceContainer.read(SerializedResourceContainer.java:2867)
> at org.apache.thrift.TDeserializer.deserialize(TDeserializer.java:69) // 
> General Thrift deserialize. {code}
> Here's the code of interest:
> [https://github.com/apache/thrift/blob/65fb49bb41f852375b278c9057d52c9472f0cb3a/lib/java/src/org/apache/thrift/meta_data/FieldMetaData.java#L61]
>  
> Thread 1 has the following lock acquisition order:
> {noformat}
> FieldMetaData.class::getStructMetaDataMap 
> -> lock FieldMetaData.class (LOCK 1)
>   -> sClass.newInstance() 
>   -> load sClass 
>     -> JVM init_lock (native) lock and release. (LOCK 2)
>       -> waiting for Thread 2 to complete static initialization and notify 
> thread 1.{noformat}
> Thread 2 has the following lock acquisition order:
> {noformat}
> sClass.newInstance() (from new ThriftObject() / deserialization) 
>   ->  load sClass 
>   -> static initialization 
>   -> try locking FieldMetaData.class when calling 
> FieldMetaData.addStructMetaDataMap (WAIT on LOCK 1 which is never released)
>      ---> does not ever trigger the next step: 
>        ---> Notify Thread 1 that static initialization of the class has 
> completed (releasing LOCK 2 on Thread 1 which will eventually release LOCK 1 
> by returning from native code).{noformat}
> Internally, this was a fairly detailed investigation. Would be great if we 
> agree on a Fix approach. This deadlock affects all versions of Thrift since 
> 0.9 which introduced the synchronized keyword. Versions prior to 0.9 are 
> simply thread unsafe (because there's no lock on FieldMetaData's internal 
> HashMap and the two static method calls can race). I didn't check versions 
> below 0.5 but probably this extends all the way back. 
> This is not an issue in fbThrift, which correctly uses a ConcurrentHashMap 
> and does not take a lock on FieldMetaData. See this code here:
> [https://github.com/facebook/fbthrift/blob/c0f8ffb345d847df512ae9c404a58379fcd51cb9/thrift/lib/java/src/main/java/com/facebook/thrift/meta_data/FieldMetaData.java]
>  vs the buggy code here:
> [https://github.com/apache/thrift/blob/v0.14.1/lib/java/src/org/apache/thrift/meta_data/FieldMetaData.java]
>  
> Another alternative approach is to use 
> {code:java}
> synchronized (structMap} {
>  ... 
> } {code}
> instead of the synchronized keyword on the methods. 
> Please confirm which approach is preferred and we can send a PR on Github. 
>  
> Here's the original deadlock hypothesis as outlined by Jiahuan Liu:
> The hypothesis of the deadlock:
>  ThriftStructDescriptor.get is used in many places for deserialization and it 
> calls FieldMetaData.getStructMetaDataMap. inside this method, it tries to 
> load the thrift class if the thrift has not been loaded yet. thrift classes 
> call FieldMetaData.addStructMetaDataMap in a static block. so the deadlock 
> may happen when:
>  * thread A is trying to deserialize -> getStructMetaDataMap -> class loading
>  * if thread B is already trying to load the same thrift, it requires 
> addStructMetaDataMap to complete but this method is blocked by thread A 
> getStructMetaDataMap
>  * the class loading in thread A will be blocked by thread B because class 
> loading is also synchronized by java. see this section in java spec:
>  If the Class object for C indicates that initialization is in progress for C 
> by some other thread, then release LC and block the current thread until 
> informed that the in-progress initialization has completed, at which time 
> repeat this step.
> Here's the classloading spec in java 8:
>  [https://docs.oracle.com/javase/specs/jls/se8/html/jls-12.html#jls-12.4.2] 
> Thread 1 is stuck in step 2 of the initialization (waiting for Thread 2 to 
> notify in Step 10).
>  Thread 2 is stuck in step 9 of the initialization and never makes it to step 
> 10. 
>  Precondition: this is the first object creation for that Thrift object and 
> at-least 2 threads are racing to create the object, one of which holds a lock 
> on FieldMetaData.class. 
> Please reach out if this is unclear. This hypothesis was validated by reading 
> through the native JVM code on the class load -> static_initialization path. 
> Here's the native JVM code if you're interested: 
> [InstanceKlass::initialize_impl|https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/f0eac5a3aff7d413e2ef0532d7464a58d7a15237/hotspot/src/share/vm/oops/instanceKlass.cpp#L833]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to