[ https://issues.apache.org/jira/browse/THRIFT-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366915#comment-17366915 ]
Divye Kapoor commented on THRIFT-5430: -------------------------------------- CI builds seem to be failing with an error (potentially unrelated): {noformat} 34: ((FLET SB-FASL::LOAD-STREAM :IN LOAD) #<SB-SYS:FD-STREAM for "file /thrift/src/test/cl/make-test-server.lisp" {100154E873}> NIL)1836035: (LOAD #<SB-SYS:FD-STREAM for "file /thrift/src/test/cl/make-test-server.lisp" {100154E873}> :VERBOSE NIL :PRINT NIL :IF-DOES-NOT-EXIST T :EXTERNAL-FORMAT :DEFAULT)1836136: ((FLET SB-IMPL::LOAD-SCRIPT :IN SB-IMPL::PROCESS-SCRIPT) #<SB-SYS:FD-STREAM for "file /thrift/src/test/cl/make-test-server.lisp" {100154E873}>)1836237: ((FLET SB-UNIX::BODY :IN SB-IMPL::PROCESS-SCRIPT))1836338: ((FLET "WITHOUT-INTERRUPTS-BODY-2" :IN SB-IMPL::PROCESS-SCRIPT))1836439: (SB-IMPL::PROCESS-SCRIPT "make-test-server.lisp")1836540: (SB-IMPL::TOPLEVEL-INIT)1836641: ((FLET SB-UNIX::BODY :IN SAVE-LISP-AND-DIE))1836742: ((FLET "WITHOUT-INTERRUPTS-BODY-7" :IN SAVE-LISP-AND-DIE))1836843: ((LABELS SB-IMPL::RESTART-LISP :IN SAVE-LISP-AND-DIE))1836918370unhandled condition in --disable-debugger mode, quitting18371Makefile:626: recipe for target 'TestServer' failed18372make[2]: *** [TestServer] Error 1 unhandled condition in --disable-debugger mode, quitting Makefile:626: recipe for target 'TestServer' failed make[2]: *** [TestServer] Error make[2]: Leaving directory '/thrift/src/test/cl' Makefile:919: recipe for target 'precross-cl' failed make[1]: *** [precross-cl] Error make[1]: Leaving directory '/thrift/src/test' Makefile:1122: recipe for target 'precross-test' failed make: *** [precross-test] Error make: *** Waiting for unfinished jobs.... {noformat} Can someone confirm that this is a CI related infra issue and not a PR issue? I found this (related) issue online: [https://bugzilla.redhat.com/show_bug.cgi?id=214568] > FieldMetaData synchronized method can trigger deadlock during static class > initialization in JVM native code > ------------------------------------------------------------------------------------------------------------ > > Key: THRIFT-5430 > URL: https://issues.apache.org/jira/browse/THRIFT-5430 > Project: Thrift > Issue Type: Bug > Components: Java - Library > Affects Versions: 0.6, 0.6.1, 0.7, 0.8, 0.9, 0.9.1, 0.9.2, 0.9.3, 0.9.3.1, > 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 0.14.1 > Environment: AdoptJDK 1.8 (Java 8) ; Linux, AWS machines. > Deadlock is environment independent and occurs in conformance with the JRE 8 > spec. > > Reporter: Divye Kapoor > Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > We (Pinterest) hit the following deadlock during JVM classloading of Thrift > classes. The root cause was triggering sClass.newInstance() while holding the > synchronized lock on FieldMetaData.class::getStructMetaDataMap(..) > Here's the stacktraces of interest: > Thread 1: > {code:java} > Stack Trace is: > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) // > stuck here for > 30 mins > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at java.lang.Class.newInstance(Class.java:442) // creating an object via > reflection. > at > org.apache.thrift.meta_data.FieldMetaData.getStructMetaDataMap(FieldMetaData.java:61)- > locked java.lang.Class@346242bb // Lock held on FieldMetaData.class > at > com.pinterest.commons.thrift.ThriftStructDescriptor.<init>(ThriftStructDescriptor.java:56) > at > com.pinterest.commons.thrift.ThriftStructDescriptor.getDescriptor(ThriftStructDescriptor.java:38) > {code} > Thread 2: > {code:java} > Stack Trace is: > at > org.apache.thrift.meta_data.FieldMetaData.addStructMetaDataMap(FieldMetaData.java:49) > - blocked on java.lang.Class@346242bb // Lock held by Thread 1. > at > com.pinterest.schemas.search.query.NavboostScore.<clinit>(NavboostScore.java:146) > at > com.pinterest.schemas.search.query.NavboostScores.read(NavboostScores.java:334) > at > com.pinterest.schemas.search.query.SupplementaryTerm.read(SupplementaryTerm.java:1209) > at > com.pinterest.schemas.search.query.SupplementaryTerms.read(SupplementaryTerms.java:335) > at com.pinterest.schemas.search.query.QueryJoin.read(QueryJoin.java:6514) > at > com.pinterest.pinpin.thrift.SerializedResourceContainer.read(SerializedResourceContainer.java:2867) > at org.apache.thrift.TDeserializer.deserialize(TDeserializer.java:69) // > General Thrift deserialize. {code} > Here's the code of interest: > [https://github.com/apache/thrift/blob/65fb49bb41f852375b278c9057d52c9472f0cb3a/lib/java/src/org/apache/thrift/meta_data/FieldMetaData.java#L61] > > Thread 1 has the following lock acquisition order: > {noformat} > FieldMetaData.class::getStructMetaDataMap > -> lock FieldMetaData.class (LOCK 1) > -> sClass.newInstance() > -> load sClass > -> JVM init_lock (native) lock and release. (LOCK 2) > -> waiting for Thread 2 to complete static initialization and notify > thread 1.{noformat} > Thread 2 has the following lock acquisition order: > {noformat} > sClass.newInstance() (from new ThriftObject() / deserialization) > -> load sClass > -> static initialization > -> try locking FieldMetaData.class when calling > FieldMetaData.addStructMetaDataMap (WAIT on LOCK 1 which is never released) > ---> does not ever trigger the next step: > ---> Notify Thread 1 that static initialization of the class has > completed (releasing LOCK 2 on Thread 1 which will eventually release LOCK 1 > by returning from native code).{noformat} > Internally, this was a fairly detailed investigation. Would be great if we > agree on a Fix approach. This deadlock affects all versions of Thrift since > 0.9 which introduced the synchronized keyword. Versions prior to 0.9 are > simply thread unsafe (because there's no lock on FieldMetaData's internal > HashMap and the two static method calls can race). I didn't check versions > below 0.5 but probably this extends all the way back. > This is not an issue in fbThrift, which correctly uses a ConcurrentHashMap > and does not take a lock on FieldMetaData. See this code here: > [https://github.com/facebook/fbthrift/blob/c0f8ffb345d847df512ae9c404a58379fcd51cb9/thrift/lib/java/src/main/java/com/facebook/thrift/meta_data/FieldMetaData.java] > vs the buggy code here: > [https://github.com/apache/thrift/blob/v0.14.1/lib/java/src/org/apache/thrift/meta_data/FieldMetaData.java] > > Another alternative approach is to use > {code:java} > synchronized (structMap} { > ... > } {code} > instead of the synchronized keyword on the methods. > Please confirm which approach is preferred and we can send a PR on Github. > > Here's the original deadlock hypothesis as outlined by Jiahuan Liu: > The hypothesis of the deadlock: > ThriftStructDescriptor.get is used in many places for deserialization and it > calls FieldMetaData.getStructMetaDataMap. inside this method, it tries to > load the thrift class if the thrift has not been loaded yet. thrift classes > call FieldMetaData.addStructMetaDataMap in a static block. so the deadlock > may happen when: > * thread A is trying to deserialize -> getStructMetaDataMap -> class loading > * if thread B is already trying to load the same thrift, it requires > addStructMetaDataMap to complete but this method is blocked by thread A > getStructMetaDataMap > * the class loading in thread A will be blocked by thread B because class > loading is also synchronized by java. see this section in java spec: > If the Class object for C indicates that initialization is in progress for C > by some other thread, then release LC and block the current thread until > informed that the in-progress initialization has completed, at which time > repeat this step. > Here's the classloading spec in java 8: > [https://docs.oracle.com/javase/specs/jls/se8/html/jls-12.html#jls-12.4.2] > Thread 1 is stuck in step 2 of the initialization (waiting for Thread 2 to > notify in Step 10). > Thread 2 is stuck in step 9 of the initialization and never makes it to step > 10. > Precondition: this is the first object creation for that Thrift object and > at-least 2 threads are racing to create the object, one of which holds a lock > on FieldMetaData.class. > Please reach out if this is unclear. This hypothesis was validated by reading > through the native JVM code on the class load -> static_initialization path. > Here's the native JVM code if you're interested: > [InstanceKlass::initialize_impl|https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/f0eac5a3aff7d413e2ef0532d7464a58d7a15237/hotspot/src/share/vm/oops/instanceKlass.cpp#L833] -- This message was sent by Atlassian Jira (v8.3.4#803005)