[ https://issues.apache.org/jira/browse/SPARK-30916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingda Jia updated SPARK-30916:
-------------------------------
    Attachment: image-2020-02-21-16-30-35-652.png

> Dead Lock when Loading BLAS Class
> ---------------------------------
>
>                 Key: SPARK-30916
>                 URL: https://issues.apache.org/jira/browse/SPARK-30916
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.3.0, 2.4.5
>            Reporter: Mingda Jia
>            Priority: Major
>         Attachments: image-2020-02-21-16-30-35-652.png,
> image-2020-02-21-16-30-45-553.png
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When operations such as aggregate and treeAggregate are used, and the seqOp
> and combOp call BLAS operations of different levels (one level 1, the other
> level 2), a JVM-internal deadlock can occur that is hard to detect.
>
> Say the seqOp runs gemv, which is a level 2 BLAS operation, and the combOp
> runs axpy, which is a level 1 BLAS operation. When a task running the seqOp
> overlaps with another task running the combOp, the two task threads get
> stuck. The call stacks look like this:
> !image-2020-02-21-15-52-49-846.png!
> !image-2020-02-21-15-53-22-870.png!
> The thread states are all RUNNABLE, but the threads are not actually making
> progress.
>
> When gemv is called and there is no existing BLAS instance, it calls the
> getInstance method to get one. The first thread to enter runs the static
> code block of BLAS.scala, which tries to load a subclass of BLAS and
> instantiate it via reflection.
> !image-2020-02-21-16-00-40-552.png!
>
> When axpy is called and there is no existing BLAS instance, it creates a
> new F2jBLAS instance directly, because axpy is a level 1 BLAS operation.
> !image-2020-02-21-16-02-25-136.png!
>
> The problem is that the classes NativeSystemBLAS, NativeRefBLAS and F2jBLAS,
> which BLAS tries to load in the static code block, are all either subclasses
> of F2jBLAS or F2jBLAS itself.
> The class loading sequence in the static code block of BLAS is:
> # tries to load class BLAS -> locks class BLAS
> # tries to load class NativeSystemBLAS in the static code block -> locks
> class NativeSystemBLAS
> # recursively loads F2jBLAS because it is the parent class of
> NativeSystemBLAS -> locks class F2jBLAS
> # ......
> Simultaneously, the sequence for creating a new F2jBLAS in the axpy
> operation is:
> # tries to load class F2jBLAS -> locks class F2jBLAS
> # recursively loads BLAS because it is the parent class of F2jBLAS -> locks
> class BLAS
> # ......
> When the task thread running gemv has just finished its second step above,
> and the task thread running axpy has just finished its first step above, the
> gemv thread wants to load class F2jBLAS, which is locked by the axpy thread,
> and the axpy thread wants to load class BLAS, which is locked by the gemv
> thread. The result is a deadlock.
>
> A demo that can reproduce the problem:
> {code:java}
> class Demo {
>     public static void main(String[] args) {
>         // Thread "native" takes the getInstance path (as gemv does).
>         Thread t1 = new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 BLAS blas = BLAS.getInstance();
>                 blas.print();
>             }
>         });
>         // Thread "f2j" instantiates F2jBLAS directly (as axpy does).
>         Thread t2 = new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 BLAS blas = new F2jBLAS();
>                 blas.print();
>             }
>         });
>         t1.setName("native");
>         t2.setName("f2j");
>         t1.start();
>         t2.start();
>     }
> }
> abstract class BLAS {
>     public static BLAS instance;
>     public abstract void print();
>     public static BLAS getInstance() {
>         return instance;
>     }
>     // Loads a subclass reflectively while BLAS itself is still initializing.
>     private static BLAS load() throws Exception {
>         Class<?> klass = Class.forName("NativeSystemBlas");
>         return (BLAS) klass.getDeclaredConstructor().newInstance();
>     }
>     static {
>         System.out.println("Entered static code block");
>         try {
>             instance = load();
>         } catch (Exception e) {
>             System.out.println("error");
>         }
>     }
> }
> class F2jBLAS extends BLAS {
>     @Override
>     public void print() {
>         System.out.println("print F2j");
>     }
> }
> class NativeSystemBlas extends F2jBLAS {
>     @Override
>     public void print() {
>         System.out.println("print NativeBlas");
>     }
> }
> {code}
> If BLAS operations in Spark MLlib did not use F2jBLAS for level 1 operations
> but used the same instantiation path as nativeBLAS, this problem would not
> occur.
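One possible mitigation can be sketched as follows. It assumes the deadlock comes only from two threads initializing the class hierarchy in opposite orders, so initializing the base class once on a single thread before any concurrent use should remove the conflicting lock order. The names WarmUp, Blas, F2j and NativeSystem are hypothetical stand-ins for the demo classes above, not Spark or netlib-java API, and this is not necessarily the fix the project would adopt.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class WarmUp {
    // Hypothetical stand-in for BLAS: its static initializer loads a subclass.
    static abstract class Blas {
        private static final Blas INSTANCE = load();

        abstract String name();

        static Blas getInstance() {
            return INSTANCE;
        }

        private static Blas load() {
            try {
                // The class literal alone does not initialize NativeSystem;
                // Class.forName does, which in turn initializes F2j.
                return (Blas) Class.forName(NativeSystem.class.getName())
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }
    }

    // Hypothetical stand-ins for F2jBLAS and NativeSystemBLAS.
    static class F2j extends Blas {
        @Override
        String name() { return "f2j"; }
    }

    static class NativeSystem extends F2j {
        @Override
        String name() { return "native"; }
    }

    static List<String> run() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Warm-up: initialize the base class (and, through its static
            // initializer, both subclasses) on the calling thread only.
            Class.forName(Blas.class.getName());

            // After warm-up, both code paths can run concurrently.
            Callable<String> nativePath = () -> Blas.getInstance().name();
            Callable<String> f2jPath = () -> new F2j().name();
            Future<String> a = pool.submit(nativePath);
            Future<String> b = pool.submit(f2jPath);
            return Arrays.asList(a.get(), b.get());
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because a single thread may re-enter the initialization of a class it is already initializing, the warm-up call completes on its own, and the two worker threads then find every class already initialized. In Spark the equivalent warm-up would have to run once per executor JVM before tasks start, which this sketch does not address.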