[
https://issues.apache.org/jira/browse/SPARK-30916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mingda Jia updated SPARK-30916:
-------------------------------
Attachment: image-2020-02-21-16-31-19-880.png
> Dead Lock when Loading BLAS Class
> ---------------------------------
>
> Key: SPARK-30916
> URL: https://issues.apache.org/jira/browse/SPARK-30916
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.3.0, 2.4.5
> Reporter: Mingda Jia
> Priority: Major
> Attachments: image-2020-02-21-16-30-35-652.png,
> image-2020-02-21-16-30-45-553.png, image-2020-02-21-16-31-19-880.png,
> image-2020-02-21-16-31-34-274.png
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> When using operations such as aggregate and treeAggregate, if the seqOp and
> the combOp use BLAS operations of different levels (one level 1, the other
> level 2), a JVM-internal deadlock can occur that is hard to detect.
>
> Say the seqOp runs gemv, which is a level 2 BLAS operation, and the combOp
> runs axpy, which is a level 1 BLAS operation. When a task running the seqOp
> executes at the same time as another task running the combOp, the two task
> threads get stuck. The call stacks look like this:
> !image-2020-02-21-15-52-49-846.png!
> !image-2020-02-21-15-53-22-870.png!
> All the threads are reported as runnable, but none of them is actually
> making progress.
>
> When gemv is called and no BLAS instance exists yet, it calls the
> getInstance method to obtain one. The first thread to enter runs the static
> code block of BLAS.scala, which tries to load a subclass of BLAS and
> instantiate it via reflection.
> !image-2020-02-21-16-00-40-552.png!
>
> When axpy is called and no BLAS instance exists yet, it creates a new
> F2jBLAS instance directly, because axpy is a level 1 BLAS operation.
> !image-2020-02-21-16-02-25-136.png!
>
> The problem is that the classes NativeSystemBLAS, NativeRefBLAS and F2jBLAS,
> which BLAS tries to load in its static code block, are all either subclasses
> of F2jBLAS or F2jBLAS itself. The class-loading sequence in the static code
> block of BLAS is:
> # tries to load class BLAS -> locks the class BLAS
> # tries to load class NativeSystemBLAS in the static code block -> locks the
> class NativeSystemBLAS
> # recursively loads F2jBLAS because it is the parent class of
> NativeSystemBLAS -> locks the class F2jBLAS
> # ......
> Simultaneously, the sequence for instantiating F2jBLAS in the axpy operation
> is:
> # tries to load class F2jBLAS -> locks the class F2jBLAS
> # recursively loads BLAS because it is the parent class of F2jBLAS -> locks
> the class BLAS
> # ......
> When the task thread running gemv has just finished the second step of its
> sequence and the task thread running axpy has just finished the first step
> of its sequence, the gemv thread needs to load class F2jBLAS, which is
> locked by the axpy thread, while the axpy thread needs to load class BLAS,
> which is locked by the gemv thread. The result is a deadlock.
>
> The following demo can reproduce the problem:
> {code:java}
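> // Starts two threads that trigger class initialization in opposite orders.
> class Demo {
>     public static void main(String[] args) {
>         // Mirrors gemv: goes through BLAS.getInstance(), so BLAS is
>         // initialized first and its static block then loads the subclass.
>         Thread t1 = new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 BLAS blas = BLAS.getInstance();
>                 blas.print();
>             }
>         });
>         // Mirrors axpy: instantiates F2jBLAS directly, so F2jBLAS is
>         // initialized first and then needs its parent class BLAS.
>         Thread t2 = new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 BLAS blas = new F2jBLAS();
>                 blas.print();
>             }
>         });
>         t1.setName("native");
>         t2.setName("f2j");
>         t1.start();
>         t2.start();
>     }
> }
>
> abstract class BLAS {
>     public static BLAS instance;
>
>     public abstract void print();
>
>     public static BLAS getInstance() {
>         return instance;
>     }
>
>     // Loads and instantiates the subclass reflectively. Initializing the
>     // subclass also requires initializing its parent class F2jBLAS.
>     private static BLAS load() throws Exception {
>         Class<?> klass = Class.forName("NativeSystemBlas");
>         return (BLAS) klass.getDeclaredConstructor().newInstance();
>     }
>
>     // Runs while the initialization of BLAS is still in progress.
>     static {
>         System.out.println("Entered static code block");
>         try {
>             instance = load();
>         } catch (Exception e) {
>             System.out.println("error");
>         }
>     }
> }
>
> class F2jBLAS extends BLAS {
>     @Override
>     public void print() {
>         System.out.println("print F2j");
>     }
> }
>
> class NativeSystemBlas extends F2jBLAS {
>     @Override
>     public void print() {
>         System.out.println("print NativeBlas");
>     }
> }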
> {code}
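Whether the demo above hangs depends on how the two threads happen to be scheduled. The following is a minimal sketch of a variant that forces the problematic interleaving and uses a timeout to notice the hang. All names here (ReproBase, ReproImpl, ReproNativeImpl, ClassInitDeadlockRepro, baseInitStarted) are hypothetical stand-ins, not Spark or netlib classes, and it uses a plain `new` in the static block instead of reflection, on the assumption that both trigger class initialization in the same way.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for BLAS.
abstract class ReproBase {
    static ReproBase instance;

    static {
        // Signal that initialization of this class is now in progress.
        ClassInitDeadlockRepro.baseInitStarted.countDown();
        try {
            // Stay inside the initializer long enough for the other thread
            // to begin initializing ReproImpl.
            Thread.sleep(300);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Initializing ReproNativeImpl first requires initializing ReproImpl.
        instance = new ReproNativeImpl();
    }

    static ReproBase getInstance() {
        return instance;
    }
}

// Hypothetical stand-in for F2jBLAS.
class ReproImpl extends ReproBase { }

// Hypothetical stand-in for NativeSystemBLAS.
class ReproNativeImpl extends ReproImpl { }

class ClassInitDeadlockRepro {
    static final CountDownLatch baseInitStarted = new CountDownLatch(1);

    /** Returns true if both threads are still alive after the timeouts. */
    static boolean reproduce(long timeoutMillis) {
        // Path 1: initializes ReproBase first (like the getInstance path).
        Thread nativePath = new Thread(() -> ReproBase.getInstance(), "native");
        // Path 2: waits until path 1 is inside the static block, then
        // initializes ReproImpl first (like the direct instantiation path).
        Thread f2jPath = new Thread(() -> {
            try {
                baseInitStarted.await();
                new ReproImpl();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "f2j");
        // Daemon threads, so the JVM can still exit while they are stuck.
        nativePath.setDaemon(true);
        f2jPath.setDaemon(true);
        nativePath.start();
        f2jPath.start();
        joinQuietly(nativePath, timeoutMillis);
        joinQuietly(f2jPath, timeoutMillis);
        report(nativePath);
        report(f2jPath);
        return nativePath.isAlive() && f2jPath.isAlive();
    }

    private static void joinQuietly(Thread t, long millis) {
        try {
            t.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void report(Thread t) {
        System.out.println(t.getName() + " state: " + t.getState()
                + ", alive: " + t.isAlive());
    }

    public static void main(String[] args) {
        System.out.println("both threads stuck: " + reproduce(1000));
    }
}
```

The printed thread states can be compared with the observation in the report that the stuck threads appear runnable. The sketch only checks liveness after a timeout, which is a heuristic and not proof of a deadlock.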
> If the BLAS operations in Spark MLlib did not use F2jBLAS directly for level
> 1 operations but obtained the instance the same way as nativeBLAS, this
> problem would not occur.
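One way to read the suggestion above is that every code path should touch the base class first, so class initialization always happens in the same order. Below is a minimal sketch of that idea. The names (SafeBase, SafeImpl, SafeNativeImpl, ConsistentInitOrder, f2jInstance) are hypothetical and this is not the actual Spark or netlib code; the sleep in the static block is only there to widen the timing window.

```java
// Hypothetical stand-in for BLAS.
abstract class SafeBase {
    static SafeBase instance;

    static {
        try {
            // Widen the window in which another thread could interfere.
            Thread.sleep(300);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Initializes SafeNativeImpl and, before it, its parent SafeImpl.
        instance = new SafeNativeImpl();
    }

    static SafeBase getInstance() {
        return instance;
    }

    abstract String name();
}

// Hypothetical stand-in for F2jBLAS.
class SafeImpl extends SafeBase {
    @Override
    String name() {
        return "f2j";
    }
}

// Hypothetical stand-in for NativeSystemBLAS.
class SafeNativeImpl extends SafeImpl {
    @Override
    String name() {
        return "native";
    }
}

class ConsistentInitOrder {
    // Hypothetical helper for level 1 operations: it calls getInstance()
    // first so that SafeBase is fully initialized before SafeImpl is touched.
    static SafeBase f2jInstance() {
        SafeBase.getInstance();
        return new SafeImpl();
    }

    /** Returns true if both threads finish within the timeouts. */
    static boolean bothFinish(long timeoutMillis) {
        Thread nativePath = new Thread(
                () -> System.out.println(SafeBase.getInstance().name()), "native");
        Thread f2jPath = new Thread(
                () -> System.out.println(f2jInstance().name()), "f2j");
        nativePath.setDaemon(true);
        f2jPath.setDaemon(true);
        nativePath.start();
        f2jPath.start();
        try {
            nativePath.join(timeoutMillis);
            f2jPath.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !nativePath.isAlive() && !f2jPath.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both threads finished: " + bothFinish(5000));
    }
}
```

Because both paths start by initializing SafeBase, no thread holds the initialization of SafeImpl while waiting for SafeBase, so the circular wait described in the report cannot form in this sketch.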