[ 
https://issues.apache.org/jira/browse/SPARK-30916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingda Jia updated SPARK-30916:
-------------------------------
    Attachment: image-2020-02-21-16-30-35-652.png

> Dead Lock when Loading BLAS Class
> ---------------------------------
>
>                 Key: SPARK-30916
>                 URL: https://issues.apache.org/jira/browse/SPARK-30916
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.3.0, 2.4.5
>            Reporter: Mingda Jia
>            Priority: Major
>         Attachments: image-2020-02-21-16-30-35-652.png, 
> image-2020-02-21-16-30-45-553.png
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When using operations such as aggregate and treeAggregate, if the seqOp and 
> combOp call BLAS operations of different levels (one level 1, the other 
> level 2), a JVM-internal deadlock can occur that is hard to detect.
>  
> Say the seqOp runs gemv, which is a level 2 BLAS operation, and the combOp 
> runs axpy, which is a level 1 BLAS operation. When a task running the seqOp 
> overlaps with another task running the combOp, the two task threads get 
> stuck. The call stacks look like this:
> !image-2020-02-21-15-52-49-846.png!
> !image-2020-02-21-15-53-22-870.png!
> Both threads are reported as runnable, but neither is actually making 
> progress.
>  
> When gemv is called and no BLAS instance exists yet, it calls the 
> getInstance method to obtain one. The first thread to enter runs the static 
> code block of BLAS.scala, which tries to load a subclass of BLAS and 
> instantiate it via reflection.
> !image-2020-02-21-16-00-40-552.png!
>  
> When axpy is called and no BLAS instance exists yet, it creates a new 
> F2jBLAS instance directly, because axpy is a level 1 BLAS operation.
> !image-2020-02-21-16-02-25-136.png!
>  
> The problem is that the classes NativeSystemBLAS, NativeRefBLAS and F2jBLAS, 
> which BLAS tries to load in its static code block, are all either subclasses 
> of F2jBLAS or F2jBLAS itself. The class loading sequence in the static code 
> block of BLAS is as follows:
>  # tries loading class BLAS -> lock the class BLAS
>  # tries loading class NativeSystemBLAS in the static code block -> lock the 
> class NativeSystemBLAS
>  # recursively load F2jBLAS because it's the parent class of NativeSystemBLAS 
> -> lock the class F2jBLAS
>  # ......
> Simultaneously, the sequence for creating a new F2jBLAS in the axpy 
> operation is as follows:
>  # tries loading class F2jBLAS -> lock the class F2jBLAS
>  # recursively load BLAS because it's the parent class of F2jBLAS -> lock the 
> class BLAS
>  # ......
> When the task thread running gemv has just finished its second step above, 
> and the task thread running axpy has just finished its first step above, the 
> gemv thread needs to load class F2jBLAS, which is locked by the axpy thread, 
> and the axpy thread needs to load class BLAS, which is locked by the gemv 
> thread. The result is a deadlock.
>  
> A demo that can reproduce the problem:
> {code:java}
> class Demo {
>     public static void main(String[] args) {
>         Thread t1 = new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 BLAS blas = BLAS.getInstance();
>                 blas.print();
>             }
>         });
>         Thread t2 = new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 BLAS blas = new F2jBLAS();
>                 blas.print();
>             }
>         });
>         t1.setName("native");
>         t2.setName("f2j");
>         t1.start();
>         t2.start();
>     }
> }
> abstract class BLAS {
>     public static BLAS instance;
>     abstract public void print();
>     public static BLAS getInstance() {
>         return instance;
>     }
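>     // Mimics the reflective subclass lookup done in the static code block.
>     private static BLAS load() throws Exception {
>         Class<?> klass = Class.forName("NativeSystemBlas");
>         return (BLAS) klass.getDeclaredConstructor().newInstance();
>     }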
>     static {
>         System.out.println("Entered static code block" );
>         try {
>             instance = load();
>         } catch (Exception e) {
>             System.out.println("error");
>         }
>     }
> }
> class F2jBLAS extends BLAS{
>     @Override
>     public void print() {
>         System.out.println("print F2j");
>     }
> }
> class NativeSystemBlas extends F2jBLAS {
>     @Override
>     public void print(){
>         System.out.println("print NativeBlas");
>     }
> }
> {code}
> If the BLAS operations in Spark MLlib did not create F2jBLAS directly for 
> level 1 operations, but obtained it through the same instantiation path as 
> the native BLAS, this problem would not occur.
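
The suggestion in the last paragraph of the description can be sketched as a variation of the demo. This is only an illustration under stated assumptions, not the actual Spark or netlib-java code: the class names SafeBLAS, SafeF2jBLAS, SafeNativeBLAS and SafeDemo, the accessors nativeInstance() and f2jInstance(), and the timeout parameter are all hypothetical. The idea is that both the level 1 and the level 2 path obtain their instance through the base class, so every thread enters through the base class initializer first and the locks are always taken in the same order.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins for the demo classes; the names are illustrative only.
abstract class SafeBLAS {
    private static final SafeBLAS NATIVE;
    private static final SafeBLAS F2J;

    static {
        // Both instances are created here, so every caller goes through the
        // initialization of SafeBLAS before touching any subclass.
        // (The original demo uses reflection; plain `new` keeps the sketch short.)
        NATIVE = new SafeNativeBLAS();
        F2J = new SafeF2jBLAS();
    }

    abstract String name();

    // Level 2 style access path (gemv in the description).
    static SafeBLAS nativeInstance() { return NATIVE; }

    // Level 1 style access path (axpy in the description); no direct `new` by callers.
    static SafeBLAS f2jInstance() { return F2J; }
}

class SafeF2jBLAS extends SafeBLAS {
    @Override String name() { return "f2j"; }
}

class SafeNativeBLAS extends SafeF2jBLAS {
    @Override String name() { return "native"; }
}

class SafeDemo {
    // Releases two threads at the same moment, one per access path, and
    // reports whether both finished within the given timeout.
    static boolean runBoth(long timeoutMillis) {
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(2);
        Thread t1 = new Thread(() -> {
            await(start);
            SafeBLAS.nativeInstance().name();
            done.countDown();
        }, "native");
        Thread t2 = new Thread(() -> {
            await(start);
            SafeBLAS.f2jInstance().name();
            done.countDown();
        }, "f2j");
        // Daemon threads so a stuck thread cannot keep the JVM alive.
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        start.countDown();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(runBoth(5000) ? "both threads finished" : "threads appear stuck");
    }
}
```

The sketch only helps as long as no caller constructs SafeF2jBLAS directly. A direct `new SafeF2jBLAS()` from another thread would reintroduce the opposite lock order described in the issue.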



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
