Ilya Matiach created SPARK-26498:
------------------------------------

             Summary: Integrate barrier execution with MMLSpark's LightGBM
                 Key: SPARK-26498
                 URL: https://issues.apache.org/jira/browse/SPARK-26498
             Project: Spark
          Issue Type: New Feature
          Components: ML, MLlib
    Affects Versions: 2.4.0
            Reporter: Ilya Matiach


I would like to use the new barrier execution mode introduced in spark 2.4 with 
LightGBM in the spark package mmlspark but I ran into some issues.

Currently, the LightGBM distributed learner tries to figure out the number of 
cores on the cluster and then does a coalesce and a mapPartitions, and inside 
the mapPartitions we do a NetworkInit (where the address:port of all workers 
needs to be passed in the constructor) and pass the data in-memory to the 
native layer of the distributed lightgbm learner.

With barrier execution mode, I think the code would become much more robust.  
However, there are several issues that I am running into when trying to move my 
code over to the new barrier execution mode scheduler:

Does not support dynamic allocation – however, I think it would be convenient 
if it restarted the job when the number of workers has decreased and allowed 
the dev to decide whether to restart the job if the number of workers increased

Does not work with DataFrame or Dataset API, but I think it would be much more 
convenient if it did.

How does barrier execution mode deal with #partitions > #tasks?  If the number 
of partitions is larger than the number of “tasks” or workers, can barrier 
execution mode automatically coalesce the dataset to have # partitions == # 
tasks?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to