barrier execution mode with DataFrame and dynamic allocation

Ilya Matiach Wed, 19 Dec 2018 07:34:34 -0800

[Note: I sent this earlier but it looks like the email was blocked because I 
had another email group on the CC line]
Hi Spark Dev,
I would like to use the new barrier execution mode introduced in spark 
2.4<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fspark.apache.org%2Freleases%2Fspark-release-2-4-0.html&data=02%7C01%7Cilmat%40microsoft.com%7Cd3eb38abd45848607bcd08d66208deff%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636804187576103052&sdata=4zvcTOmKHOsO%2FD46r0eDzk4hikQpU5a3Mvwzmj2ZY7o%3D&reserved=0>
 with 
LightGBM<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAzure%2Fmmlspark%2Fblob%2Fmaster%2Fsrc%2Flightgbm%2Fsrc%2Fmain%2Fscala%2FLightGBMClassifier.scala&data=02%7C01%7Cilmat%40microsoft.com%7Cd3eb38abd45848607bcd08d66208deff%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636804187576133081&sdata=1Uj1ViEy9mR9IYEhUEXPXVUwpLePvcluxagqKEE9xAE%3D&reserved=0>
 in the spark package 
mmlspark<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAzure%2Fmmlspark&data=02%7C01%7Cilmat%40microsoft.com%7Cd3eb38abd45848607bcd08d66208deff%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636804187576133081&sdata=hfk6hD3z8D68vW%2FEQOwgMFGIvX6%2FvHMOaMnoGAIFjag%3D&reserved=0>
 but I ran into some issues and I had a couple questions.
Currently, the LightGBM distributed learner tries to figure out the number of 
cores on the cluster and then does a coalesce and a mapPartitions, and inside 
the mapPartitions we do a 
NetworkInit<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FMicrosoft%2FLightGBM%2Fblob%2Fmaster%2Finclude%2FLightGBM%2Fc_api.h%23L797&data=02%7C01%7Cilmat%40microsoft.com%7Cd3eb38abd45848607bcd08d66208deff%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636804187576153094&sdata=oMP3roDrvEVLNea7Jt22YsY%2BfcBCHBPiEFo1OmAEdfM%3D&reserved=0>
 (where the address:port of all workers needs to be passed in the constructor) 
and pass the data in-memory to the native layer of the distributed lightgbm 
learner.


With barrier execution mode, I think the code would become much more robust.  
However, there are several issues that I am running into when trying to move my 
code over to the new barrier execution mode scheduler:

  1.  Does not support dynamic allocation - however, I think it would be 
convenient if it restarted the job when the number of workers has decreased and 
allowed the dev to decide whether to restart the job if the number of workers 
increased
  2.  Does not work with DataFrame or Dataset API, but I think it would be much 
more convenient if it did
  3.  How does barrier execution mode deal with #partitions > #tasks?  If the 
number of partitions is larger than the number of "tasks" or workers, can 
barrier execution mode automatically coalesce the dataset to have # partitions 
== # tasks?
  4.  It would be convenient to be able to get network information about all 
other workers in the cluster that are in the same barrier execution, eg the 
host address and some task # or identifier of all workers

I would love to hear more about this new feature - also I had trouble finding 
documentation (JIRA: 
https://issues.apache.org/jira/browse/SPARK-24374<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FSPARK-24374&data=02%7C01%7Cilmat%40microsoft.com%7Cd3eb38abd45848607bcd08d66208deff%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636804187576153094&sdata=NKtburcR5Td5IKc0DTwyn%2FfR%2BVmPGQk3sVlPg4K7GDY%3D&reserved=0>,
 High level design: 
https://www.slideshare.net/hadoop/the-zoo-expands?qid=b2efbd75-97af-4f71-9add-abf84970eaef&v=&b=&from_search=1<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.slideshare.net%2Fhadoop%2Fthe-zoo-expands%3Fqid%3Db2efbd75-97af-4f71-9add-abf84970eaef%26v%3D%26b%3D%26from_search%3D1&data=02%7C01%7Cilmat%40microsoft.com%7Cd3eb38abd45848607bcd08d66208deff%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636804187576173110&sdata=5cGR5LAQ0weXy6n%2FBC5xIQuUL47Gv9AhtLFyX2eohfg%3D&reserved=0>),
 are there any good examples of spark packages that have moved to use the new 
barrier execution mode in spark 2.4?

Thank you, Ilya

barrier execution mode with DataFrame and dynamic allocation

Reply via email to