[jira] [Updated] (SPARK-44264) DeepSpeed Distributor

2023-07-18 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-44264:
-
Attachment: Trying to Run Deepspeed Funcs.html

> DeepSpeed Distributor
> -
>
> Key: SPARK-44264
> URL: https://issues.apache.org/jira/browse/SPARK-44264
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.4.1
>Reporter: Lu Wang
>Priority: Critical
> Fix For: 3.5.0
>
> Attachments: Trying to Run Deepspeed Funcs.html
>
>
> Make it easier for PySpark users to run distributed training and inference 
> with DeepSpeed on Spark clusters. This was a project determined by the 
> Databricks ML Training Team.
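
For illustration only: a minimal usage sketch of the distributor this ticket 
tracks. The module path, class name, and constructor parameters below are 
assumptions modeled on the TorchDistributor-style API, not confirmed by this 
ticket.

{code:python}
# Hypothetical usage sketch; names and parameters are assumptions.
from pyspark.ml.deepspeed.deepspeed_distributor import DeepspeedTorchDistributor

def train_fn(learning_rate):
    # User-defined DeepSpeed training loop would go here; the rank-0 return
    # value is handed back to the driver.
    return learning_rate

distributor = DeepspeedTorchDistributor(
    numGpus=2,                          # assumed: GPUs per node
    nnodes=2,                           # assumed: number of nodes
    localMode=False,                    # assumed: run on executors, not driver
    deepspeedConfig="ds_config.json",   # hypothetical DeepSpeed config file
)
output = distributor.run(train_fn, 1e-3)
{code}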



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42103) Add Instrumentation

2023-01-24 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani resolved SPARK-42103.
--
Resolution: Not A Problem

> Add Instrumentation
> ---
>
> Key: SPARK-42103
> URL: https://issues.apache.org/jira/browse/SPARK-42103
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Adding instrumentation



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41590) Implement Baseline API Code

2023-01-24 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani resolved SPARK-41590.
--
Resolution: Fixed

> Implement Baseline API Code
> ---
>
> Key: SPARK-41590
> URL: https://issues.apache.org/jira/browse/SPARK-41590
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Creating a baseline API so that we can agree on how the users will interact 
> with the code. This was determined in this [Design 
> Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
>  and can be updated as necessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41916) Address General Fixes

2023-01-20 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41916:
-
Description: 
We want the distributor to have the ability to run multiple torchrun processes 
per task if task.gpu.amount > 1.

We want to add a check that `import torch` succeeds, since the 
TorchDistributor requires torch. If the import raises an ImportError, we will 
give the user a more detailed error message. 

  was:We want the distributor to have the ability to run multiple torchrun 
processes per task if task.gpu.amount > 1.


> Address General Fixes
> -
>
> Key: SPARK-41916
> URL: https://issues.apache.org/jira/browse/SPARK-41916
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> We want the distributor to have the ability to run multiple torchrun 
> processes per task if task.gpu.amount > 1.
> We want to add a check that `import torch` succeeds, since the 
> TorchDistributor requires torch. If the import raises an ImportError, we 
> will give the user a more detailed error message. 
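
A minimal sketch of such a guard, assuming a hypothetical helper name rather 
than the actual TorchDistributor code:

{code:python}
def _require_torch() -> None:
    # Hypothetical helper: fail fast with actionable details if torch is missing.
    try:
        import torch  # noqa: F401
    except ImportError as e:
        raise ImportError(
            "TorchDistributor requires PyTorch. "
            "Install it with `pip install torch` on every node."
        ) from e
{code}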



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41916) Address General Fizes

2023-01-20 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41916:
-
Summary: Address General Fizes  (was: Address 
`spark.task.resource.gpu.amount > 1`)

> Address General Fizes
> -
>
> Key: SPARK-41916
> URL: https://issues.apache.org/jira/browse/SPARK-41916
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> We want the distributor to have the ability to run multiple torchrun 
> processes per task if task.gpu.amount > 1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41776) Implement support for PyTorch Lightning

2023-01-20 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani resolved SPARK-41776.
--
Resolution: Fixed

Not needed, since we are now using `torch.distributed.run`

> Implement support for PyTorch Lightning
> ---
>
> Key: SPARK-41776
> URL: https://issues.apache.org/jira/browse/SPARK-41776
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This requires us to just call train() on each spark task separately without 
> much preprocessing or postprocessing because PyTorch Lightning handles that 
> by itself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41916) Address `spark.task.resource.gpu.amount > 1`

2023-01-20 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41916:
-
Description: We want the distributor to have the ability to run multiple 
torchrun processes per task if task.gpu.amount > 1.  (was: We want the 
distributor to have the ability to run multiple torchrun processes per task if 
task.gpu.amount > 1 + address formatting comments on 
https://github.com/apache/spark/pull/39188#discussion_r1068903058)

> Address `spark.task.resource.gpu.amount > 1`
> 
>
> Key: SPARK-41916
> URL: https://issues.apache.org/jira/browse/SPARK-41916
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> We want the distributor to have the ability to run multiple torchrun 
> processes per task if task.gpu.amount > 1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41916) Address General Fixes

2023-01-20 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41916:
-
Summary: Address General Fixes  (was: Address General Fizes)

> Address General Fixes
> -
>
> Key: SPARK-41916
> URL: https://issues.apache.org/jira/browse/SPARK-41916
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> We want the distributor to have the ability to run multiple torchrun 
> processes per task if task.gpu.amount > 1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41916) Address `spark.task.resource.gpu.amount > 1`

2023-01-18 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41916:
-
Description: We want the distributor to have the ability to run multiple 
torchrun processes per task if task.gpu.amount > 1 + address formatting 
comments on https://github.com/apache/spark/pull/39188#discussion_r1068903058  
(was: We want the distributor to have the ability to run multiple torchrun 
processes per task if task.gpu.amount > 1.)

> Address `spark.task.resource.gpu.amount > 1`
> 
>
> Key: SPARK-41916
> URL: https://issues.apache.org/jira/browse/SPARK-41916
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> We want the distributor to have the ability to run multiple torchrun 
> processes per task if task.gpu.amount > 1 + address formatting comments on 
> https://github.com/apache/spark/pull/39188#discussion_r1068903058



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41776) Implement support for PyTorch Lightning

2023-01-17 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41776:
-
Description: 
This requires us to just call train() on each spark task separately without 
much preprocessing or postprocessing because PyTorch Lightning handles that by 
itself.

 

Update: This was resolved by using `torch.distributed.run`

  was:This requires us to just call train() on each spark task separately 
without much preprocessing or postprocessing because PyTorch Lightning handles 
that by itself.


> Implement support for PyTorch Lightning
> ---
>
> Key: SPARK-41776
> URL: https://issues.apache.org/jira/browse/SPARK-41776
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This requires us to just call train() on each spark task separately without 
> much preprocessing or postprocessing because PyTorch Lightning handles that 
> by itself.
>  
> Update: This was resolved by using `torch.distributed.run`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41776) Implement support for PyTorch Lightning

2023-01-17 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41776:
-
Description: This requires us to just call train() on each spark task 
separately without much preprocessing or postprocessing because PyTorch 
Lightning handles that by itself.  (was: This requires us to just call train() 
on each spark task separately without much preprocessing or postprocessing 
because PyTorch Lightning handles that by itself.

 

Update: This was resolved by using `torch.distributed.run`)

> Implement support for PyTorch Lightning
> ---
>
> Key: SPARK-41776
> URL: https://issues.apache.org/jira/browse/SPARK-41776
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This requires us to just call train() on each spark task separately without 
> much preprocessing or postprocessing because PyTorch Lightning handles that 
> by itself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41915) Change API so that the user doesn't have to explicitly set pytorch-lightning

2023-01-17 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani resolved SPARK-41915.
--
Resolution: Fixed

This is already resolved within 
https://issues.apache.org/jira/browse/SPARK-41590.

> Change API so that the user doesn't have to explicitly set pytorch-lightning
> 
>
> Key: SPARK-41915
> URL: https://issues.apache.org/jira/browse/SPARK-41915
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Removing the `framework` parameter from the API and having cloudpickle 
> automatically determine whether the user code depends on PyTorch Lightning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42103) Add Instrumentation

2023-01-17 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-42103:


 Summary: Add Instrumentation
 Key: SPARK-42103
 URL: https://issues.apache.org/jira/browse/SPARK-42103
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani


Adding instrumentation



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41775) Implement training functions as input

2023-01-16 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41775:
-
Description: 
Sidenote: make formatting updates described in 
https://github.com/apache/spark/pull/39188

 

Currently, `Distributor().run(...)` takes only files as input. Now we will add 
in additional functionality to take in functions as well. This will require us 
to go through the following process on each task in the executor nodes:
1. take the input function and args and pickle them
2. Create a temp train.py file that looks like
{code:java}
import cloudpickle
import os
if __name__ == "__main__":
    train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
    output = train(*args)
    if output and os.environ.get("RANK", "") == "0": # this is for partitionId 
== 0
        cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
3. Run that train.py file with `torchrun`

4. Check if `train_output.pkl` has been created on the process with 
partitionId == 0; if it has, deserialize it and return that output through 
`.collect()`
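
The snippet above is pseudocode: `cloudpickle.load` and `cloudpickle.dump` 
take file objects, not paths, and `dump` also needs the object to serialize. 
A runnable reconstruction (editor's sketch, with `tempdir` assumed to be 
substituted in when the distributor writes the file) might look like:

{code:python}
# Editor's reconstruction of the generated train.py, not the actual
# TorchDistributor code.
import os

import cloudpickle

tempdir = "/tmp/distributor"  # assumption: injected when the file is generated

if __name__ == "__main__":
    with open(f"{tempdir}/train_input.pkl", "rb") as f:
        train, args = cloudpickle.load(f)  # load takes a file object
    output = train(*args)
    # RANK == "0" corresponds to partitionId == 0
    if output is not None and os.environ.get("RANK", "") == "0":
        with open(f"{tempdir}/train_output.pkl", "wb") as f:
            cloudpickle.dump(output, f)  # dump takes the object and a file
{code}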

  was:
Currently, `Distributor().run(...)` takes only files as input. Now we will add 
in additional functionality to take in functions as well. This will require us 
to go through the following process on each task in the executor nodes:
1. take the input function and args and pickle them
2. Create a temp train.py file that looks like
{code:java}
import cloudpickle
import os
if __name__ == "__main__":
    train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
    output = train(*args)
    if output and os.environ.get("RANK", "") == "0": # this is for partitionId 
== 0
        cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
3. Run that train.py file with `torchrun`

4. Check if `train_output.pkl` has been created on the process with 
partitionId == 0; if it has, deserialize it and return that output through 
`.collect()`


> Implement training functions as input
> -
>
> Key: SPARK-41775
> URL: https://issues.apache.org/jira/browse/SPARK-41775
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Sidenote: make formatting updates described in 
> https://github.com/apache/spark/pull/39188
>  
> Currently, `Distributor().run(...)` takes only files as input. Now we will 
> add in additional functionality to take in functions as well. This will 
> require us to go through the following process on each task in the executor 
> nodes:
> 1. take the input function and args and pickle them
> 2. Create a temp train.py file that looks like
> {code:java}
> import cloudpickle
> import os
> if __name__ == "__main__":
>     train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
>     output = train(*args)
>     if output and os.environ.get("RANK", "") == "0": # this is for 
> partitionId == 0
>         cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
> 3. Run that train.py file with `torchrun`
> 4. Check if `train_output.pkl` has been created on the process with 
> partitionId == 0; if it has, deserialize it and return that output through 
> `.collect()`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41916) Address `spark.task.resource.gpu.amount > 1`

2023-01-05 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41916:


 Summary: Address `spark.task.resource.gpu.amount > 1`
 Key: SPARK-41916
 URL: https://issues.apache.org/jira/browse/SPARK-41916
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani


We want the distributor to have the ability to run multiple torchrun processes 
per task if task.gpu.amount > 1.
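
A sketch of the idea (assumptions throughout, not the final implementation): 
read the task GPU amount from the Spark conf and give torchrun one process 
per assigned GPU.

{code:python}
import subprocess

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# spark.task.resource.gpu.amount is a Spark conf; default to one GPU per task.
gpus_per_task = int(float(spark.conf.get("spark.task.resource.gpu.amount", "1")))

# "train.py" stands in for the distributor's generated script (hypothetical).
subprocess.run(
    ["torchrun", f"--nproc_per_node={gpus_per_task}", "train.py"],
    check=True,
)
{code}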



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41915) Change API so that the user doesn't have to explicitly set pytorch-lightning

2023-01-05 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41915:


 Summary: Change API so that the user doesn't have to explicitly 
set pytorch-lightning
 Key: SPARK-41915
 URL: https://issues.apache.org/jira/browse/SPARK-41915
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani


Removing the `framework` parameter from the API and having cloudpickle 
automatically determine whether the user code depends on PyTorch Lightning.
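
One conceivable detection heuristic (illustrative only, not the approach the 
ticket settled on): serialize the user function and look for a reference to 
the pytorch_lightning module in the payload.

{code:python}
import cloudpickle

def uses_pytorch_lightning(train_fn) -> bool:
    # Illustrative heuristic: a function that closes over or imports
    # pytorch_lightning will mention the module name in its pickle payload.
    payload = cloudpickle.dumps(train_fn)
    return b"pytorch_lightning" in payload
{code}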



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41775) Implement training functions as input

2022-12-29 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41775:
-
Description: 
Currently, `Distributor().run(...)` takes only files as input. Now we will add 
in additional functionality to take in functions as well. This will require us 
to go through the following process on each task in the executor nodes:
1. take the input function and args and pickle them
2. Create a temp train.py file that looks like
{code:java}
import cloudpickle
import os
if __name__ == "__main__":
    train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
    output = train(*args)
    if output and os.environ.get("RANK", "") == "0": # this is for partitionId 
== 0
        cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
3. Run that train.py file with `torchrun`

4. Check if `train_output.pkl` has been created on the process with 
partitionId == 0; if it has, deserialize it and return that output through 
`.collect()`

  was:
Currently, `Distributor().run(...)` takes only files as input. Now we will add 
in additional functionality to take in functions as well. This will require us 
to go through the following process on each task in the executor nodes:
1. take the input function and args and pickle them
2. Create a temp train.py file that looks like
{code:java}
import cloudpickle
import os
if __name__ == "__main__":
    train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
    output = train(*args)
    if output and os.environ.get("RANK", "") == "0": # this is for partitionId 
== 0
        cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
3. Run that train.py file with `torchrun`

4. Check if `train_output.pkl` has been created on the process with 
partitionId == 0; if it has, deserialize it and return that output


> Implement training functions as input
> -
>
> Key: SPARK-41775
> URL: https://issues.apache.org/jira/browse/SPARK-41775
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Currently, `Distributor().run(...)` takes only files as input. Now we will 
> add in additional functionality to take in functions as well. This will 
> require us to go through the following process on each task in the executor 
> nodes:
> 1. take the input function and args and pickle them
> 2. Create a temp train.py file that looks like
> {code:java}
> import cloudpickle
> import os
> if __name__ == "__main__":
>     train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
>     output = train(*args)
>     if output and os.environ.get("RANK", "") == "0": # this is for 
> partitionId == 0
>         cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
> 3. Run that train.py file with `torchrun`
> 4. Check if `train_output.pkl` has been created on the process with 
> partitionId == 0; if it has, deserialize it and return that output through 
> `.collect()`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41775) Implement training functions as input

2022-12-29 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41775:
-
Description: 
Currently, `Distributor().run(...)` takes only files as input. Now we will add 
in additional functionality to take in functions as well. This will require us 
to go through the following process on each task in the executor nodes:
1. take the input function and args and pickle them
2. Create a temp train.py file that looks like

 
{code:java}
import cloudpickle
import os
if __name__ == "__main__":
    train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
    output = train(*args)
    if output and os.environ.get("RANK", "") == "0": # this is for partitionId 
== 0
        cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
3. Run that train.py file with `torchrun`

4. Check if `train_output.pkl` has been created on the process with 
partitionId == 0; if it has, deserialize it and return that output

  was:
Currently, `Distributor().run(...)` takes only files as input. Now we will add 
in additional functionality to take in functions as well. This will require us 
to go through the following process on each task in the executor nodes:
1. take the input function and args and pickle them
2. Create a temp train.py file that looks like
```

import cloudpickle

import os

if __name__ == "__main__":

    train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")

    output = train(*args)

    if output and os.environ.get("RANK", "") == "0": # this is for partitionId 
== 0
        cloudpickle.dump(f"{tempdir}/train_output.pkl")

```

3. Run that train.py file with `torchrun`

4. Check if `train_output.pkl` has been created on the process with 
partitionId == 0; if it has, deserialize it and return that output


> Implement training functions as input
> -
>
> Key: SPARK-41775
> URL: https://issues.apache.org/jira/browse/SPARK-41775
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Currently, `Distributor().run(...)` takes only files as input. Now we will 
> add in additional functionality to take in functions as well. This will 
> require us to go through the following process on each task in the executor 
> nodes:
> 1. take the input function and args and pickle them
> 2. Create a temp train.py file that looks like
>  
> {code:java}
> import cloudpickle
> import os
> if __name__ == "__main__":
>     train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
>     output = train(*args)
>     if output and os.environ.get("RANK", "") == "0": # this is for 
> partitionId == 0
>         cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
> 3. Run that train.py file with `torchrun`
> 4. Check if `train_output.pkl` has been created on the process with 
> partitionId == 0; if it has, deserialize it and return that output



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41775) Implement training functions as input

2022-12-29 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41775:
-
Description: 
Currently, `Distributor().run(...)` takes only files as input. Now we will add 
in additional functionality to take in functions as well. This will require us 
to go through the following process on each task in the executor nodes:
1. take the input function and args and pickle them
2. Create a temp train.py file that looks like
{code:java}
import cloudpickle
import os
if __name__ == "__main__":
    train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
    output = train(*args)
    if output and os.environ.get("RANK", "") == "0": # this is for partitionId 
== 0
        cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
3. Run that train.py file with `torchrun`

4. Check if `train_output.pkl` has been created on the process with 
partitionId == 0; if it has, deserialize it and return that output

  was:
Currently, `Distributor().run(...)` takes only files as input. Now we will add 
in additional functionality to take in functions as well. This will require us 
to go through the following process on each task in the executor nodes:
1. take the input function and args and pickle them
2. Create a temp train.py file that looks like

 
{code:java}
import cloudpickle
import os
if __name__ == "__main__":
    train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
    output = train(*args)
    if output and os.environ.get("RANK", "") == "0": # this is for partitionId 
== 0
        cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
3. Run that train.py file with `torchrun`

4. Check if `train_output.pkl` has been created on the process with 
partitionId == 0; if it has, deserialize it and return that output


> Implement training functions as input
> -
>
> Key: SPARK-41775
> URL: https://issues.apache.org/jira/browse/SPARK-41775
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Currently, `Distributor().run(...)` takes only files as input. Now we will 
> add in additional functionality to take in functions as well. This will 
> require us to go through the following process on each task in the executor 
> nodes:
> 1. take the input function and args and pickle them
> 2. Create a temp train.py file that looks like
> {code:java}
> import cloudpickle
> import os
> if __name__ == "__main__":
>     train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
>     output = train(*args)
>     if output and os.environ.get("RANK", "") == "0": # this is for 
> partitionId == 0
>         cloudpickle.dump(f"{tempdir}/train_output.pkl") {code}
> 3. Run that train.py file with `torchrun`
> 4. Check if `train_output.pkl` has been created on the process with 
> partitionId == 0; if it has, deserialize it and return that output



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41775) Implement training functions as input

2022-12-29 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41775:
-
Description: 
Currently, `Distributor().run(...)` takes only files as input. Now we will add 
in additional functionality to take in functions as well. This will require us 
to go through the following process on each task in the executor nodes:
1. take the input function and args and pickle them
2. Create a temp train.py file that looks like
```

import cloudpickle

import os

if __name__ == "__main__":

    train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")

    output = train(*args)

    if output and os.environ.get("RANK", "") == "0": # this is for partitionId 
== 0
        cloudpickle.dump(f"{tempdir}/train_output.pkl")

```

3. Run that train.py file with `torchrun`

4. Check if `train_output.pkl` has been created on the process with 
partitionId == 0; if it has, deserialize it and return that output

  was:
Currently, `Distributor().run(...)` takes only files as input. Now we will add 
in additional functionality to take in functions as well. This will require us 
to go through the following process on each task in the executor nodes:
1. take the input function and args and pickle them
2. Create a temp train.py file that looks like
```python

import cloudpickle

import os

if __name__ == "__main__":

    train, args = cloudpickle.load(f"\{tempdir}/train_input.pkl")

    output = train(*args)

    if output and os.environ.get("RANK", "") == "0": # this is for partitionId 
== 0
        cloudpickle.dump(f"\{tempdir}/train_output.pkl")

```

3. Run that train.py file with `torchrun`

4. Check if `train_output.pkl` has been created on the process with 
partitionId == 0; if it has, deserialize it and return that output


> Implement training functions as input
> -
>
> Key: SPARK-41775
> URL: https://issues.apache.org/jira/browse/SPARK-41775
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Currently, `Distributor().run(...)` takes only files as input. Now we will 
> add in additional functionality to take in functions as well. This will 
> require us to go through the following process on each task in the executor 
> nodes:
> 1. take the input function and args and pickle them
> 2. Create a temp train.py file that looks like
> ```
> import cloudpickle
> import os
> if __name__ == "__main__":
>     train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
>     output = train(*args)
>     if output and os.environ.get("RANK", "") == "0": # this is for 
> partitionId == 0
>         cloudpickle.dump(f"{tempdir}/train_output.pkl")
> ```
> 3. Run that train.py file with `torchrun`
> 4. Check if `train_output.pkl` has been created on the process with 
> partitionId == 0; if it has, deserialize it and return that output



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41777) Add Integration Tests

2022-12-29 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41777:


 Summary: Add Integration Tests
 Key: SPARK-41777
 URL: https://issues.apache.org/jira/browse/SPARK-41777
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani


This requires us to add PyTorch as a testing dependency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41776) Implement support for PyTorch Lightning

2022-12-29 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41776:


 Summary: Implement support for PyTorch Lightning
 Key: SPARK-41776
 URL: https://issues.apache.org/jira/browse/SPARK-41776
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani


This requires us to just call train() on each spark task separately without 
much preprocessing or postprocessing because PyTorch Lightning handles that by 
itself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41775) Implement training functions as input

2022-12-29 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41775:
-
Component/s: ML

> Implement training functions as input
> -
>
> Key: SPARK-41775
> URL: https://issues.apache.org/jira/browse/SPARK-41775
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Currently, `Distributor().run(...)` takes only files as input. Now we will 
> add in additional functionality to take in functions as well. This will 
> require us to go through the following process on each task in the executor 
> nodes:
> 1. take the input function and args and pickle them
> 2. Create a temp train.py file that looks like
> ```python
> import cloudpickle
> import os
> if __name__ == "__main__":
>     train, args = cloudpickle.load(f"{tempdir}/train_input.pkl")
>     output = train(*args)
>     if output and os.environ.get("RANK", "") == "0": # this is for 
> partitionId == 0
>         cloudpickle.dump(f"{tempdir}/train_output.pkl")
> ```
> 3. Run that train.py file with `torchrun`
> 4. Check if `train_output.pkl` has been created on the process with 
> partitionId == 0; if it has, deserialize it and return that output



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41775) Implement training functions as input

2022-12-29 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41775:


 Summary: Implement training functions as input
 Key: SPARK-41775
 URL: https://issues.apache.org/jira/browse/SPARK-41775
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani


Currently, `Distributor().run(...)` takes only files as input. Now we will add 
in additional functionality to take in functions as well. This will require us 
to go through the following process on each task in the executor nodes:
1. take the input function and args and pickle them
2. Create a temp train.py file that looks like
```python

import cloudpickle

import os

if __name__ == "__main__":

    train, args = cloudpickle.load(f"\{tempdir}/train_input.pkl")

    output = train(*args)

    if output and os.environ.get("RANK", "") == "0": # this is for partitionId 
== 0
        cloudpickle.dump(f"\{tempdir}/train_output.pkl")

```

3. Run that train.py file with `torchrun`

4. Check if `train_output.pkl` has been created on the process with 
partitionId == 0; if it has, deserialize it and return that output



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41589) PyTorch Distributor

2022-12-20 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41589:
-
Description: This is a project to make it easier for PySpark users to 
distribute PyTorch code using PySpark. The corresponding [Design 
Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing]
 can give more context. This was a project determined by the Databricks ML 
Training Team; please reach out to [~gurwls223] (Spark-side) or [~erithwik] for 
more context.  (was: This is a project to make it easier for PySpark users to 
distribute PyTorch code using PySpark. The corresponding [Design 
Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing]
 can give more context. This was a project determined by the Databricks ML 
Training Team; please reach out to [~gurwls223] (Spark-side proxy) or 
[~erithwik] for more context.)
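
For reference, a minimal usage sketch of the distributor this umbrella tracks 
(the training body is a placeholder):

{code:python}
from pyspark.ml.torch.distributor import TorchDistributor

def train(learning_rate):
    # User-defined PyTorch training loop would go here; the value returned
    # by the rank-0 process is handed back to the driver.
    return learning_rate

distributor = TorchDistributor(num_processes=2, local_mode=False, use_gpu=True)
result = distributor.run(train, 1e-3)
{code}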

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing]
>  can give more context. This was a project determined by the Databricks ML 
> Training Team; please reach out to [~gurwls223] (Spark-side) or [~erithwik] 
> for more context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41589) PyTorch Distributor

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17649526#comment-17649526
 ] 

Rithwik Ediga Lakhamsani commented on SPARK-41589:
--

[~xkrogen] I created a new copy; please let me know if you still can't see it. 
Thank you for your patience!

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing]
>  can give more context. This was a project determined by the Databricks ML 
> Training Team; please reach out to [~gurwls223] (Spark-side proxy) or 
> [~erithwik] for more context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41589) PyTorch Distributor

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41589:
-
Description: This is a project to make it easier for PySpark users to 
distribute PyTorch code using PySpark. The corresponding [Design 
Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing]
 can give more context. This was a project determined by the Databricks ML 
Training Team; please reach out to [~gurwls223] (Spark-side proxy) or 
[~erithwik] for more context.  (was: This is a project to make it easier for 
PySpark users to distribute PyTorch code using PySpark. The corresponding 
[Design 
Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
 and 
[PRD|https://docs.google.com/document/d/1KprHkzx9r3lv47TLgO6FnkYZT92xOx6OeKvTJPxqpfk/edit]
 can give more context. This was a project determined by the Databricks ML 
Training Team; please reach out to [~gurwls223] (Spark-side proxy) or 
[~erithwik] for more context.)

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing]
>  can give more context. This was a project determined by the Databricks ML 
> Training Team; please reach out to [~gurwls223] (Spark-side proxy) or 
> [~erithwik] for more context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41589) PyTorch Distributor

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17649516#comment-17649516
 ] 

Rithwik Ediga Lakhamsani commented on SPARK-41589:
--

Sorry, I need to update it with a new copy. I will add a new comment on this 
ticket when the new document is available.

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
>  and 
> [PRD|https://docs.google.com/document/d/1KprHkzx9r3lv47TLgO6FnkYZT92xOx6OeKvTJPxqpfk/edit]
>  can give more context. This was a project determined by the Databricks ML 
> Training Team; please reach out to [~gurwls223] (Spark-side proxy) or 
> [~erithwik] for more context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41589) PyTorch Distributor

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17649511#comment-17649511
 ] 

Rithwik Ediga Lakhamsani edited comment on SPARK-41589 at 12/20/22 12:27 AM:
-

Oh sorry, let me fix that! Does it work now [~xkrogen]?


was (Author: JIRAUSER298573):
Oh sorry, let me fix that! 

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
>  and 
> [PRD|https://docs.google.com/document/d/1KprHkzx9r3lv47TLgO6FnkYZT92xOx6OeKvTJPxqpfk/edit]
>  can give more context. This was a project determined by the Databricks ML 
> Training Team; please reach out to [~gurwls223] (Spark-side proxy) or 
> [~erithwik] for more context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41589) PyTorch Distributor

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41589:
-
Description: This is a project to make it easier for PySpark users to 
distribute PyTorch code using PySpark. The corresponding [Design 
Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
 and 
[PRD|https://docs.google.com/document/d/1KprHkzx9r3lv47TLgO6FnkYZT92xOx6OeKvTJPxqpfk/edit]
 can give more context. This was a project determined by the Databricks ML 
Training Team; please reach out to [~gurwls223] (Spark-side proxy) or 
[~erithwik] for more context.  (was: This is a project to make it easier for 
PySpark users to distribute PyTorch code using PySpark. The corresponding 
[Design 
Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
 and 
[PRD|https://docs.google.com/document/d/1KprHkzx9r3lv47TLgO6FnkYZT92xOx6OeKvTJPxqpfk/edit]
 can give more context. )

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
>  and 
> [PRD|https://docs.google.com/document/d/1KprHkzx9r3lv47TLgO6FnkYZT92xOx6OeKvTJPxqpfk/edit]
>  can give more context. This was a project determined by the Databricks ML 
> Training Team; please reach out to [~gurwls223] (Spark-side proxy) or 
> [~erithwik] for more context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41589) PyTorch Distributor

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17649511#comment-17649511
 ] 

Rithwik Ediga Lakhamsani commented on SPARK-41589:
--

Oh sorry, let me fix that! 

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
>  and 
> [PRD|https://docs.google.com/document/d/1KprHkzx9r3lv47TLgO6FnkYZT92xOx6OeKvTJPxqpfk/edit]
>  can give more context. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41592) Implement functionality for training a PyTorch file on the executors

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41592:


 Summary: Implement functionality for training a PyTorch file on 
the executors
 Key: SPARK-41592
 URL: https://issues.apache.org/jira/browse/SPARK-41592
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41593) Implement logging from the executor nodes

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41593:


 Summary: Implement logging from the executor nodes
 Key: SPARK-41593
 URL: https://issues.apache.org/jira/browse/SPARK-41593
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41591) Implement functionality for training a PyTorch file locally

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41591:


 Summary: Implement functionality for training a PyTorch file 
locally
 Key: SPARK-41591
 URL: https://issues.apache.org/jira/browse/SPARK-41591
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41589) PyTorch Distributor

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17649509#comment-17649509
 ] 

Rithwik Ediga Lakhamsani commented on SPARK-41589:
--

I am working on this.

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
>  and 
> [PRD|https://docs.google.com/document/d/1KprHkzx9r3lv47TLgO6FnkYZT92xOx6OeKvTJPxqpfk/edit]
>  can give more context. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41589) PyTorch Distributor

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41589:
-
Description: This is a project to make it easier for PySpark users to 
distribute PyTorch code using PySpark. The corresponding [Design 
Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
 and 
[PRD|https://docs.google.com/document/d/1KprHkzx9r3lv47TLgO6FnkYZT92xOx6OeKvTJPxqpfk/edit]
 can give more context. 

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
>  and 
> [PRD|https://docs.google.com/document/d/1KprHkzx9r3lv47TLgO6FnkYZT92xOx6OeKvTJPxqpfk/edit]
>  can give more context. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41590) Implement Baseline API Code

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41590:


 Summary: Implement Baseline API Code
 Key: SPARK-41590
 URL: https://issues.apache.org/jira/browse/SPARK-41590
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani


Creating a baseline API so that we can agree on how the users will interact 
with the code. This was determined in this [Design 
Document|https://docs.google.com/document/d/1_nhUP46cHnYmnZoyirySXvuY1KDMU3vdHRx9MngSVtA/edit]
 and can be updated as necessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41589) PyTorch Distributor

2022-12-19 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-41589:


 Summary: PyTorch Distributor
 Key: SPARK-41589
 URL: https://issues.apache.org/jira/browse/SPARK-41589
 Project: Spark
  Issue Type: Umbrella
  Components: ML
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org