Weichen Xu created SPARK-43790:
----------------------------------

             Summary: Add API `copyLocalFileToHadoopFS`
                 Key: SPARK-43790
                 URL: https://issues.apache.org/jira/browse/SPARK-43790
             Project: Spark
          Issue Type: Sub-task
          Components: Connect, ML, PySpark
    Affects Versions: 3.5.0
            Reporter: Weichen Xu


In the new distributed Spark ML module (designed to support Spark Connect and 
local inference),

we need to save ML models to a Hadoop file system using a custom binary file 
format, for the following reasons:
 * We often submit a Spark application to a Spark cluster to run the model 
training job, and we need to save the trained model to the Hadoop file system 
before the Spark application completes.
 * We also want to support local model inference. If we save the model with the 
current Spark DataFrame writer (e.g. in Parquet format), loading the model 
requires a Spark service. We want to be able to load the model without a Spark 
service, so the model should be saved in the original binary format that our ML 
code can handle.

 

So we need to add an API like `copyLocalFileToHadoopFS`.

The implementation could be:

(1) Call the `add_artifact` API to upload the local file to the Spark driver 
(Spark Connect already supports this).

(2) On the Spark driver side, get `sparkContext.hadoopConf` and then use the 
Hadoop FileSystem API to upload the file to the Hadoop FS.
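The two-step flow above can be sketched in plain Python, with temp directories standing in for the Spark driver's artifact store and the Hadoop FS. This is an illustrative stand-in only: in a real implementation, step 1 would go through the Spark Connect artifact API (e.g. `SparkSession.addArtifacts`) and step 2 through Hadoop's `FileSystem.copyFromLocalFile` using `sparkContext.hadoopConfiguration`; the function name and all paths here are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def copy_local_file_to_hadoop_fs(local_path, driver_dir, hadoop_fs_dir):
    """Illustrative stand-in for the proposed `copyLocalFileToHadoopFS` API.

    Step 1 mimics `add_artifact`: the client uploads the local file to the driver.
    Step 2 mimics the driver copying it onward via the Hadoop FileSystem API.
    """
    # Step 1: "upload" the local file to the driver's artifact store
    # (real code: Spark Connect add_artifact / addArtifacts)
    staged = Path(driver_dir) / Path(local_path).name
    shutil.copyfile(local_path, staged)

    # Step 2: driver-side copy to the target file system
    # (real code: FileSystem.get(hadoopConf).copyFromLocalFile(src, dst))
    dest = Path(hadoop_fs_dir) / staged.name
    shutil.copyfile(staged, dest)
    return dest

# Demo: a "trained model" saved locally in a custom binary format,
# then moved to the (simulated) Hadoop FS in two hops.
with tempfile.TemporaryDirectory() as work:
    src = Path(work) / "model.bin"
    src.write_bytes(b"binary model payload")
    driver = Path(work) / "driver"
    driver.mkdir()
    fs = Path(work) / "hdfs"
    fs.mkdir()
    out = copy_local_file_to_hadoop_fs(src, driver, fs)
    print(out.read_bytes() == b"binary model payload")
```

Because the model file is written as-is (no DataFrame writer involved), it can later be read back for local inference without any running Spark service.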

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
