Weichen Xu created SPARK-43790:
----------------------------------
Summary: Add API `copyLocalFileToHadoopFS`
Key: SPARK-43790
URL: https://issues.apache.org/jira/browse/SPARK-43790
Project: Spark
Issue Type: Sub-task
Components: Connect, ML, PySpark
Affects Versions: 3.5.0
Reporter: Weichen Xu
In the new distributed Spark ML module (designed to support Spark Connect and
local inference), we need to save ML models to the Hadoop file system in a
custom binary file format, for these reasons:
* We often submit a Spark application to a Spark cluster to run the model
training job, and we need to save the trained model to the Hadoop file system
before the Spark application completes.
* We also want to support local model inference. If we save the model with the
current Spark DataFrame writer (e.g. in Parquet format), loading it requires a
Spark service, but we want to be able to load the model without one. So the
model should be saved in the original binary format that our ML code can
handle directly.
So we need to add an API like `copyLocalFileToHadoopFS`.
The implementation could be:
(1) Call the `add_artifact` API to upload the local file to the Spark driver
(Spark Connect already supports this).
(2) On the Spark driver side, get `sparkContext.hadoopConf` and then use the
Hadoop FileSystem API to upload the file to the Hadoop FS.
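Step (2) could be sketched in PySpark roughly as below. The function name
`copy_local_file_to_hadoop_fs` and its signature are illustrative assumptions
from this ticket, not an existing API; `JavaSparkContext.hadoopConfiguration()`
and `FileSystem.copyFromLocalFile` are existing Spark/Hadoop APIs reached
through the py4j gateway:

```python
def copy_local_file_to_hadoop_fs(sc, local_path, dest_path):
    """Copy a file from the driver's local disk to the cluster's Hadoop FS.

    Assumes it runs on the driver, after step (1) has already uploaded the
    local file there via the Spark Connect artifact mechanism.
    """
    jvm = sc._jvm
    # The `sparkContext.hadoopConf` mentioned above, via the JVM-side context.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    src = jvm.org.apache.hadoop.fs.Path(local_path)
    dst = jvm.org.apache.hadoop.fs.Path(dest_path)
    # Resolve the FileSystem from the destination's scheme (hdfs://, s3a://, ...).
    fs = dst.getFileSystem(hadoop_conf)
    # delSrc=False keeps the local copy; overwrite=True replaces any existing file.
    fs.copyFromLocalFile(False, True, src, dst)
```

With this in place, the public `copyLocalFileToHadoopFS` API on the Connect
client would chain step (1) (artifact upload) and then invoke this driver-side
copy.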
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]