[ https://issues.apache.org/jira/browse/SPARK-43790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weichen Xu reassigned SPARK-43790:
----------------------------------

    Assignee: Weichen Xu

> Add API `copyLocalFileToHadoopFS`
> ---------------------------------
>
>                 Key: SPARK-43790
>                 URL: https://issues.apache.org/jira/browse/SPARK-43790
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, ML, PySpark
>    Affects Versions: 3.5.0
>            Reporter: Weichen Xu
>            Assignee: Weichen Xu
>            Priority: Major
>
> In the new distributed Spark ML module (designed to support Spark Connect and local inference), we need to save ML models to the Hadoop file system using a custom binary file format, for the following reasons:
>
> * We often submit a Spark application to a Spark cluster to run the model training job, and the trained model must be saved to the Hadoop file system before the Spark application completes.
> * We also want to support local model inference. If we save the model with the current Spark DataFrame writer (e.g. Parquet format), loading the model requires the Spark service. We want to be able to load the model without the Spark service, so the model should be saved in the original binary format that our ML code can handle.
>
> So we need to add an API like `copyLocalFileToHadoopFS`. The implementation could be:
>
> (1) Call the `add_artifact` API to upload the local file to the Spark driver (Spark Connect already supports this).
> (2) On the Spark driver side, get `sparkContext.hadoopConf` and use the Hadoop FileSystem API to upload the file to the Hadoop FS.
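
A minimal sketch of step (2), assuming the file has already been uploaded to the driver's local disk via the artifact API. The object name `ModelFileUtils`, the method signature, and the overwrite behavior are illustrative assumptions, not the committed implementation:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Hypothetical driver-side helper: copies a file that already resides on the
// driver's local disk (e.g. uploaded via the Spark Connect artifact API) to a
// Hadoop-compatible file system (HDFS, S3A, ...).
object ModelFileUtils {
  def copyLocalFileToHadoopFS(spark: SparkSession, localPath: String, destPath: String): Unit = {
    // The driver's Hadoop configuration (sparkContext.hadoopConfiguration).
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    val dest = new Path(destPath)
    // Resolve the FileSystem implementation from the destination URI scheme.
    val fs = dest.getFileSystem(hadoopConf)
    // copyFromLocalFile(delSrc = false, overwrite = true, src, dst)
    fs.copyFromLocalFile(false, true, new Path(localPath), dest)
  }
}
{code}

On the client side, the flow would roughly be: upload the locally serialized model file with `add_artifact`, then invoke a helper like the above on the driver with the destination Hadoop path.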