[jira] [Created] (SPARK-50812) [Connect] Support spark.ml on Connect

Ruifeng Zheng (Jira) Tue, 14 Jan 2025 03:33:09 -0800

Ruifeng Zheng created SPARK-50812:
-------------------------------------

             Summary: [Connect] Support spark.ml on Connect
                 Key: SPARK-50812
                 URL: https://issues.apache.org/jira/browse/SPARK-50812
             Project: Spark
          Issue Type: Improvement
          Components: Connect, PySpark
    Affects Versions: 4.0.0
            Reporter: Ruifeng Zheng



Since Apache Spark 3.4, Spark has supported Spark Connect, which introduced a 
decoupled client-server architecture that allows remote connectivity to Spark 
clusters, using the DataFrame API and unresolved logical plans as the protocol. 
The separation between client and server allows Spark and its open ecosystem to 
be used from anywhere: it can be embedded in modern data applications, IDEs, 
notebooks, and programming languages.

However, Spark Connect currently supports only Spark SQL, which means Spark ML 
training and inference cannot run via Spark Connect. This will probably cost 
Spark some ML users.

So I would like to propose a way to support Spark ML on Spark Connect. Users 
would not need to change their code to run Spark ML workloads through Connect.

Here are some links:

Design doc: [Support spark.ml on Connect|https://docs.google.com/document/d/1EUvSZuI-so83cxb_fTVMoz0vUfAaFmqXt39yoHI-D9I/edit?usp=sharing]

Draft PR: [https://github.com/wbo4958/spark/pull/5]

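Example code:
{code:python}
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

# Create a Spark Connect session against a server on localhost
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# Four labeled points with two features each
df = spark.createDataFrame([
    (Vectors.dense([1.0, 2.0]), 1),
    (Vectors.dense([2.0, -1.0]), 1),
    (Vectors.dense([-3.0, -2.0]), 0),
    (Vectors.dense([-1.0, -2.0]), 0),
], schema=['features', 'label'])

lr = LogisticRegression()
lr.setMaxIter(30)

# Fit the model through the Connect session
model: LogisticRegressionModel = lr.fit(df)
assert model.getMaxIter() == 30

# Single-vector prediction
x = model.predictRaw(Vectors.dense([1.0, 2.0]))
print(f"predictRaw {x}")

# Training summary and metrics
summary = model.summary
summary.roc.show()
print(summary.weightedRecall)
print(summary.recallByLabel)

# Learned parameters
print(model.coefficients)
print(model.intercept)

# Batch inference on a DataFrame
model.transform(df).show()
{code}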
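
To illustrate the "no code change" goal, here is a minimal sketch in which only the session construction differs between a classic session and a Connect session, and the training function is shared. The environment variable DEMO_SPARK_REMOTE and the helpers remote_url, build_session, and train are hypothetical names introduced for this illustration; they are not part of Spark or of the draft PR. Whether every spark.ml API behaves identically over Connect depends on the implementation proposed here.

```python
import os
from typing import Mapping, Optional


def remote_url(env: Mapping[str, str]) -> Optional[str]:
    """Read the Connect endpoint from DEMO_SPARK_REMOTE (hypothetical), or return None."""
    url = env.get("DEMO_SPARK_REMOTE", "").strip()
    return url or None


def build_session(env: Mapping[str, str] = os.environ):
    """Build a Connect session if an endpoint is configured, else a local classic one."""
    # Imported lazily so remote_url() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    url = remote_url(env)
    if url is not None:
        # Spark Connect client session, e.g. "sc://localhost"
        return SparkSession.builder.remote(url).getOrCreate()
    # Classic in-process session for local runs
    return SparkSession.builder.master("local[*]").getOrCreate()


def train(spark):
    """Same spark.ml code regardless of which kind of session is passed in."""
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    df = spark.createDataFrame([
        (Vectors.dense([1.0, 2.0]), 1),
        (Vectors.dense([2.0, -1.0]), 1),
        (Vectors.dense([-3.0, -2.0]), 0),
        (Vectors.dense([-1.0, -2.0]), 0),
    ], schema=["features", "label"])
    return LogisticRegression(maxIter=30).fit(df)
```

Calling train(build_session()) would then run the same training code in either mode, with the mode chosen only by whether DEMO_SPARK_REMOTE is set.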



