Hi All,
I worked on this idea a few years back as a pet project to bridge *SparkSQL*
and *SparkML*, and to empower anyone with SQL skills to implement
production-grade, distributed machine learning on Apache Spark.

In principle the idea works like Google's BigQuery ML, but with a much wider
scope and no vendor lock-in: it works against basically every data source
supported by Spark, in the cloud or on-prem.

*Training* an ML model can look like,

FIT 'LogisticRegression' ESTIMATOR WITH PARAMS(maxIter = 3)
TO (SELECT * FROM mlDataset)
AND OVERWRITE AT LOCATION '/path/to/lr-model';
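
For comparison, this is roughly what you would otherwise write by hand with
plain Spark ML in Scala (a sketch, assuming mlDataset is a registered view
that already exposes the standard 'features' and 'label' columns):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression

val spark = SparkSession.builder().appName("fit-example").getOrCreate()

// The query that feeds the estimator.
val training = spark.sql("SELECT * FROM mlDataset")

// WITH PARAMS(maxIter = 3) corresponds to the estimator's setters.
val lr = new LogisticRegression().setMaxIter(3)
val model = lr.fit(training)

// AND OVERWRITE AT LOCATION '...' corresponds to an overwriting model save.
model.write.overwrite().save("/path/to/lr-model")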

*Prediction* with an ML model can look like,

PREDICT FOR (SELECT * FROM mlTestDataset)
USING MODEL STORED AT LOCATION '/path/to/lr-model'
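
The hand-written equivalent is roughly loading the persisted model and
calling transform() (a sketch, assuming the model at that path is the
logistic regression model saved above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegressionModel

val spark = SparkSession.builder().appName("predict-example").getOrCreate()

val test = spark.sql("SELECT * FROM mlTestDataset")

// USING MODEL STORED AT LOCATION corresponds to loading the persisted model ...
val model = LogisticRegressionModel.load("/path/to/lr-model")

// ... and PREDICT corresponds to transform(), which appends prediction columns.
val predictions = model.transform(test)
predictions.show()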

*Feature Preprocessing* can look like,

TRANSFORM (SELECT * FROM dataset)
USING 'StopWordsRemover' TRANSFORMER WITH PARAMS (inputCol='raw', outputCol='filtered')
AND WRITE AT LOCATION '/path/to/test-transformer'
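
Again, the plain Spark ML equivalent would be roughly the following (a
sketch, assuming 'dataset' has an array-of-strings column named 'raw'):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StopWordsRemover

val spark = SparkSession.builder().appName("transform-example").getOrCreate()

val dataset = spark.sql("SELECT * FROM dataset")

// WITH PARAMS (inputCol='raw', outputCol='filtered') corresponds to the setters.
val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")

val filtered = remover.transform(dataset)

// AND WRITE AT LOCATION corresponds to persisting the configured transformer.
remover.save("/path/to/test-transformer")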


But a lot more can be done with this library.

I was wondering if any of you find this interesting and would like to
contribute to the project here,

https://github.com/chitralverma/sparksql-ml


Regards,
Chitral Verma
