Michael Dreibelbis created SPARK-24597:
------------------------------------------

             Summary: Spark ML Pipeline Should support non-linear models => DAGPipeline
                 Key: SPARK-24597
                 URL: https://issues.apache.org/jira/browse/SPARK-24597
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 2.3.1
            Reporter: Michael Dreibelbis


Currently, Spark ML's Pipeline/PipelineModel supports only a single, linear chain of transformations over one dataset, despite the documentation stating otherwise:

[reference documentation|https://spark.apache.org/docs/2.3.0/ml-pipeline.html#details]

I propose implementing a DAGPipeline that supports multiple datasets as input.

The code could look something like this:

{code:java}
val ds1 = /*dataset 1 creation*/
val ds2 = /*dataset 2 creation*/

// nodes take on uid from estimator/transformer
val i1 = IdentityNode(new IdentityTransformer("i1"))
val i2 = IdentityNode(new IdentityTransformer("i2"))
val bi = TransformerNode(new Binarizer("bi"))
val cv = EstimatorNode(new CountVectorizer("cv"))
val idf = EstimatorNode(new IDF("idf"))
val j1 = JoinerNode(new Joiner("j1"))

// j1 is included here because the edges below reference it
val nodes = Array(i1, i2, bi, cv, idf, j1)
// each edge is (from uid, to uid): two branches that meet at the joiner
val edges = Array(
  ("i1", "cv"), ("cv", "idf"), ("idf", "j1"),
  ("i2", "bi"), ("bi", "j1"))

// the real inputs are bound to the identity nodes via setIdentity,
// so an empty DataFrame is passed to fit/transform
val p = new DAGPipeline(nodes, edges)
  .setIdentity("i1", ds1)
  .setIdentity("i2", ds2)
val m = p.fit(spark.emptyDataFrame)

m.setIdentity("i1", ds1).setIdentity("i2", ds2)
m.transform(spark.emptyDataFrame)
{code}
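
To illustrate how a DAGPipeline might decide the order in which to fit its nodes, here is a minimal sketch in plain Scala (no Spark dependency) that checks the node/edge lists and orders the uids topologically using Kahn's algorithm. The `DagOrder` object and its `topologicalOrder` method are hypothetical names made up for this illustration. They are not part of Spark or of any existing implementation of this proposal.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark: orders node uids so that every
// node comes after all of its upstream nodes.
object DagOrder {
  /** Returns the uids in a valid execution order, or an error message if an
    * edge references an unknown uid or the edges contain a cycle. */
  def topologicalOrder(
      nodes: Seq[String],
      edges: Seq[(String, String)]): Either[String, Vector[String]] = {
    val known = nodes.toSet
    val unknown = edges.flatMap { case (from, to) => Seq(from, to) }.filterNot(known)
    if (unknown.nonEmpty) {
      Left(s"edges reference unknown nodes: ${unknown.distinct.mkString(", ")}")
    } else {
      // Count incoming edges per node; nodes with none are the DAG's inputs.
      val inDegree = mutable.Map(nodes.map(_ -> 0): _*)
      edges.foreach { case (_, to) => inDegree(to) += 1 }
      val successors = edges.groupBy(_._1).map { case (from, es) => from -> es.map(_._2) }

      // Kahn's algorithm: repeatedly emit a node whose inputs are all done.
      val ready = mutable.Queue(nodes.filter(inDegree(_) == 0): _*)
      val order = Vector.newBuilder[String]
      var emitted = 0
      while (ready.nonEmpty) {
        val n = ready.dequeue()
        order += n
        emitted += 1
        successors.getOrElse(n, Seq.empty).foreach { s =>
          inDegree(s) -= 1
          if (inDegree(s) == 0) ready.enqueue(s)
        }
      }
      // Any node never emitted is stuck behind a cycle.
      if (emitted == nodes.size) Right(order.result()) else Left("edges contain a cycle")
    }
  }
}

val uids = Seq("i1", "i2", "bi", "cv", "idf", "j1")
val links = Seq(
  ("i1", "cv"), ("cv", "idf"), ("idf", "j1"),
  ("i2", "bi"), ("bi", "j1"))

println(DagOrder.topologicalOrder(uids, links))
```

With the uids and edges from the example, both identity nodes come first and the joiner comes last, after both of its branches have been processed. Leaving "j1" out of the node list makes the first check fail, because the edges still reference it.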