[jira] [Created] (SPARK-24656) SparkML Transformers and Estimators with multiple columns
Michael Dreibelbis created SPARK-24656:
---------------------------------------

             Summary: SparkML Transformers and Estimators with multiple columns
                 Key: SPARK-24656
                 URL: https://issues.apache.org/jira/browse/SPARK-24656
             Project: Spark
          Issue Type: New Feature
          Components: ML, MLlib
    Affects Versions: 2.3.1
            Reporter: Michael Dreibelbis

Currently SparkML Transformers and Estimators operate on single input/output column pairs. This makes pipelines extremely cumbersome (as well as non-performant) when transformations on multiple columns need to be made. I am proposing to implement ParallelPipelineStage/Transformer/Estimator/Model variants that would operate on the input columns in parallel.

{code:java}
// old way
val pipeline = new Pipeline().setStages(Array(
  new CountVectorizer().setInputCol("_1").setOutputCol("_1_cv"),
  new CountVectorizer().setInputCol("_2").setOutputCol("_2_cv"),
  new IDF().setInputCol("_1_cv").setOutputCol("_1_idf"),
  new IDF().setInputCol("_2_cv").setOutputCol("_2_idf")
))

// proposed way
val pipeline2 = new Pipeline().setStages(Array(
  new ParallelCountVectorizer().setInputCols(Array("_1", "_2")).setOutputCols(Array("_1_cv", "_2_cv")),
  new ParallelIDF().setInputCols(Array("_1_cv", "_2_cv")).setOutputCols(Array("_1_idf", "_2_idf"))
))
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
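A minimal, Spark-free sketch of the idea (all names here are hypothetical, modeled on the proposed `setInputCols`/`setOutputCols` API, not on any existing Spark class): a stage holds one single-column transformation and applies it to each input/output column pair, so one stage replaces N copies of a single-column stage.

```scala
// Hypothetical sketch only -- no Spark dependency. A toy "dataset" is a
// map from column name to values; ParallelStage applies one single-column
// transform to several input/output column pairs in a single stage.
object ParallelStageSketch {
  type Dataset = Map[String, Seq[String]]

  final case class ParallelStage(
      inputCols: Seq[String],
      outputCols: Seq[String],
      transform: Seq[String] => Seq[String]) {
    require(inputCols.length == outputCols.length,
      "inputCols and outputCols must pair up")

    // Each output column is the transform of its paired input column;
    // existing columns are left untouched.
    def apply(ds: Dataset): Dataset =
      inputCols.zip(outputCols).foldLeft(ds) { case (acc, (in, out)) =>
        acc + (out -> transform(acc(in)))
      }
  }
}
```

The real implementation would presumably share one pass over the data (or fit the N models concurrently), which is where the performance win over N separate pipeline stages comes from.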
[jira] [Created] (SPARK-24597) Spark ML Pipeline Should support non-linear models => DAGPipeline
Michael Dreibelbis created SPARK-24597:
---------------------------------------

             Summary: Spark ML Pipeline Should support non-linear models => DAGPipeline
                 Key: SPARK-24597
                 URL: https://issues.apache.org/jira/browse/SPARK-24597
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 2.3.1
            Reporter: Michael Dreibelbis

Currently SparkML Pipeline/PipelineModel supports only a single linear sequence of transformations on one dataset, despite the documentation stating otherwise: [reference documentation|https://spark.apache.org/docs/2.3.0/ml-pipeline.html#details]

I'm proposing implementing a DAGPipeline that supports multiple datasets as input. The code could look something like this:

{code:java}
val ds1 = /*dataset 1 creation*/
val ds2 = /*dataset 2 creation*/

// nodes take on uid from estimator/transformer
val i1 = IdentityNode(new IdentityTransformer("i1"))
val i2 = IdentityNode(new IdentityTransformer("i2"))
val bi = TransformerNode(new Binarizer("bi"))
val cv = EstimatorNode(new CountVectorizer("cv"))
val idf = EstimatorNode(new IDF("idf"))
val j1 = JoinerNode(new Joiner("j1"))

val nodes = Array(i1, i2, bi, cv, idf, j1)
val edges = Array(
  ("i1", "cv"),
  ("cv", "idf"),
  ("idf", "j1"),
  ("i2", "bi"),
  ("bi", "j1"))

val p = new DAGPipeline(nodes, edges)
  .setIdentity("i1", ds1)
  .setIdentity("i2", ds2)
val m = p.fit(spark.emptyDataFrame)
m.setIdentity("i1", ds1).setIdentity("i2", ds2)
m.transform(spark.emptyDataFrame)
{code}
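Independent of the proposed Spark API (whose node classes above are hypothetical), the heart of any DAGPipeline is evaluating nodes in topological order of the edge list, so each node fits/transforms only after all of its upstream nodes. A minimal plain-Scala sketch using Kahn's algorithm, with node uids matching the edge list above:

```scala
import scala.collection.mutable

// Kahn's algorithm: repeatedly emit a node with no remaining incoming
// edges, then release that node's outgoing edges.
def topoOrder(nodes: Seq[String], edges: Seq[(String, String)]): Seq[String] = {
  val indegree = mutable.Map.empty[String, Int] ++
    nodes.map(n => n -> edges.count(_._2 == n))
  val ready = mutable.Queue(nodes.filter(indegree(_) == 0): _*)
  val order = mutable.ArrayBuffer.empty[String]
  while (ready.nonEmpty) {
    val n = ready.dequeue()
    order += n
    for ((src, dst) <- edges if src == n) {
      indegree(dst) -= 1
      if (indegree(dst) == 0) ready.enqueue(dst)
    }
  }
  // If some node was never released, the edge list contains a cycle,
  // which a *DAG*Pipeline must reject at construction time.
  require(order.length == nodes.length, "pipeline graph contains a cycle")
  order.toSeq
}
```

For the example graph this yields an order in which "cv" follows "i1", "idf" follows "cv", and "j1" follows both "idf" and "bi", which is exactly the constraint fit/transform would need.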
[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Dreibelbis updated SPARK-22951:
---------------------------------------
    Summary: count() after dropDuplicates() on emptyDataFrame returns incorrect value  (was: count() after dropDuplicates() on emptyDataFrame() returns incorrect value)

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> -------------------------------------------------------------------------
>
>                 Key: SPARK-22951
>                 URL: https://issues.apache.org/jira/browse/SPARK-22951
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2
>            Reporter: Michael Dreibelbis
>
> here is a minimal Spark Application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
>
> object DropDupesApp extends App {
>
>   override def main(args: Array[String]): Unit = {
>     val conf = new SparkConf()
>       .setAppName("test")
>       .setMaster("local")
>     val sc = new SparkContext(conf)
>     val sql = SQLContext.getOrCreate(sc)
>
>     assert(sql.emptyDataFrame.count == 0) // expected
>     assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Created] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame() returns incorrect value
Michael Dreibelbis created SPARK-22951:
---------------------------------------

             Summary: count() after dropDuplicates() on emptyDataFrame() returns incorrect value
                 Key: SPARK-22951
                 URL: https://issues.apache.org/jira/browse/SPARK-22951
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.2
            Reporter: Michael Dreibelbis

here is a minimal Spark Application to reproduce:

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object DropDupesApp extends App {

  override def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("test")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val sql = SQLContext.getOrCreate(sc)

    assert(sql.emptyDataFrame.count == 0) // expected
    assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
  }

}
{code}
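For context, one plausible reading of the bug (not confirmed anywhere in this thread): `dropDuplicates` with no explicit columns deduplicates on all columns, and `emptyDataFrame` has zero columns, so the plan degenerates into an aggregate with an empty grouping list, which, like `SELECT COUNT(*)` with no `GROUP BY`, always produces exactly one row. A plain-Scala model of the *intended* semantics shows why the expected count is 0:

```scala
// Model dropDuplicates as "group rows by the selected key columns and
// keep one representative row per group". Grouping zero rows yields
// zero groups, so counting after deduplicating an empty dataset should
// give 0 -- even when the key-column list is itself empty.
def dropDuplicates(
    rows: Seq[Map[String, Any]],
    keyCols: Seq[String]): Seq[Map[String, Any]] =
  rows.groupBy(row => keyCols.map(row.get)).values.map(_.head).toSeq
```

Note that on a *non-empty* dataset an empty key list collapses everything into one row, which is consistent behavior; only the empty-input case should produce zero rows.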