[jira] [Created] (SPARK-24656) SparkML Transformers and Estimators with multiple columns

2018-06-25 Thread Michael Dreibelbis (JIRA)
Michael Dreibelbis created SPARK-24656:
--

 Summary: SparkML Transformers and Estimators with multiple columns
 Key: SPARK-24656
 URL: https://issues.apache.org/jira/browse/SPARK-24656
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 2.3.1
Reporter: Michael Dreibelbis


Currently SparkML Transformers and Estimators operate on single input/output
column pairs. This makes pipelines extremely cumbersome (as well as
non-performant) when transformations need to be made on multiple columns.

 

I am proposing to implement ParallelPipelineStage/Transformer/Estimator/Model 
that would operate on the input columns in parallel.

 
{code:java}
// old way
val pipeline = new Pipeline().setStages(Array(
  new CountVectorizer().setInputCol("_1").setOutputCol("_1_cv"),
  new CountVectorizer().setInputCol("_2").setOutputCol("_2_cv"),
  new IDF().setInputCol("_1_cv").setOutputCol("_1_idf"),
  new IDF().setInputCol("_2_cv").setOutputCol("_2_idf")
))

// proposed way
val pipeline2 = new Pipeline().setStages(Array(
  new ParallelCountVectorizer()
    .setInputCols(Array("_1", "_2"))
    .setOutputCols(Array("_1_cv", "_2_cv")),
  new ParallelIDF()
    .setInputCols(Array("_1_cv", "_2_cv"))
    .setOutputCols(Array("_1_idf", "_2_idf"))
))
{code}
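For illustration, a minimal sketch of what the Transformer side of this could look like, layered over the existing single-column stages. All names below are hypothetical (none of them exist in Spark ML today), the sketch applies the per-column stages sequentially rather than truly in parallel, and Estimators such as CountVectorizer would need an analogous fan-out on the fit() side:

{code:java}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: fans a single-column Transformer factory out over
// aligned (inputCol, outputCol) pairs. Not a real Spark ML class.
class MultiColumnTransformer(
    override val uid: String,
    inputCols: Array[String],
    outputCols: Array[String],
    mkStage: (String, String) => Transformer) extends Transformer {

  require(inputCols.length == outputCols.length,
    "inputCols and outputCols must be the same length")

  // one underlying single-column stage per column pair
  private val stages: Array[Transformer] =
    inputCols.zip(outputCols).map { case (in, out) => mkStage(in, out) }

  override def transform(dataset: Dataset[_]): DataFrame =
    stages.foldLeft(dataset.toDF) { (df, stage) => stage.transform(df) }

  override def transformSchema(schema: StructType): StructType =
    stages.foldLeft(schema) { (s, stage) => stage.transformSchema(s) }

  // sketch only: extra params are ignored for brevity
  override def copy(extra: ParamMap): MultiColumnTransformer =
    new MultiColumnTransformer(uid, inputCols, outputCols, mkStage)
}

// usage sketch with Binarizer (a Transformer)
val multiBin = new MultiColumnTransformer("multi_bin",
  Array("_1", "_2"), Array("_1_bin", "_2_bin"),
  (in, out) => new Binarizer().setInputCol(in).setOutputCol(out))
{code}

The ergonomic win is that one such stage replaces N near-identical single-column stages; an optimized implementation could additionally fuse or parallelize the per-column passes instead of folding them one after another.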






[jira] [Created] (SPARK-24597) Spark ML Pipeline Should support non-linear models => DAGPipeline

2018-06-19 Thread Michael Dreibelbis (JIRA)
Michael Dreibelbis created SPARK-24597:
--

 Summary: Spark ML Pipeline Should support non-linear models => DAGPipeline
 Key: SPARK-24597
 URL: https://issues.apache.org/jira/browse/SPARK-24597
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.3.1
Reporter: Michael Dreibelbis


Currently SparkML Pipeline/PipelineModel supports only a single linear chain of
dataset transformations, despite the documentation stating otherwise:

[reference documentation|https://spark.apache.org/docs/2.3.0/ml-pipeline.html#details]

I'm proposing to implement a DAGPipeline that supports multiple datasets as
input. The code could look something like this:

 
{code:java}
val ds1 = /*dataset 1 creation*/
val ds2 = /*dataset 2 creation*/

// nodes take on the uid of their estimator/transformer
val i1 = IdentityNode(new IdentityTransformer("i1"))
val i2 = IdentityNode(new IdentityTransformer("i2"))
val bi = TransformerNode(new Binarizer("bi"))
val cv = EstimatorNode(new CountVectorizer("cv"))
val idf = EstimatorNode(new IDF("idf"))
val j1 = JoinerNode(new Joiner("j1"))
// j1 must be registered as a node so its incoming edges resolve
val nodes = Array(i1, i2, bi, cv, idf, j1)
val edges = Array(
  ("i1", "cv"), ("cv", "idf"), ("idf", "j1"),
  ("i2", "bi"), ("bi", "j1"))
val p = new DAGPipeline(nodes, edges)
  .setIdentity("i1", ds1)
  .setIdentity("i2", ds2)
val m = p.fit(spark.emptyDataFrame)
m.setIdentity("i1", ds1).setIdentity("i2", ds2)
m.transform(spark.emptyDataFrame)
{code}
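As a sketch of the scheduling such a DAGPipeline would need (nothing below is Spark API; the helper is an assumption for illustration), fit() and transform() would have to visit the nodes in topological order of the edge list, e.g. via Kahn's algorithm:

{code:java}
import scala.collection.mutable

// Hypothetical helper: orders stage uids so every node runs after all of
// its upstream dependencies; fails fast if the graph has a cycle.
def topoOrder(nodes: Seq[String], edges: Seq[(String, String)]): Seq[String] = {
  val inDegree = mutable.Map(nodes.map(_ -> 0): _*)
  edges.foreach { case (_, to) => inDegree(to) += 1 }
  val ready = mutable.Queue(nodes.filter(inDegree(_) == 0): _*)
  val order = mutable.ArrayBuffer.empty[String]
  while (ready.nonEmpty) {
    val n = ready.dequeue()
    order += n
    edges.collect { case (`n`, to) => to }.foreach { to =>
      inDegree(to) -= 1
      if (inDegree(to) == 0) ready.enqueue(to)
    }
  }
  require(order.size == nodes.size, "pipeline graph contains a cycle")
  order
}

// topoOrder(Seq("i1", "i2", "bi", "cv", "idf", "j1"), edges)
// returns Seq("i1", "i2", "cv", "bi", "idf", "j1") for the edges above
{code}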



[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-03 Thread Michael Dreibelbis (JIRA)


Michael Dreibelbis updated SPARK-22951:
---
Summary: count() after dropDuplicates() on emptyDataFrame returns incorrect 
value  (was: count() after dropDuplicates() on emptyDataFrame() returns 
incorrect value)

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Michael Dreibelbis
>
> Here is a minimal Spark application to reproduce the issue:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
>
> object DropDupesApp {
>
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf()
>       .setAppName("test")
>       .setMaster("local")
>     val sc = new SparkContext(conf)
>     val sql = SQLContext.getOrCreate(sc)
>     assert(sql.emptyDataFrame.count == 0) // expected
>     assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected: should also be 0
>   }
> }
> {code}






[jira] [Created] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame() returns incorrect value

2018-01-03 Thread Michael Dreibelbis (JIRA)
Michael Dreibelbis created SPARK-22951:
--

 Summary: count() after dropDuplicates() on emptyDataFrame() returns incorrect value
 Key: SPARK-22951
 URL: https://issues.apache.org/jira/browse/SPARK-22951
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
Reporter: Michael Dreibelbis


Here is a minimal Spark application to reproduce the issue:

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object DropDupesApp {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("test")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val sql = SQLContext.getOrCreate(sc)
    assert(sql.emptyDataFrame.count == 0) // expected
    assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected: should also be 0
  }
}
{code}
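A plausible explanation, though not confirmed here: with no columns to compare, dropDuplicates() is planned as an aggregate with zero grouping keys, and a global aggregate always emits exactly one row. Until the underlying behavior is fixed, a workaround sketch (safeDropDuplicates is a hypothetical helper, not a Spark API):

{code}
import org.apache.spark.sql.DataFrame

// skip deduplication entirely when there are no columns to deduplicate on
def safeDropDuplicates(df: DataFrame): DataFrame =
  if (df.columns.isEmpty) df else df.dropDuplicates()

// safeDropDuplicates(sql.emptyDataFrame).count == 0
{code}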


