[jira] [Commented] (SPARK-8521) Feature Transformers in 1.5

2015-06-21 Thread Jao Rabary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595432#comment-14595432
 ] 

Jao Rabary commented on SPARK-8521:
---

I suggest adding a PCA transformer. The algorithm is already available in 
mllib; a sketch of a possible wrapper follows.
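
For illustration, here is a minimal sketch of how such a wrapper could reuse 
mllib.feature.PCA for the actual computation. The helper name, the column 
handling, and the absence of Params plumbing are assumptions for the sketch, 
not a proposed design:

{code}
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Sketch only: fit mllib's PCA on a DataFrame vector column and project
// each vector onto the top k principal components. A real spark.ml stage
// would be an Estimator/Model pair declaring k as a Param.
def fitAndProject(df: DataFrame, inputCol: String, k: Int): RDD[Vector] = {
  val vectors: RDD[Vector] = df.select(inputCol).map(_.getAs[Vector](0))
  val model = new PCA(k).fit(vectors)
  vectors.map(model.transform)
}
{code}

Promoting this to a proper ml.feature.PCA Estimator/Model pair would mostly be 
Params and schema plumbing on top of the same mllib call.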

> Feature Transformers in 1.5
> ---
>
> Key: SPARK-8521
> URL: https://issues.apache.org/jira/browse/SPARK-8521
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is a list of feature transformers we plan to add in Spark 1.5. Feel free 
> to propose useful transformers that are not on the list.






[jira] [Comment Edited] (SPARK-3702) Standardize MLlib classes for learners, models

2015-05-13 Thread Jao Rabary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541864#comment-14541864
 ] 

Jao Rabary edited comment on SPARK-3702 at 5/13/15 12:51 PM:
-

Are unsupervised learning algorithms also covered by this standardization? I 
would like to use algorithms such as k-means with ml pipelines. How can one get 
started with that?
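
As of this writing there is no spark.ml clustering wrapper, but as a stopgap, 
here is a minimal sketch of driving mllib's KMeans from a DataFrame column. The 
helper and the column handling are assumptions; a proper pipeline stage would 
be an Estimator with Params for k and maxIterations:

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Sketch only: train k-means on a vector column and assign a cluster id
// to each row. Wrapping this as an Estimator/Model would let it sit in a
// Pipeline next to supervised stages.
def clusterColumn(df: DataFrame, featuresCol: String, k: Int): RDD[Int] = {
  val vectors = df.select(featuresCol).map(_.getAs[Vector](0)).cache()
  val model = KMeans.train(vectors, k, 20) // at most 20 iterations
  vectors.map(model.predict)
}
{code}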


was (Author: rajao):
Are unsupervised learning algorithm also concerned with this standardization ? 
I would like to use algorithm such as kmeans with ml pipelines. How can one get 
started with that ?

> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]




[jira] [Commented] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT

2015-03-31 Thread Jao Rabary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388646#comment-14388646
 ] 

Jao Rabary commented on SPARK-5532:
---

I get the same problem with a DataFrame created with 
sqlContext.createDataFrame. Is this a related issue? For example, with the 
following code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

object TestDataFrame {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("RankingEval").setMaster("local[4]")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    import sqlContext.implicits._

    // One row whose "features" field is a VectorUDT column
    val data = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.zeros(10))))
    val dataDF = data.toDF

    dataDF.printSchema()
    // Saving the original DataFrame works:
    //dataDF.save("test1.parquet")

    // Rebuild the DataFrame from its own RDD[Row] and schema
    val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)

    dataDF2.printSchema()

    // Fails with the same ClassCastException on the vector column
    dataDF2.saveAsParquetFile("test3.parquet")
  }
}

> Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
> 
>
> Key: SPARK-5532
> URL: https://issues.apache.org/jira/browse/SPARK-5532
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.3.0
>
>
> Deterministic failure:
> {code}
> import org.apache.spark.mllib.linalg._
> import org.apache.spark.sql.SQLContext
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val data = sc.parallelize(Seq((1.0, 
> Vectors.dense(1, 2, 3)))).toDataFrame("label", "features")
> data.repartition(1).saveAsParquetFile("blah")
> {code}
> If you remove the repartition, then this succeeds.
> Here's the stack trace:
> {code}
> 15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, 
> 192.168.1.230): java.lang.ClassCastException: 
> org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
> org.apache.spark.sql.Row
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:64)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 7, 192.168.1.230): java.lang.ClassCastException: 
> org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
> org.apache.spark.sql.Row
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(P

[jira] [Commented] (SPARK-3530) Pipeline and Parameters

2015-03-30 Thread Jao Rabary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386747#comment-14386747
 ] 

Jao Rabary commented on SPARK-3530:
---

Yes, the scenario is to instantiate a pre-trained Caffe network. The problem 
with broadcasting is that I use a JNI binding of Caffe, and Spark isn't able to 
serialize the object.
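
A common workaround for non-serializable native handles is to never ship the 
object at all: each executor builds it lazily, once per JVM, and tasks reach it 
only inside mapPartitions. A minimal sketch, assuming a hypothetical CaffeNet 
JNI wrapper and loader (none of these names come from the real binding):

{code}
import org.apache.spark.rdd.RDD

// Hypothetical JNI wrapper standing in for the real Caffe binding.
trait CaffeNet { def forward(batch: Array[Array[Float]]): Array[Array[Float]] }

// One network per executor JVM: the lazy val is initialized on first use on
// each worker, so the non-serializable handle never crosses the wire.
object CaffeNetHolder {
  lazy val net: CaffeNet = loadPretrainedNet("/path/to/model.caffemodel")
  private def loadPretrainedNet(path: String): CaffeNet =
    ??? // the actual JNI load call would go here
}

def extractFeatures(images: RDD[Array[Float]]): RDD[Array[Float]] =
  images.mapPartitions { rows =>
    val batch = rows.toArray                     // feed the partition as one batch
    CaffeNetHolder.net.forward(batch).iterator   // single instance per JVM
  }
{code}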



> Pipeline and Parameters
> ---
>
> Key: SPARK-3530
> URL: https://issues.apache.org/jira/browse/SPARK-3530
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.2.0
>
>
> This part of the design doc is for pipelines and parameters. I put the design 
> doc at
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> I will copy the proposed interfaces to this JIRA later. Some sample code can 
> be viewed at: https://github.com/mengxr/spark-ml/
> Please help review the design and post your comments here. Thanks!






[jira] [Comment Edited] (SPARK-3530) Pipeline and Parameters

2015-02-27 Thread Jao Rabary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339899#comment-14339899
 ] 

Jao Rabary edited comment on SPARK-3530 at 2/27/15 8:54 AM:


Some questions after playing a little with the new ml.Pipeline.

We mainly do large-scale computer vision tasks (image classification, 
retrieval, ...). The pipeline is really great for that. We're trying to 
reproduce the tutorial given on that topic during the latest Spark Summit 
(http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html) 
using the master version of the Spark pipeline and DataFrame APIs. The tutorial 
shows different examples of feature extraction stages before running machine 
learning algorithms. Even though the tutorial is straightforward to reproduce 
with this new API, we still have some questions:

- Can one use external tools (e.g. via pipe) as a pipeline stage? An example 
use case is extracting features learned with a convolutional neural network. In 
our case, this corresponds to a pre-trained neural network built with the Caffe 
library (http://caffe.berkeleyvision.org/).

- The second question is about the performance of the pipeline. Libraries such 
as Caffe process data in batches, and instantiating a Caffe network can be time 
consuming when the network is very deep. So we can gain performance if we 
minimize the number of Caffe network creations and feed data to the network in 
batches. In the pipeline, this corresponds to running transformers that work on 
a per-partition basis and give the whole partition to a single Caffe network 
(see the sketch below). How can we create such a transformer?
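
For the second question, here is a minimal sketch of a partition-wise feature 
extractor, again assuming a hypothetical CaffeNet wrapper held once per 
executor JVM; a real spark.ml stage would extend Transformer and also declare a 
uid, Params, and transformSchema:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Hypothetical JNI wrapper, loaded lazily once per executor JVM.
trait CaffeNet { def forward(batch: Array[Array[Float]]): Array[Array[Float]] }
object CaffeNetHolder { lazy val net: CaffeNet = ??? /* JNI load goes here */ }

// Sketch only: one network instance and one forward pass per partition,
// instead of one per row. Assumes the input column holds Array[Float] images.
def transformPartitionwise(df: DataFrame, inputCol: String): RDD[Array[Float]] =
  df.select(inputCol).mapPartitions { rows =>
    val batch = rows.map(_.getAs[Seq[Float]](0).toArray).toArray
    CaffeNetHolder.net.forward(batch).iterator
  }
{code}

Feeding the whole partition as one batch amortizes both network construction 
and per-call JNI overhead.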


was (Author: rajao):
Some questions after playing a little with the new ml.Pipeline.

We mainly do large scale computer vision task (image classification, retreival, 
...). The pipeline is really great stuff for that. We're trying to reproduce 
the tutorial given on that topic during the latest spark summit ( 
http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html
 ) using the master version of spark pipeline and dataframe.  The tutorial 
shows different examples of feature extraction stages before running machine 
learning algorithms. Even the tutorial is straightforward to reproduce with 
this new API, we still have some questions :

- Can one use external tools (e.g via pipe) as a pipeline stage ?  An example 
of use case is to extract feature learned with convolutional neural network. In 
our case, this corresponds to a pre-trained neural network with Caffe library 
(http://caffe.berkeleyvision.org/) . 

- The second question is about the performance of the pipeline.  Library such 
as Caffe processes the data in batch and instancing one Caffe network can be 
time consuming when this network is very deep. So, we can gain performance if 
we minimize the number of Caffe network creation and give data in batch to the 
network. In the pipeline, this corresponds to run transformers that work on a 
partition basis and give the whole partition to a single caffe network. How can 
we create such a transformer ?

> Pipeline and Parameters
> ---
>
> Key: SPARK-3530
> URL: https://issues.apache.org/jira/browse/SPARK-3530
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.2.0
>
>
> This part of the design doc is for pipelines and parameters. I put the design 
> doc at
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> I will copy the proposed interfaces to this JIRA later. Some sample code can 
> be viewed at: https://github.com/mengxr/spark-ml/
> Please help review the design and post your comments here. Thanks!




[jira] [Created] (SPARK-5924) Add the ability to specify withMean or withStd parameters with StandardScaler

2015-02-20 Thread Jao Rabary (JIRA)
Jao Rabary created SPARK-5924:
-

 Summary: Add the ability to specify withMean or withStd parameters 
with StandardScaler
 Key: SPARK-5924
 URL: https://issues.apache.org/jira/browse/SPARK-5924
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Jao Rabary
Priority: Trivial


The current ml.feature.StandardScaler implementation calls the 
mllib.feature.StandardScaler default constructor directly, without offering the 
ability to specify the withMean or withStd parameters (see the sketch below).
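
For reference, mllib's scaler already exposes both flags through its 
constructor, so the wrapper only needs to forward them. A minimal sketch, 
assuming the wrapper gains two boolean params (the helper name is 
hypothetical):

{code}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// mllib already accepts both flags:
//   new StandardScaler(withMean: Boolean, withStd: Boolean)
// The ml wrapper currently does the equivalent of `new StandardScaler()`
// (i.e. withMean = false, withStd = true). Forwarding user-set values
// inside the wrapper's fit() could look like:
def fitScaler(data: RDD[Vector], withMean: Boolean, withStd: Boolean) =
  new StandardScaler(withMean, withStd).fit(data)
{code}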


