[jira] [Commented] (SPARK-8521) Feature Transformers in 1.5
[ https://issues.apache.org/jira/browse/SPARK-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595432#comment-14595432 ] Jao Rabary commented on SPARK-8521:
---
I suggest adding a PCA transformer. It's already in mllib.

> Feature Transformers in 1.5
> ---------------------------
>
> Key: SPARK-8521
> URL: https://issues.apache.org/jira/browse/SPARK-8521
> Project: Spark
> Issue Type: Umbrella
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
>
> This is a list of feature transformers we plan to add in Spark 1.5. Feel free
> to propose useful transformers that are not on the list.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
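For reference, a pipeline-style PCA stage could look roughly like the sketch below, assuming it wraps the existing mllib PCA as an Estimator/Model pair following the ml package conventions (this is essentially how org.apache.spark.ml.feature.PCA eventually shipped in 1.5; the column names and `df` are illustrative):

```scala
import org.apache.spark.ml.feature.PCA

// Fit a PCA model that projects the "features" vector column down to 3 components.
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)

val pcaModel = pca.fit(df)              // df: an existing DataFrame with a vector column
val projected = pcaModel.transform(df)  // adds the "pcaFeatures" column
```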
[jira] [Comment Edited] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541864#comment-14541864 ] Jao Rabary edited comment on SPARK-3702 at 5/13/15 12:51 PM:
---
Are unsupervised learning algorithms also concerned by this standardization? I would like to use algorithms such as k-means with ml pipelines. How can one get started with that?

was (Author: rajao):
Are unsupervised learning algorithm also concerned with this standardization ? I would like to use algorithm such as kmeans with ml pipelines. How can one get started with that ?

> Standardize MLlib classes for learners, models
> ----------------------------------------------
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks
> of subtasks). See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy |
> https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541864#comment-14541864 ] Jao Rabary commented on SPARK-3702:
---
Are unsupervised learning algorithms also concerned by this standardization? I would like to use algorithms such as k-means with ml pipelines. How can one get started with that?

> Standardize MLlib classes for learners, models
> ----------------------------------------------
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks
> of subtasks). See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy |
> https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
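Clustering fits the same Estimator/Model mold as the supervised learners; a pipeline-compatible k-means (one later landed as org.apache.spark.ml.clustering.KMeans in Spark 1.5) could be used roughly as in this sketch, where the column names and `df` are illustrative assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Assemble raw numeric columns into a feature vector, then cluster it.
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")

val kmeans = new KMeans()
  .setK(2)
  .setFeaturesCol("features")
  .setPredictionCol("cluster")

val pipeline = new Pipeline().setStages(Array(assembler, kmeans))
val model = pipeline.fit(df)         // df: an existing DataFrame with x, y columns
val clustered = model.transform(df)  // adds a "cluster" prediction column
```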
[jira] [Commented] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388646#comment-14388646 ] Jao Rabary commented on SPARK-5532:
---
I get the same problem with a DataFrame created with sqlContext.createDataFrame. Is this a related issue? For example, with the following code:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext

object TestDataFrame {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RankingEval").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
    val dataDF = data.toDF
    dataDF.printSchema()
    //dataDF.save("test1.parquet")
    val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)
    dataDF2.printSchema()
    dataDF2.saveAsParquetFile("test3.parquet")
  }
}
{code}

> Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
> ------------------------------------------------------------------------
>
> Key: SPARK-5532
> URL: https://issues.apache.org/jira/browse/SPARK-5532
> Project: Spark
> Issue Type: Bug
> Components: MLlib, SQL
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Assignee: Michael Armbrust
> Priority: Critical
> Fix For: 1.3.0
>
> Deterministic failure:
> {code}
> import org.apache.spark.mllib.linalg._
> import org.apache.spark.sql.SQLContext
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val data = sc.parallelize(Seq((1.0,
> Vectors.dense(1,2,3)))).toDataFrame("label", "features")
> data.repartition(1).saveAsParquetFile("blah")
> {code}
> If you remove the repartition, then this succeeds.
> Here's the stack trace:
> {code}
> 15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4,
> 192.168.1.230): java.lang.ClassCastException:
> org.apache.spark.mllib.linalg.DenseVector cannot be cast to
> org.apache.spark.sql.Row
> at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
> at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
> at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
> at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
> at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
> at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
> at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times;
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0
> (TID 7, 192.168.1.230): java.lang.ClassCastException:
> org.apache.spark.mllib.linalg.DenseVector cannot be cast to
> org.apache.spark.sql.Row
> at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
> at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
> at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
> at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
> at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
> at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
> at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(P
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386747#comment-14386747 ] Jao Rabary commented on SPARK-3530:
---
Yes, the scenario is to instantiate a pre-trained Caffe network. The problem with the broadcast is that I use a JNI binding of Caffe, and Spark isn't able to serialize the object.

> Pipeline and Parameters
> -----------------------
>
> Key: SPARK-3530
> URL: https://issues.apache.org/jira/browse/SPARK-3530
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Critical
> Fix For: 1.2.0
>
> This part of the design doc is for pipelines and parameters. I put the design
> doc at
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> I will copy the proposed interfaces to this JIRA later. Some sample code can
> be viewed at: https://github.com/mengxr/spark-ml/
> Please help review the design and post your comments here. Thanks!
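A common workaround for a non-serializable JNI handle is to skip the broadcast entirely and construct the object lazily on each executor. A minimal sketch, in which `CaffeNet` and the model path are hypothetical stand-ins for the real binding:

```scala
// Hypothetical stand-in for the JNI binding (not the real Caffe API).
class CaffeNet(modelPath: String) {
  def forward(input: Array[Float]): Array[Float] = ??? // native call via JNI
}

// One network per executor JVM: a Scala `object` is never serialized, and
// its lazy val is initialized on first use inside a task on each worker,
// so the native handle never crosses the serialization boundary.
object CaffeNetHolder {
  lazy val net = new CaffeNet("/path/to/model.caffemodel")
}

// Only the closure is shipped; the network is built worker-side:
// featureRDD.map(v => CaffeNetHolder.net.forward(v))
```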
[jira] [Comment Edited] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339899#comment-14339899 ] Jao Rabary edited comment on SPARK-3530 at 2/27/15 8:54 AM:
---
Some questions after playing a little with the new ml.Pipeline. We mainly do large-scale computer vision tasks (image classification, retrieval, ...), and the pipeline is really great stuff for that. We're trying to reproduce the tutorial given on that topic during the latest Spark Summit ( http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html ) using the master version of Spark's pipeline and dataframe. The tutorial shows different examples of feature extraction stages before running machine learning algorithms. Even though the tutorial is straightforward to reproduce with this new API, we still have some questions:
- Can one use external tools (e.g., via pipe) as a pipeline stage? An example use case is to extract features learned with a convolutional neural network. In our case, this corresponds to a network pre-trained with the Caffe library (http://caffe.berkeleyvision.org/).
- The second question is about the performance of the pipeline. Libraries such as Caffe process data in batches, and instantiating one Caffe network can be time-consuming when the network is very deep. So we can gain performance if we minimize the number of Caffe network creations and feed data to the network in batches. In the pipeline, this corresponds to running transformers that work on a per-partition basis and give the whole partition to a single Caffe network. How can we create such a transformer?

was (Author: rajao):
Some questions after playing a little with the new ml.Pipeline. We mainly do large scale computer vision task (image classification, retreival, ...). The pipeline is really great stuff for that. We're trying to reproduce the tutorial given on that topic during the latest spark summit ( http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html ) using the master version of spark pipeline and dataframe. The tutorial shows different examples of feature extraction stages before running machine learning algorithms. Even the tutorial is straightforward to reproduce with this new API, we still have some questions :
- Can one use external tools (e.g via pipe) as a pipeline stage ? An example of use case is to extract feature learned with convolutional neural network. In our case, this corresponds to a pre-trained neural network with Caffe library (http://caffe.berkeleyvision.org/) .
- The second question is about the performance of the pipeline. Library such as Caffe processes the data in batch and instancing one Caffe network can be time consuming when this network is very deep. So, we can gain performance if we minimize the number of Caffe network creation and give data in batch to the network. In the pipeline, this corresponds to run transformers that work on a partition basis and give the whole partition to a single caffe network. How can we create such a transformer ?

> Pipeline and Parameters
> -----------------------
>
> Key: SPARK-3530
> URL: https://issues.apache.org/jira/browse/SPARK-3530
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Critical
> Fix For: 1.2.0
>
> This part of the design doc is for pipelines and parameters. I put the design
> doc at
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> I will copy the proposed interfaces to this JIRA later. Some sample code can
> be viewed at: https://github.com/mengxr/spark-ml/
> Please help review the design and post your comments here. Thanks!
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339899#comment-14339899 ] Jao Rabary commented on SPARK-3530:
---
Some questions after playing a little with the new ml.Pipeline. We mainly do large-scale computer vision tasks (image classification, retrieval, ...), and the pipeline is really great stuff for that. We're trying to reproduce the tutorial given on that topic during the latest Spark Summit ( http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html ) using the master version of Spark's pipeline and dataframe. The tutorial shows different examples of feature extraction stages before running machine learning algorithms. Even though the tutorial is straightforward to reproduce with this new API, we still have some questions:
- Can one use external tools (e.g., via pipe) as a pipeline stage? An example use case is to extract features learned with a convolutional neural network. In our case, this corresponds to a network pre-trained with the Caffe library (http://caffe.berkeleyvision.org/).
- The second question is about the performance of the pipeline. Libraries such as Caffe process data in batches, and instantiating one Caffe network can be time-consuming when the network is very deep. So we can gain performance if we minimize the number of Caffe network creations and feed data to the network in batches. In the pipeline, this corresponds to running transformers that work on a per-partition basis and give the whole partition to a single Caffe network. How can we create such a transformer?

> Pipeline and Parameters
> -----------------------
>
> Key: SPARK-3530
> URL: https://issues.apache.org/jira/browse/SPARK-3530
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Critical
> Fix For: 1.2.0
>
> This part of the design doc is for pipelines and parameters. I put the design
> doc at
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> I will copy the proposed interfaces to this JIRA later. Some sample code can
> be viewed at: https://github.com/mengxr/spark-ml/
> Please help review the design and post your comments here. Thanks!
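On the per-partition question, one way to sketch such a stage against the ml API of that era is to override transform and use mapPartitions, so each partition instantiates exactly one network and is fed through it in batches. This is a sketch only: `CaffeNet`, `forwardBatch`, the "image" column, and the schema handling are hypothetical placeholders, and the Transformer contract (params, uid, schema transformation) is elided:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.StructType

// Partition-wise feature extractor: one native network per partition,
// batched inference over the partition's rows.
class CaffeFeaturizer extends Transformer {
  def transform(dataset: DataFrame): DataFrame = {
    val out = dataset.rdd.mapPartitions { rows =>
      val net = new CaffeNet("/path/to/model.caffemodel") // created once per partition
      rows.grouped(64).flatMap { batch =>                 // feed the network in batches
        net.forwardBatch(batch.map(_.getAs[Array[Float]]("image"))).map(Row(_))
      }
    }
    dataset.sqlContext.createDataFrame(out, transformSchema(dataset.schema))
  }
  def transformSchema(schema: StructType): StructType = schema // placeholder
}
```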
[jira] [Created] (SPARK-5924) Add the ability to specify withMean or withStd parameters with StandardScaler
Jao Rabary created SPARK-5924:
---
Summary: Add the ability to specify withMean or withStd parameters with StandardScaler
Key: SPARK-5924
URL: https://issues.apache.org/jira/browse/SPARK-5924
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Jao Rabary
Priority: Trivial

The current implementation of StandardScaler calls the mllib.feature.StandardScaler default constructor directly, without offering the ability to specify the withMean or withStd parameters.
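The requested API could look like the sketch below, exposing the two flags as pipeline params (setWithMean/setWithStd were later added to org.apache.spark.ml.feature.StandardScaler; the column names and `df` here are illustrative):

```scala
import org.apache.spark.ml.feature.StandardScaler

// Scale the "features" column, controlling centering and scaling explicitly.
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(false) // centering densifies sparse vectors, so it defaults off
  .setWithStd(true)   // scale each feature to unit standard deviation

val scalerModel = scaler.fit(df)       // df: an existing DataFrame with a vector column
val scaled = scalerModel.transform(df)
```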