[jira] [Commented] (SPARK-22974) CountVectorModel does not attach attributes to output column

2019-05-03 Thread yuhao yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832691#comment-16832691
 ] 

yuhao yang commented on SPARK-22974:


On a business trip from April 29th to May 3rd. Please expect a delayed email 
response. Contact +1 669 243 8273 for anything urgent.

Thanks,
Yuhao


> CountVectorModel does not attach attributes to output column
> 
>
> Key: SPARK-22974
> URL: https://issues.apache.org/jira/browse/SPARK-22974
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: William Zhang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> If CountVectorizerModel transforms columns, the output column will not have 
> attributes attached to it. If, later on, those output columns are used in the 
> Interaction transformer, an exception will be thrown:
> {quote}"org.apache.spark.SparkException: Vector attributes must be defined 
> for interaction."
> {quote}
> To reproduce it:
> {code}
> import org.apache.spark.ml.feature._
> import org.apache.spark.sql.functions._
>
> val df = spark.createDataFrame(Seq(
>   (0, Array("a", "b", "c"), Array("1", "2")),
>   (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3"))
> )).toDF("id", "words", "nums")
>
> val cvModel: CountVectorizerModel = new CountVectorizer()
>   .setInputCol("nums")
>   .setOutputCol("features2")
>   .setVocabSize(4)
>   .setMinDF(0)
>   .fit(df)
>
> val cvm = new CountVectorizerModel(Array("a", "b", "c"))
>   .setInputCol("words")
>   .setOutputCol("features1")
>
> val df1 = cvm.transform(df)
> val df2 = cvModel.transform(df1)
>
> val interaction = new Interaction()
>   .setInputCols(Array("features1", "features2"))
>   .setOutputCol("features")
> val df3 = interaction.transform(df2)
> {code}
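A possible workaround while the fix lands, sketched only (withVocabAttributes is a helper name 
introduced here for illustration, not a Spark API; it assumes the model's vocabulary is at hand): 
rebuild the ML attribute metadata on the CountVectorizerModel output column so that Interaction 
finds defined vector attributes.

{code}
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.sql.DataFrame

// Attach one numeric attribute per vocabulary term to the given vector column;
// this is the metadata that Interaction expects to find.
def withVocabAttributes(df: DataFrame, colName: String, vocab: Array[String]): DataFrame = {
  val attrs: Array[Attribute] = vocab.map(term => NumericAttribute.defaultAttr.withName(term): Attribute)
  val group = new AttributeGroup(colName, attrs)
  df.withColumn(colName, df(colName).as(colName, group.toMetadata()))
}

// e.g. val df1WithAttrs = withVocabAttributes(df1, "features1", cvm.vocabulary)
{code}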



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2019-03-13 Thread yuhao yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791938#comment-16791938
 ] 

yuhao yang commented on SPARK-20082:


Yuhao is taking family bonding leave from March 7th to April 19th. Please expect 
a delayed email response. Contact +86 13738085700 for anything urgent.

Thanks,
Yuhao


> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu DESPRIEE
>Priority: Major
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest adding an initialModel as a starting point for LDA.
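For reference, spark.mllib's KMeans already follows this pattern and accepts an initial model; a 
rough sketch of that existing API, which the LDA proposal would mirror (newBatch is an assumed 
RDD[Vector] of new data):

{code}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Existing spark.mllib pattern: seed training with a previously fitted model
// and refine it on new data. The proposal asks for an analogous hook in LDA.
val previous = new KMeansModel(Array(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0)))
val kmeans = new KMeans()
  .setK(2)
  .setInitialModel(previous)  // start from the old cluster centers
// val refreshed = kmeans.run(newBatch)
{code}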



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25011) Add PrefixSpan to __all__ in fpm.py

2018-08-03 Thread yuhao yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-25011:
---
Summary: Add PrefixSpan to __all__ in fpm.py  (was: Add PrefixSpan to 
__all__)

> Add PrefixSpan to __all__ in fpm.py
> ---
>
> Key: SPARK-25011
> URL: https://issues.apache.org/jira/browse/SPARK-25011
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.4.0
>
>
> Add PrefixSpan to __all__ in fpm.py



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25011) Add PrefixSpan to __all__

2018-08-02 Thread yuhao yang (JIRA)
yuhao yang created SPARK-25011:
--

 Summary: Add PrefixSpan to __all__
 Key: SPARK-25011
 URL: https://issues.apache.org/jira/browse/SPARK-25011
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.0
Reporter: yuhao yang


Add PrefixSpan to __all__ in fpm.py



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23742) Filter out redundant AssociationRules

2018-08-01 Thread yuhao yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566326#comment-16566326
 ] 

yuhao yang commented on SPARK-23742:


[~maropu] Can you be more specific about the suggestion? E.g., how would it work 
with the example in the description? Thanks.

> Filter out redundant AssociationRules
> -
>
> Key: SPARK-23742
> URL: https://issues.apache.org/jira/browse/SPARK-23742
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> AssociationRules can generate redundant rules such as:
> * (A) => C
> * (A,B) => C  (redundant)
> It should optionally filter out redundant rules.  It'd be nice to have it 
> optional (but maybe defaulting to filtering) so that users could compare the 
> confidences of more general vs. more specific rules.
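One way to express the optional filtering discussed above, sketched against the ml.fpm 
association-rules schema (antecedent and consequent as array columns). This is only an 
illustration, not the proposed API, and it relies on array_except, which is available from 
Spark 2.4.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Drop a rule such as (A,B) => C when a more general rule (A) => C exists:
// same consequent, and the general antecedent is a strictly smaller subset.
def dropRedundantRules(rules: DataFrame): DataFrame = {
  val general = rules.select(
    col("antecedent").as("genAntecedent"),
    col("consequent").as("genConsequent"))
  rules.join(general,
    col("consequent") === col("genConsequent") &&
      size(col("genAntecedent")) < size(col("antecedent")) &&
      size(array_except(col("genAntecedent"), col("antecedent"))) === 0,
    "left_anti")
}

// e.g. dropRedundantRules(fpGrowthModel.associationRules).show()
{code}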



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23742) Filter out redundant AssociationRules

2018-08-01 Thread yuhao yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564858#comment-16564858
 ] 

yuhao yang commented on SPARK-23742:


The redundant rule may have different confidence and support values from the more general rule.

> Filter out redundant AssociationRules
> -
>
> Key: SPARK-23742
> URL: https://issues.apache.org/jira/browse/SPARK-23742
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> AssociationRules can generate redundant rules such as:
> * (A) => C
> * (A,B) => C  (redundant)
> It should optionally filter out redundant rules.  It'd be nice to have it 
> optional (but maybe defaulting to filtering) so that users could compare the 
> confidences of more general vs. more specific rules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15064) Locale support in StopWordsRemover

2018-06-06 Thread yuhao yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502929#comment-16502929
 ] 

yuhao yang commented on SPARK-15064:


Yuhao will be OOF from May 29th to June 6th (annual leave and conference). 
Please expect a delayed email response. Contact 669 243 8273 for anything urgent.

Regards,
Yuhao


> Locale support in StopWordsRemover
> --
>
> Key: SPARK-15064
> URL: https://issues.apache.org/jira/browse/SPARK-15064
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We support case insensitive filtering (default) in StopWordsRemover. However, 
> case insensitive matching depends on the locale and region, which cannot be 
> explicitly set in StopWordsRemover. We should consider adding this support in 
> MLlib.
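A concrete illustration of why the locale matters for case-insensitive matching (plain JVM string 
behavior, not a StopWordsRemover API): Turkish lowercases "I" to a dotless "ı", so the same filter 
can behave differently depending on the default locale.

{code}
import java.util.Locale

// Lowercasing is locale-sensitive on the JVM:
"TITLE".toLowerCase(Locale.ROOT)                     // "title"
"TITLE".toLowerCase(Locale.forLanguageTag("tr-TR"))  // "tıtle" (dotless ı)
{code}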



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes

2018-01-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328310#comment-16328310
 ] 

yuhao yang commented on SPARK-22943:


Thanks for the reply, yet I cannot see how a user can specify the output 
dimension right now.

> OneHotEncoder supports manual specification of categorySizes
> 
>
> Key: SPARK-22943
> URL: https://issues.apache.org/jira/browse/SPARK-22943
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> OHE should support configurable categorySizes, like n_values in 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html, 
> which allows consistent and predictable conversion.
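A small illustration of the inconsistency being described, based on the Spark 2.2 behavior where 
the legacy OneHotEncoder infers the vector size from the data it is given (spark is an assumed 
SparkSession, e.g. the spark-shell one):

{code}
import org.apache.spark.ml.feature.OneHotEncoder

// Without a configurable size, the output dimension depends on the max index
// present in each dataset, so training and scoring can disagree.
val train = spark.createDataFrame(Seq((0, 0.0), (1, 1.0), (2, 2.0))).toDF("id", "categoryIndex")
val score = spark.createDataFrame(Seq((0, 0.0), (1, 1.0))).toDF("id", "categoryIndex")

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(false)

encoder.transform(train).show()  // vectors of size 3
encoder.transform(score).show()  // vectors of size 2, inconsistent with training
{code}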



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes

2018-01-05 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314412#comment-16314412
 ] 

yuhao yang commented on SPARK-22943:


Feel free to work on this, but I would suggest getting a green light from a 
committer first.

> OneHotEncoder supports manual specification of categorySizes
> 
>
> Key: SPARK-22943
> URL: https://issues.apache.org/jira/browse/SPARK-22943
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> OHE should support configurable categorySizes, like n_values in 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html, 
> which allows consistent and predictable conversion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes

2018-01-02 Thread yuhao yang (JIRA)
yuhao yang created SPARK-22943:
--

 Summary: OneHotEncoder supports manual specification of 
categorySizes
 Key: SPARK-22943
 URL: https://issues.apache.org/jira/browse/SPARK-22943
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


OHE should support configurable categorySizes, like n_values in 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html, 
which allows consistent and predictable conversion.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19053) Supporting multiple evaluation metrics in DataFrame-based API: discussion

2017-12-19 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297887#comment-16297887
 ] 

yuhao yang commented on SPARK-19053:


Plan for further development:

1. Initial API and function parity with ML Evaluators. (This PR)
2. Python API.
3. Function parity with MLlib Metrics.
4. Add requested enhancements such as weight support, per-row metrics, and 
ranking metrics.
5. Reorganize the classification Metrics hierarchy so that BinaryClassificationMetrics 
can support the metrics in MulticlassMetrics (accuracy, recall, etc.).
6. Possibly use it in training summaries.

> Supporting multiple evaluation metrics in DataFrame-based API: discussion
> -
>
> Key: SPARK-19053
> URL: https://issues.apache.org/jira/browse/SPARK-19053
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is to discuss supporting the computation of multiple evaluation 
> metrics efficiently in the DataFrame-based API for MLlib.
> In the RDD-based API, RegressionMetrics and other *Metrics classes support 
> efficient computation of multiple metrics.
> In the DataFrame-based API, there are a few options:
> * model/result summaries (e.g., LogisticRegressionSummary): These currently 
> provide the desired functionality, but they require a model and do not let 
> users compute metrics manually from DataFrames of predictions and true labels.
> * Evaluator classes (e.g., RegressionEvaluator): These only support computing 
> a single metric in one pass over the data, but they do not require a model.
> * new class analogous to Metrics: We could introduce a class analogous to 
> Metrics.  Model/result summaries could use this internally as a replacement 
> for spark.mllib Metrics classes, or they could (maybe) inherit from these 
> classes.
> Thoughts?
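To make the trade-off among the options above concrete, a small sketch (predictions is an assumed 
DataFrame with label and prediction columns): the DataFrame-based evaluator computes one metric 
per evaluate() call, while the RDD-based Metrics class exposes several metrics from a single pass.

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.mllib.evaluation.RegressionMetrics

// Evaluator: one metric per evaluate() call, hence one pass per metric.
val evaluator = new RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction")
val rmse = evaluator.setMetricName("rmse").evaluate(predictions)
val r2   = evaluator.setMetricName("r2").evaluate(predictions)  // second pass over the data

// RDD-based Metrics: several metrics computed from the same summary statistics.
val metrics = new RegressionMetrics(
  predictions.select("prediction", "label").rdd.map(r => (r.getDouble(0), r.getDouble(1))))
val (rmse2, r2FromMetrics) = (metrics.rootMeanSquaredError, metrics.r2)
{code}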



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-02 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275723#comment-16275723
 ] 

yuhao yang commented on SPARK-8418:
---

I second Nick's comments.

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication
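The kind of schema validation such a transformer would need, sketched only (validateInput is a 
name introduced here for illustration, not a Spark API):

{code}
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types._

// Accept a single Double per row, an Array[Double] per row, or an ML Vector.
def validateInput(schema: StructType, inputCol: String): Unit =
  schema(inputCol).dataType match {
    case DoubleType                          => ()  // one value per row
    case ArrayType(DoubleType, _)            => ()  // multiple values per row
    case dt if dt == SQLDataTypes.VectorType => ()  // ml.linalg vector column
    case other =>
      throw new IllegalArgumentException(s"Column $inputCol has unsupported type $other")
  }
{code}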



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22331) Make MLlib string params case-insensitive

2017-11-28 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269169#comment-16269169
 ] 

yuhao yang commented on SPARK-22331:


Thanks for the interest, [~smurakozi]. I tried to support this with 
StringParams (see the related JIRA), but it hasn't received any feedback. 

So feel free to start with other options. 


> Make MLlib string params case-insensitive
> -
>
> Key: SPARK-22331
> URL: https://issues.apache.org/jira/browse/SPARK-22331
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> Some String params in ML are still case-sensitive, as they are checked by 
> ParamValidators.inArray.
> For consistency in user experience, there should be a general guideline on 
> whether String params in Spark MLlib are case-insensitive or not. 
> I'm leaning towards making all String params case-insensitive where possible.
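A minimal sketch of what a case-insensitive check could look like as an alternative to 
ParamValidators.inArray (inArrayIgnoreCase is a name introduced here for illustration):

{code}
import org.apache.spark.ml.param.Param

// Validate against the allowed options ignoring case, instead of exact match.
def inArrayIgnoreCase(allowed: Array[String]): String => Boolean = {
  val lowered = allowed.map(_.toLowerCase)
  value => lowered.contains(value.toLowerCase)
}

// e.g. new Param[String]("myStage", "handleInvalid", "how to handle invalid data",
//                        inArrayIgnoreCase(Array("skip", "error", "keep")))
{code}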



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth

2017-11-20 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259587#comment-16259587
 ] 

yuhao yang commented on SPARK-22427:


I tried with larger-scale data but could not reproduce the issue. [~lyt] Can you 
please provide a reference for your dataset, or some size info? Thanks.

> StackOverFlowError when using FPGrowth
> --
>
> Key: SPARK-22427
> URL: https://issues.apache.org/jira/browse/SPARK-22427
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
> Environment: Centos Linux 3.10.0-327.el7.x86_64
> java 1.8.0.111
> spark 2.2.0
>Reporter: lyt
>
> code part:
> val path = jobConfig.getString("hdfspath")
> val vectordata = sc.sparkContext.textFile(path)
> val finaldata = sc.createDataset(vectordata.map(obj => {
>   obj.split(" ")
> }).filter(arr => arr.length > 0)).toDF("items")
> val fpg = new FPGrowth()
> 
> fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence)
> val train = fpg.fit(finaldata)
> print(train.freqItemsets.count())
> print(train.associationRules.count())
> train.save("/tmp/FPGModel")
> And encountered following exception:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
>   at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
>   at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
>   at DataMining.FPGrowth$.runJob(FPGrowth.scala:116)
>   at DataMining.testFPG$.main(FPGrowth.scala:36)
>   at DataMining.testFPG.main(FPGrowth.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at 

[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth

2017-11-12 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249017#comment-16249017
 ] 

yuhao yang commented on SPARK-22427:


Hi [~lyt], does increasing the stack size resolve your issue? If not, I will look 
into it.

> StackOverFlowError when using FPGrowth
> --
>
> Key: SPARK-22427
> URL: https://issues.apache.org/jira/browse/SPARK-22427
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
> Environment: Centos Linux 3.10.0-327.el7.x86_64
> java 1.8.0.111
> spark 2.2.0
>Reporter: lyt
>
> code part:
> val path = jobConfig.getString("hdfspath")
> val vectordata = sc.sparkContext.textFile(path)
> val finaldata = sc.createDataset(vectordata.map(obj => {
>   obj.split(" ")
> }).filter(arr => arr.length > 0)).toDF("items")
> val fpg = new FPGrowth()
> 
> fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence)
> val train = fpg.fit(finaldata)
> print(train.freqItemsets.count())
> print(train.associationRules.count())
> train.save("/tmp/FPGModel")
> And encountered following exception:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
>   at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
>   at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
>   at DataMining.FPGrowth$.runJob(FPGrowth.scala:116)
>   at DataMining.testFPG$.main(FPGrowth.scala:36)
>   at DataMining.testFPG.main(FPGrowth.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at 

[jira] [Created] (SPARK-22502) OnlineLDAOptimizer variationalTopicInference might be able to handle empty documents

2017-11-12 Thread yuhao yang (JIRA)
yuhao yang created SPARK-22502:
--

 Summary: OnlineLDAOptimizer variationalTopicInference might be 
able to handle empty documents
 Key: SPARK-22502
 URL: https://issues.apache.org/jira/browse/SPARK-22502
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Trivial


Currently we assume OnlineLDAOptimizer.variationalTopicInference cannot take 
empty documents, and we added a few checks during training and inference. Yet I 
tested it, and in my local environment sending empty vectors to 
OnlineLDAOptimizer.variationalTopicInference does not trigger any error.

If this is true, maybe we can remove the extra checks. Please be cautious: compared 
with the gain (some code cleanup and a small performance improvement), 
we do want to avoid a regression.
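A rough sketch of the kind of local check described above (assumes a spark-shell style 
SparkContext named sc and a vocabulary of size 4): include an all-zero count vector, i.e. an 
"empty document", in the corpus and run the online optimizer over it.

{code}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors

val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0, 1.0)),
  (1L, Vectors.sparse(4, Seq())),            // the "empty document"
  (2L, Vectors.dense(0.0, 1.0, 3.0, 0.0))))

val model = new LDA()
  .setK(2)
  .setOptimizer(new OnlineLDAOptimizer())
  .run(corpus)
{code}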



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18755) Add Randomized Grid Search to Spark ML

2017-11-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247870#comment-16247870
 ] 

yuhao yang commented on SPARK-18755:


Thanks for all the interest. 

For anyone who wants to contribute on this item, IMO we need to support the 
randomized grid search function as in sklearn or other popular libraries. The 
initial PR can start with a basic prototype, but it should contain a plan for 
future extension toward function parity. Also, since we add randomized search 
primarily to speed up the tuning process, it's best if we can present some 
benchmarks on a public dataset to demonstrate the effectiveness.

Also cc [~srowen] [~mlnick] [~yanboliang] [~holdenk] to see if anyone has 
bandwidth to shepherd this. I can help review. Thanks.


> Add Randomized Grid Search to Spark ML
> --
>
> Key: SPARK-18755
> URL: https://issues.apache.org/jira/browse/SPARK-18755
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> Randomized Grid Search  implements a randomized search over parameters, where 
> each setting is sampled from a distribution over possible parameter values. 
> This has two main benefits over an exhaustive search:
> 1. A budget can be chosen independent of the number of parameters and 
> possible values.
> 2. Adding parameters that do not influence the performance does not decrease 
> efficiency.
> Randomized grid search usually gives similar results to an exhaustive search, 
> while the run time for randomized search is drastically lower.
> For more background, please refer to:
> sklearn: http://scikit-learn.org/stable/modules/grid_search.html
> http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
> http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
> https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/.
> As I see it, there are two ways to implement this in Spark:
> 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during 
> build. Only 1 new public function is required.
> 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator 
> and RandomizedTrainValidationSplit, which can be complicated since we need to 
> deal with the models.
> I'd prefer option 1 as it's much simpler and more straightforward. We can support 
> randomized grid search with a minimal change.
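To make option 1 concrete, a sketch that emulates a searchRatio outside ParamGridBuilder (the 
searchRatio parameter itself is hypothetical and does not exist in Spark):

{code}
import scala.util.Random
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.ParamGridBuilder

// Build the full grid, then keep a random fraction of the settings.
val lr = new LogisticRegression()
val fullGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.001, 0.01, 0.1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val searchRatio = 0.3
val sampledGrid = Random.shuffle(fullGrid.toSeq)
  .take(math.max(1, (fullGrid.length * searchRatio).toInt))
  .toArray
// sampledGrid can then be passed to CrossValidator.setEstimatorParamMaps(sampledGrid)
{code}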



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth

2017-11-03 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237174#comment-16237174
 ] 

yuhao yang commented on SPARK-22427:


Could you please try increasing the stack size, e.g. with -Xss10m? 
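For reference, one way to apply that suggestion (the 10m value is only illustrative): the executor 
stack size can be set through Spark configuration, while the driver's must be supplied at launch, 
e.g. via spark-submit --driver-java-options "-Xss10m".

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fpgrowth-with-larger-stack")
  .config("spark.executor.extraJavaOptions", "-Xss10m")  // executor JVM stack size
  .getOrCreate()
{code}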

> StackOverFlowError when using FPGrowth
> --
>
> Key: SPARK-22427
> URL: https://issues.apache.org/jira/browse/SPARK-22427
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
> Environment: Centos Linux 3.10.0-327.el7.x86_64
> java 1.8.0.111
> spark 2.2.0
>Reporter: lyt
>Priority: Normal
>
> code part:
> val path = jobConfig.getString("hdfspath")
> val vectordata = sc.sparkContext.textFile(path)
> val finaldata = sc.createDataset(vectordata.map(obj => {
>   obj.split(" ")
> }).filter(arr => arr.length > 0)).toDF("items")
> val fpg = new FPGrowth()
> 
> fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence)
> val train = fpg.fit(finaldata)
> print(train.freqItemsets.count())
> print(train.associationRules.count())
> train.save("/tmp/FPGModel")
> And encountered following exception:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
>   at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
>   at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
>   at DataMining.FPGrowth$.runJob(FPGrowth.scala:116)
>   at DataMining.testFPG$.main(FPGrowth.scala:36)
>   at DataMining.testFPG.main(FPGrowth.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at 

[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-31 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227094#comment-16227094
 ] 

yuhao yang commented on SPARK-13030:


I see. Thanks for the response [~mlnick].

The Estimator is necessary if we want to automatically infer the size.

As for whether to add the extra size param, I guess it will be useful in cases 
where automatic inference should not be used (e.g., sampling before 
training). I would vote for adding it.



> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when number of categories is 
> different between training dataset and test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-31 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226307#comment-16226307
 ] 

yuhao yang commented on SPARK-13030:


Sorry for jumping in so late. I can see there has been a lot of effort.

As far as I understand, making OneHotEncoder an Estimator is essentially to 
fulfill the requirement that we need a consistent dimension and mapping for 
OneHotEncoder during training and prediction. 

To achieve the same target, can we just add an optional numCategory: IntParam 
(or call it size) as a parameter for OneHotEncoder? If set, then all the 
output vectors will have size numCategory. Any index that's out of the 
bounds of numCategory can be resolved by handleInvalid. Comparably, IMO this is 
a much simpler and more robust solution (totally backwards compatible). 
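Purely for illustration, the usage the comment above has in mind might look like this 
(setNumCategory is hypothetical and does not exist in Spark; the handleInvalid semantics are also 
assumed):

{code}
// Hypothetical API, sketched only to clarify the proposal above; none of the
// numCategory-related calls exist in Spark.
// val encoder = new OneHotEncoder()
//   .setInputCol("categoryIndex")
//   .setOutputCol("categoryVec")
//   .setNumCategory(10)          // every output vector has a fixed length of 10
//   .setHandleInvalid("keep")    // indices >= 10 handled by the handleInvalid policy
{code}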


> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when number of categories is 
> different between training dataset and test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22381) Add StringParam that supports valid options

2017-10-28 Thread yuhao yang (JIRA)
yuhao yang created SPARK-22381:
--

 Summary: Add StringParam that supports valid options
 Key: SPARK-22381
 URL: https://issues.apache.org/jira/browse/SPARK-22381
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


While testing https://issues.apache.org/jira/browse/SPARK-22331, I found it 
might be a good idea to include the possible options in a StringParam.

A StringParam extends Param[String] and allows the user to specify the valid options 
as an Array[String] (case-insensitive).

So far it can help achieve three goals:
1. Make the StringParam aware of its possible options and support native 
validation.
2. StringParam can list the supported options when the user inputs an invalid value.
3. Allow automatic unit test coverage for case-insensitive String params.

IMO it also decreases code redundancy.

The StringParam is designed to be completely compatible with the existing 
Param[String], just adding the extra logic for supporting options, which means 
we don't need to convert every Param[String] to StringParam until we feel 
comfortable doing so.
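A minimal sketch of the idea (the StringParam proposed here does not exist in Spark; the shape 
below is only an assumption about what it could look like):

{code}
import org.apache.spark.ml.param.Param

// A Param[String] that knows its valid options and validates them case-insensitively.
class StringParam(parent: String, name: String, doc: String, val options: Array[String])
  extends Param[String](parent, name, doc,
    (value: String) => options.exists(_.equalsIgnoreCase(value))) {

  // Lets error messages list what the user could have passed.
  def supportedOptions: String = options.mkString(", ")
}

// e.g. new StringParam("myStage", "solver", "the solver to use", Array("auto", "normal", "l-bfgs"))
{code}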





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18755) Add Randomized Grid Search to Spark ML

2017-10-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221800#comment-16221800
 ] 

yuhao yang commented on SPARK-18755:


Thanks for sending the update here. 

Feel free to send a PR as you wish. I'm interested in the topic and can help 
with the review. Yet since none of the committers have stopped by here, I guess the 
review process will be very long.

> Add Randomized Grid Search to Spark ML
> --
>
> Key: SPARK-18755
> URL: https://issues.apache.org/jira/browse/SPARK-18755
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> Randomized Grid Search  implements a randomized search over parameters, where 
> each setting is sampled from a distribution over possible parameter values. 
> This has two main benefits over an exhaustive search:
> 1. A budget can be chosen independent of the number of parameters and 
> possible values.
> 2. Adding parameters that do not influence the performance does not decrease 
> efficiency.
> Randomized grid search usually gives similar results to an exhaustive search, 
> while the run time for randomized search is drastically lower.
> For more background, please refer to:
> sklearn: http://scikit-learn.org/stable/modules/grid_search.html
> http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
> http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
> https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/.
> As I see it, there are two ways to implement this in Spark:
> 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during 
> build. Only 1 new public function is required.
> 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator 
> and RandomizedTrainValidationSplit, which can be complicated since we need to 
> deal with the models.
> I'd prefer option 1 as it's much simpler and more straightforward. We can support 
> randomized grid search with a minimal change.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22331) Make MLlib string params case-insensitive

2017-10-23 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215489#comment-16215489
 ] 

yuhao yang commented on SPARK-22331:


Yes, I don't see how the change would break any existing code.

> Make MLlib string params case-insensitive
> -
>
> Key: SPARK-22331
> URL: https://issues.apache.org/jira/browse/SPARK-22331
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> Some String params in ML are still case-sensitive, as they are checked by 
> ParamValidators.inArray.
> For consistency in user experience, there should be a general guideline on 
> whether String params in Spark MLlib are case-insensitive or not. 
> I'm leaning towards making all String params case-insensitive where possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22331) Strength consistency for supporting string params: case-insensitive or not

2017-10-22 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214667#comment-16214667
 ] 

yuhao yang commented on SPARK-22331:


cc [~WeichenXu123]

> Strength consistency for supporting string params: case-insensitive or not
> --
>
> Key: SPARK-22331
> URL: https://issues.apache.org/jira/browse/SPARK-22331
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> Some String params in ML are still case-sensitive, as they are checked by 
> ParamValidators.inArray.
> For consistency in user experience, there should be a general guideline on 
> whether String params in Spark MLlib are case-insensitive or not. 
> I'm leaning towards making all String params case-insensitive where possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22331) Strength consistency for supporting string params: case-insensitive or not

2017-10-22 Thread yuhao yang (JIRA)
yuhao yang created SPARK-22331:
--

 Summary: Strength consistency for supporting string params: 
case-insensitive or not
 Key: SPARK-22331
 URL: https://issues.apache.org/jira/browse/SPARK-22331
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


Some String params in ML are still case-sensitive, as they are checked by 
ParamValidators.inArray.

For consistency in user experience, there should be a general guideline on 
whether String params in Spark MLlib are case-insensitive or not. 

I'm leaning towards making all String params case-insensitive where possible.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208614#comment-16208614
 ] 

yuhao yang commented on SPARK-22289:


Thanks for the reply. I'll start composing a PR.

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207115#comment-16207115
 ] 

yuhao yang commented on SPARK-22289:


cc [~yanboliang] [~dbtsai]

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207063#comment-16207063
 ] 

yuhao yang edited comment on SPARK-22289 at 10/17/17 6:43 AM:
--

Thanks for reporting the issue. It should be a straightforward fix. Yet maybe we 
should cover this better in release QA.

As I see it, there are two ways to support this:
1. Support save/load on LogisticRegressionParams, and also adjust the save/load 
in LogisticRegression and LogisticRegressionModel.

2. Directly support Matrix in Param.jsonEncode, similar to what we have done 
for Vector.

IMO we need to collect opinions before sending a fix. Feel free to suggest other 
options.

I'm leaning towards 2, for simplicity and convenience for other classes. 


was (Author: yuhaoyan):
Thanks for reporting the issue. Should be a straight-forward fix. Yet we should 
not miss this in the Release QA.

Please send response if anyone has already started working on this. Otherwise 
I'll send a fix.
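For illustration of option 2 in the edited comment above only (this is not the actual patch): 
encoding a DenseMatrix with json4s in the same spirit as the existing vector handling in 
Param.jsonEncode.

{code}
import org.apache.spark.ml.linalg.{DenseMatrix, Matrix}
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

def jsonEncodeMatrix(m: Matrix): String = m match {
  case dm: DenseMatrix =>
    compact(render(
      ("type" -> "dense") ~
      ("numRows" -> dm.numRows) ~
      ("numCols" -> dm.numCols) ~
      ("values" -> dm.values.toSeq) ~
      ("isTransposed" -> dm.isTransposed)))
  case other =>
    throw new NotImplementedError(s"jsonEncode is not implemented for ${other.getClass.getName}")
}

// e.g. jsonEncodeMatrix(new DenseMatrix(1, 1, Array(0.0)))
{code}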

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: 

[jira] [Comment Edited] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207063#comment-16207063
 ] 

yuhao yang edited comment on SPARK-22289 at 10/17/17 6:28 AM:
--

Thanks for reporting the issue. This should be a straightforward fix, but we should 
not let it slip through the release QA.

Please respond if anyone has already started working on this. Otherwise I'll 
send a fix.
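
For reference, a minimal sketch (not the actual patch) of how a Matrix-typed Param 
could be JSON-encoded, assuming json4s, which Spark already depends on; the field 
names here are illustrative only:

{code}
import org.apache.spark.ml.linalg.{DenseMatrix, Matrix}
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Record the shape plus the flat value array so the matrix can be rebuilt on load.
def matrixToJson(m: Matrix): String = {
  val dense = m.toDense
  compact(render(
    ("type" -> "dense") ~
    ("numRows" -> dense.numRows) ~
    ("numCols" -> dense.numCols) ~
    ("values" -> dense.values.toSeq) ~
    ("isTransposed" -> dense.isTransposed)))
}

// Example: the 1x1 lower-bound matrix from the report above.
val json = matrixToJson(new DenseMatrix(1, 1, Array(0.0)))
{code}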


was (Author: yuhaoyan):
Thanks for reporting the issue. Should be a straight-forward fix. Yet we should 
not miss this in the Release QA.

Let send response if anyone has already started working on this. Otherwise I'll 
send a fix.

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207063#comment-16207063
 ] 

yuhao yang commented on SPARK-22289:


Thanks for reporting the issue. This should be a straightforward fix, but we should 
not let it slip through the release QA.

Please respond if anyone has already started working on this. Otherwise I'll 
send a fix.

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22195) Add cosine similarity to org.apache.spark.ml.linalg.Vectors

2017-10-06 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193844#comment-16193844
 ] 

yuhao yang edited comment on SPARK-22195 at 10/6/17 7:33 AM:
-

Thanks for the feedback.

I don't see that the existing implementations (RowMatrix or Word2Vec) can fulfill 
the two scenarios:
1. Compute cosine similarity between two arbitrary vectors.
2. Compute cosine similarity between one vector and a group of other vectors 
(usually candidates).

And I'm afraid that not everyone using Spark ML knows how to implement cosine 
similarity.
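
For illustration, a minimal user-side sketch of both scenarios, using only public 
APIs (the proposed Vectors.cosineSimilarity would replace the helper below); the 
candidate ids and example vectors are just placeholders:

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Scenario 1: cosine similarity between two arbitrary vectors.
// The dense zip keeps the sketch short; a sparse-aware dot product would be preferable.
def cosineSimilarity(v1: Vector, v2: Vector): Double = {
  require(v1.size == v2.size, "vectors must have the same dimension")
  val dot = v1.toArray.zip(v2.toArray).map { case (a, b) => a * b }.sum
  val norms = Vectors.norm(v1, 2.0) * Vectors.norm(v2, 2.0)
  if (norms == 0.0) 0.0 else dot / norms
}

// Scenario 2: rank a group of candidate vectors against one query vector.
def rankCandidates(query: Vector, candidates: Seq[(String, Vector)]): Seq[(String, Double)] =
  candidates.map { case (id, v) => (id, cosineSimilarity(query, v)) }
            .sortBy(-_._2)

val sim = cosineSimilarity(Vectors.dense(1.0, 0.0), Vectors.dense(1.0, 1.0))
{code}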



was (Author: yuhaoyan):
Thanks for the feedback.

I don't see the existing implementation (RowMatrix or in Word2Vec) can fulfill 
the two scenarios:
1. Compute cosine similarity between two arbitrary vectors.
2. Compute cosine similarity between one vector and a group of other Vectors 
(usually candidates).

And again, not everyone using Spark ML know how to implement cosine similarity. 


> Add cosine similarity to org.apache.spark.ml.linalg.Vectors
> ---
>
> Key: SPARK-22195
> URL: https://issues.apache.org/jira/browse/SPARK-22195
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> https://en.wikipedia.org/wiki/Cosine_similarity:
> As the most important measure of similarity, I found it quite useful in some 
> image and NLP applications according to personal experience.
> Suggest to add function for cosine similarity in 
> org.apache.spark.ml.linalg.Vectors.
> Interface:
>   def cosineSimilarity(v1: Vector, v2: Vector): Double = ...
>   def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): 
> Double = ...
> Appreciate suggestions and need green light from committers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22195) Add cosine similarity to org.apache.spark.ml.linalg.Vectors

2017-10-05 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193844#comment-16193844
 ] 

yuhao yang commented on SPARK-22195:


Thanks for the feedback.

I don't see that the existing implementations (RowMatrix or Word2Vec) can fulfill 
the two scenarios:
1. Compute cosine similarity between two arbitrary vectors.
2. Compute cosine similarity between one vector and a group of other vectors 
(usually candidates).

And again, not everyone using Spark ML knows how to implement cosine similarity.


> Add cosine similarity to org.apache.spark.ml.linalg.Vectors
> ---
>
> Key: SPARK-22195
> URL: https://issues.apache.org/jira/browse/SPARK-22195
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> https://en.wikipedia.org/wiki/Cosine_similarity:
> As the most important measure of similarity, I found it quite useful in some 
> image and NLP applications according to personal experience.
> Suggest to add function for cosine similarity in 
> org.apache.spark.ml.linalg.Vectors.
> Interface:
>   def cosineSimilarity(v1: Vector, v2: Vector): Double = ...
>   def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): 
> Double = ...
> Appreciate suggestions and need green light from committers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior

2017-10-05 Thread yuhao yang (JIRA)
yuhao yang created SPARK-22210:
--

 Summary: Online LDA variationalTopicInference  should use random 
seed to have stable behavior
 Key: SPARK-22210
 URL: https://issues.apache.org/jira/browse/SPARK-22210
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582

The Gamma distribution should be constructed with an explicit random seed so that the behavior is consistent and reproducible.
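
For illustration, a sketch of the seeded construction, assuming breeze's 
RandBasis.withSeed is available in the breeze version Spark ships with; the shape 
value and sample size are illustrative only:

{code}
import breeze.stats.distributions.{Gamma, RandBasis}

// Passing an explicit, seeded RandBasis (instead of the default global one)
// makes the Gamma samples deterministic for a fixed seed.
val seed = 42
implicit val basis: RandBasis = RandBasis.withSeed(seed)
val gammaShape = 100.0
val gammaSamples = new Gamma(gammaShape, 1.0 / gammaShape).samplesVector(10)
{code}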



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2017-10-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192217#comment-16192217
 ] 

yuhao yang commented on SPARK-3181:
---

Regarding whether to separate Huber loss into an independent Estimator, I 
don't see a direct conflict.

IMO, LinearRegression should act as an all-in-one Estimator that allows users to 
combine whichever loss function, optimizer, and regularization they want to use. It 
should target flexibility and also provide some fundamental infrastructure for 
regression algorithms.

In the meantime, we may also support HuberRegression, RidgeRegression, and 
others as independent Estimators, which is more convenient but less flexible 
(and also allows algorithm-specific parameters). As mentioned by Seth, this would 
require better code abstraction and a plugin interface. Besides 
loss/prediction/optimizer, we also need to provide infrastructure for model 
summary and serialization. This should only happen after we can compose an 
Estimator like HuberRegression without noticeable code duplication.


> Add Robust Regression Algorithm with Huber Estimator
> 
>
> Key: SPARK-3181
> URL: https://issues.apache.org/jira/browse/SPARK-3181
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Fan Jiang
>Assignee: Yanbo Liang
>  Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least square estimates assume the error has normal distribution and 
> can behave badly when the errors are heavy-tailed. In practical we get 
> various types of data. We need to include Robust Regression  to employ a 
> fitting criterion that is not as vulnerable as least square.
> In 1973, Huber introduced M-estimation for regression which stands for 
> "maximum likelihood type". The method is resistant to outliers in the 
> response variable and has been widely used.
> The new feature for MLlib will contain 3 new files
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class HuberRobustGradient in 
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22195) Add cosine similarity to org.apache.spark.ml.linalg.Vectors

2017-10-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190884#comment-16190884
 ] 

yuhao yang commented on SPARK-22195:


Exactly, the implementation is straightforward, but I guess not everyone 
knows about it. I have been asked several times whether Spark supports cosine 
similarity computation and had to explain it each time. I just want to see if 
this is a common requirement.

> Add cosine similarity to org.apache.spark.ml.linalg.Vectors
> ---
>
> Key: SPARK-22195
> URL: https://issues.apache.org/jira/browse/SPARK-22195
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> https://en.wikipedia.org/wiki/Cosine_similarity:
> As the most important measure of similarity, I found it quite useful in some 
> image and NLP applications according to personal experience.
> Suggest to add function for cosine similarity in 
> org.apache.spark.ml.linalg.Vectors.
> Interface:
>   def cosineSimilarity(v1: Vector, v2: Vector): Double = ...
>   def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): 
> Double = ...
> Appreciate suggestions and need green light from committers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22195) Add cosine similarity to org.apache.spark.ml.linalg.Vectors

2017-10-03 Thread yuhao yang (JIRA)
yuhao yang created SPARK-22195:
--

 Summary: Add cosine similarity to 
org.apache.spark.ml.linalg.Vectors
 Key: SPARK-22195
 URL: https://issues.apache.org/jira/browse/SPARK-22195
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


https://en.wikipedia.org/wiki/Cosine_similarity:
As one of the most important similarity measures, I have found it quite useful in some 
image and NLP applications, based on personal experience.

I suggest adding a cosine similarity function to 
org.apache.spark.ml.linalg.Vectors.

Interface:

  def cosineSimilarity(v1: Vector, v2: Vector): Double = ...
  def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): 
Double = ...

Suggestions are appreciated; this needs a green light from committers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-10-03 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190239#comment-16190239
 ] 

yuhao yang commented on SPARK-21866:


My two cents:

1. In most scenarios, deep learning applications use rescaled/cropped images 
(typically 256, 224, or smaller). I would add an extra parameter "smallSideSize" 
to the readImages method; it is more convenient for users, and we would not 
need to cache the image at its original size (which could be 100 times larger than 
the scaled image).

2. I am not sure about the reason for including path info in the image data. In 
my experience, the path serves better as a separate column in the DataFrame.

3. After some augmentation and normalization, the image data will be floating-point 
numbers rather than bytes. That is fine if the current format is only 
for reading the image data, but not if it is meant to be the standard image feature 
exchange format in Spark.

4. I don't see the parameter "recursive" as necessary; existing wildcard 
matching provides more functionality.

Part of the image pre-processing code I used (a little stale) is available at 
https://github.com/hhbyyh/SparkDL, just for reference.



> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general 

[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-08-23 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139023#comment-16139023
 ] 

yuhao yang commented on SPARK-21535:


Thanks for the comments.

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, current implementation 
> consumes extra driver memory for holding the trained models, which is not 
> necessary and often leads to memory exception for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation, 
> thus that used model can be collected by GC, and avoid the unnecessary OOM 
> exceptions.
> E.g. when grid search space is 12, old implementation needs to hold all 12 
> trained models in the driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and previous 
> model can be cleared by GC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-08-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang resolved SPARK-21535.

Resolution: Not A Problem

The new implementation would load the evaluation dataset while training each model and 
may not always deliver better performance. Please refer to the discussion in 
the PR.

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, current implementation 
> consumes extra driver memory for holding the trained models, which is not 
> necessary and often leads to memory exception for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation, 
> thus that used model can be collected by GC, and avoid the unnecessary OOM 
> exceptions.
> E.g. when grid search space is 12, old implementation needs to hold all 12 
> trained models in the driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and previous 
> model can be cleared by GC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-07-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103547#comment-16103547
 ] 

yuhao yang commented on SPARK-21535:


Not in my opinion. https://issues.apache.org/jira/browse/SPARK-21086 is about 
storing all the trained models in the TrainValidationSplitModel or 
CrossValidatorModel, according to the discussion, behind a control parameter 
that is turned off by default. In any case, changing the training process hardly has 
an impact on that.

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, current implementation 
> consumes extra driver memory for holding the trained models, which is not 
> necessary and often leads to memory exception for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation, 
> thus that used model can be collected by GC, and avoid the unnecessary OOM 
> exceptions.
> E.g. when grid search space is 12, old implementation needs to hold all 12 
> trained models in the driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and previous 
> model can be cleared by GC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-07-26 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100860#comment-16100860
 ] 

yuhao yang edited comment on SPARK-21535 at 7/26/17 6:30 PM:
-

https://github.com/apache/spark/pull/18733


was (Author: yuhaoyan):
https://github.com/apache/spark/pulls 

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, current implementation 
> consumes extra driver memory for holding the trained models, which is not 
> necessary and often leads to memory exception for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation, 
> thus that used model can be collected by GC, and avoid the unnecessary OOM 
> exceptions.
> E.g. when grid search space is 12, old implementation needs to hold all 12 
> trained models in the driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and previous 
> model can be cleared by GC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-07-26 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101870#comment-16101870
 ] 

yuhao yang commented on SPARK-21535:


The basic idea is that we should release the driver memory as soon as a trained 
model has been evaluated. I don't think there's any conflict, but let me know if 
there is and I'll revert the JIRA.

I'm not a big fan of the parallel CV idea. Personally I cannot see how it 
improves overall performance or ease of use, but maybe I just haven't hit the 
right scenarios.
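
A minimal sketch of that idea (not the actual patch), using only public APIs; the 
parameter names mirror the snippet in the issue description:

{code}
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Fit and evaluate one ParamMap at a time, so each trained model becomes
// unreachable (and collectable by GC) before the next one is trained.
def fitAndEvaluateSequentially(
    est: Estimator[_],
    evaluator: Evaluator,
    epm: Array[ParamMap],
    trainingDataset: Dataset[_],
    validationDataset: Dataset[_]): Array[Double] = {
  epm.map { paramMap =>
    val model = est.fit(trainingDataset, paramMap).asInstanceOf[Model[_]]
    evaluator.evaluate(model.transform(validationDataset, paramMap))
  }
}
{code}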

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, current implementation 
> consumes extra driver memory for holding the trained models, which is not 
> necessary and often leads to memory exception for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation, 
> thus that used model can be collected by GC, and avoid the unnecessary OOM 
> exceptions.
> E.g. when grid search space is 12, old implementation needs to hold all 12 
> trained models in the driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and previous 
> model can be cleared by GC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-07-25 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100860#comment-16100860
 ] 

yuhao yang commented on SPARK-21535:


https://github.com/apache/spark/pulls 

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, current implementation 
> consumes extra driver memory for holding the trained models, which is not 
> necessary and often leads to memory exception for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation, 
> thus that used model can be collected by GC, and avoid the unnecessary OOM 
> exceptions.
> E.g. when grid search space is 12, old implementation needs to hold all 12 
> trained models in the driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and previous 
> model can be cleared by GC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-07-25 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-21535:
---
Description: 
CrossValidator and TrainValidationSplit both use 
{code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
epm is Array[ParamMap].

Even though the training process is sequential, the current implementation consumes 
extra driver memory to hold the trained models, which is not necessary and 
often leads to memory exceptions for both CrossValidator and 
TrainValidationSplit. My proposal is to optimize the training implementation 
so that each used model can be collected by GC, avoiding unnecessary OOM 
exceptions.

E.g. when the grid search space is 12, the old implementation needs to hold all 12 
trained models in driver memory at the same time, while the new 
implementation only needs to hold 1 trained model at a time; previous models 
can be cleared by GC.

  was:
CrossValidator and TrainValidationSplit both use 
{code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
epm is Array[ParamMap].

Even though the training process is sequential, current implementation consumes 
extra driver memory for holding the trained models, which is not necessary and 
often leads to memory exception for both CrossValidator and 
TrainValidationSplit. My proposal is to changing the training implementation to 
train one model at a time, thus that used local model can be collected by GC, 
and avoid the unnecessary OOM exceptions.


> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, current implementation 
> consumes extra driver memory for holding the trained models, which is not 
> necessary and often leads to memory exception for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation, 
> thus that used model can be collected by GC, and avoid the unnecessary OOM 
> exceptions.
> E.g. when grid search space is 12, old implementation needs to hold all 12 
> trained models in the driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and previous 
> model can be cleared by GC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-07-25 Thread yuhao yang (JIRA)
yuhao yang created SPARK-21535:
--

 Summary: Reduce memory requirement for CrossValidator and 
TrainValidationSplit 
 Key: SPARK-21535
 URL: https://issues.apache.org/jira/browse/SPARK-21535
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang


CrossValidator and TrainValidationSplit both use 
{code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
epm is Array[ParamMap].

Even though the training process is sequential, the current implementation consumes 
extra driver memory to hold the trained models, which is not necessary and 
often leads to memory exceptions for both CrossValidator and 
TrainValidationSplit. My proposal is to change the training implementation to 
train one model at a time, so that each used local model can be collected by GC, 
avoiding unnecessary OOM exceptions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21087) CrossValidator, TrainValidationSplit should preserve all models after fitting: Scala

2017-07-25 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100447#comment-16100447
 ] 

yuhao yang commented on SPARK-21087:


Withdrawing my PR; anyone interested, please go ahead and work on this.

> CrossValidator, TrainValidationSplit should preserve all models after 
> fitting: Scala
> 
>
> Key: SPARK-21087
> URL: https://issues.apache.org/jira/browse/SPARK-21087
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21524) ValidatorParamsSuiteHelpers generates wrong temp files

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099313#comment-16099313
 ] 

yuhao yang commented on SPARK-21524:


https://github.com/apache/spark/pull/18728

> ValidatorParamsSuiteHelpers generates wrong temp files
> --
>
> Key: SPARK-21524
> URL: https://issues.apache.org/jira/browse/SPARK-21524
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> ValidatorParamsSuiteHelpers.testFileMove() is generating temp dir in the 
> wrong place and does not delete them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21524) ValidatorParamsSuiteHelpers generates wrong temp files

2017-07-24 Thread yuhao yang (JIRA)
yuhao yang created SPARK-21524:
--

 Summary: ValidatorParamsSuiteHelpers generates wrong temp files
 Key: SPARK-21524
 URL: https://issues.apache.org/jira/browse/SPARK-21524
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang


ValidatorParamsSuiteHelpers.testFileMove() generates temp directories in the wrong 
place and does not delete them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14239) Add load for LDAModel that supports both local and distributedModel

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098948#comment-16098948
 ] 

yuhao yang commented on SPARK-14239:


Closing this overlooked, stale JIRA.

> Add load for LDAModel that supports both local and distributedModel
> ---
>
> Key: SPARK-14239
> URL: https://issues.apache.org/jira/browse/SPARK-14239
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Add load for LDAModel that supports loading both local and distributedModel, 
> as discussed in https://github.com/apache/spark/pull/9894. So that users 
> don't have to know the details.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14239) Add load for LDAModel that supports both local and distributedModel

2017-07-24 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang resolved SPARK-14239.

Resolution: Won't Do

> Add load for LDAModel that supports both local and distributedModel
> ---
>
> Key: SPARK-14239
> URL: https://issues.apache.org/jira/browse/SPARK-14239
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Add load for LDAModel that supports loading both local and distributedModel, 
> as discussed in https://github.com/apache/spark/pull/9894. So that users 
> don't have to know the details.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12875) Add Weight of Evidence and Information value to Spark.ml as a feature transformer

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098946#comment-16098946
 ] 

yuhao yang commented on SPARK-12875:


Closing this stale JIRA.

> Add Weight of Evidence and Information value to Spark.ml as a feature 
> transformer
> -
>
> Key: SPARK-12875
> URL: https://issues.apache.org/jira/browse/SPARK-12875
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> As a feature transformer, WOE and IV enable one to:
> Consider each variable’s independent contribution to the outcome.
> Detect linear and non-linear relationships.
> Rank variables in terms of "univariate" predictive strength.
> Visualize the correlations between the predictive variables and the binary 
> outcome.
> http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ gives 
> a good introduction to WoE and IV.
>  The Weight of Evidence or WoE value provides a measure of how well a 
> grouping of feature is able to distinguish between a binary response (e.g. 
> "good" versus "bad"), which is widely used in grouping continuous feature or 
> mapping categorical features to continuous values. It is computed from the 
> basic odds ratio:
> (Distribution of positive Outcomes) / (Distribution of negative Outcomes)
> where Distr refers to the proportion of positive or negative in the 
> respective group, relative to the column totals.
> The WoE recoding of features is particularly well suited for subsequent 
> modeling using Logistic Regression or MLP.
> In addition, the information value or IV can be computed based on WoE, which 
> is a popular technique to select variables in a predictive model.
> TODO: Currently we support only calculation for categorical features. Add an 
> estimator to estimate the proper grouping for continuous feature. 
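
For reference, a rough user-side sketch (not part of Spark) of the WoE/IV computation 
described above, taking the natural log of the odds ratio per category; the column 
names and the assumption of a 0/1 integer label are illustrative:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, log, when}

// Weight of Evidence per category: ln(distrPos / distrNeg), and the per-category
// Information Value contribution: (distrPos - distrNeg) * WoE.
// (Categories with zero positives or negatives would need smoothing in practice.)
def woeTable(df: DataFrame, feature: String, label: String): DataFrame = {
  val totals = df.agg(
    count(when(col(label) === 1, true)).alias("pos"),
    count(when(col(label) === 0, true)).alias("neg")).first()
  val totalPos = totals.getLong(0).toDouble
  val totalNeg = totals.getLong(1).toDouble

  df.groupBy(feature)
    .agg(
      count(when(col(label) === 1, true)).alias("pos"),
      count(when(col(label) === 0, true)).alias("neg"))
    .withColumn("woe", log((col("pos") / totalPos) / (col("neg") / totalNeg)))
    .withColumn("iv", (col("pos") / totalPos - col("neg") / totalNeg) * col("woe"))
}
{code}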



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12875) Add Weight of Evidence and Information value to Spark.ml as a feature transformer

2017-07-24 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang resolved SPARK-12875.

Resolution: Won't Do

> Add Weight of Evidence and Information value to Spark.ml as a feature 
> transformer
> -
>
> Key: SPARK-12875
> URL: https://issues.apache.org/jira/browse/SPARK-12875
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> As a feature transformer, WOE and IV enable one to:
> Consider each variable’s independent contribution to the outcome.
> Detect linear and non-linear relationships.
> Rank variables in terms of "univariate" predictive strength.
> Visualize the correlations between the predictive variables and the binary 
> outcome.
> http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ gives 
> a good introduction to WoE and IV.
>  The Weight of Evidence or WoE value provides a measure of how well a 
> grouping of feature is able to distinguish between a binary response (e.g. 
> "good" versus "bad"), which is widely used in grouping continuous feature or 
> mapping categorical features to continuous values. It is computed from the 
> basic odds ratio:
> (Distribution of positive Outcomes) / (Distribution of negative Outcomes)
> where Distr refers to the proportion of positive or negative in the 
> respective group, relative to the column totals.
> The WoE recoding of features is particularly well suited for subsequent 
> modeling using Logistic Regression or MLP.
> In addition, the information value or IV can be computed based on WoE, which 
> is a popular technique to select variables in a predictive model.
> TODO: Currently we support only calculation for categorical features. Add an 
> estimator to estimate the proper grouping for continuous feature. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098940#comment-16098940
 ] 

yuhao yang edited comment on SPARK-14760 at 7/24/17 6:23 PM:
-

Closing this stale JIRA since it's been overlooked for some time. Thanks for the 
review and comments.


was (Author: yuhaoyan):
Close it since it's been overlooked for some time. Thanks for the review and 
comments.

> Feature transformers should always invoke transformSchema in transform or fit
> -
>
> Key: SPARK-14760
> URL: https://issues.apache.org/jira/browse/SPARK-14760
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Since one of the primary function for transformSchema is to conduct parameter 
> validation, transformers should always invoke transformSchema in transform 
> and fit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098940#comment-16098940
 ] 

yuhao yang commented on SPARK-14760:


Closing it since it's been overlooked for some time. Thanks for the review and 
comments.

> Feature transformers should always invoke transformSchema in transform or fit
> -
>
> Key: SPARK-14760
> URL: https://issues.apache.org/jira/browse/SPARK-14760
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Since one of the primary function for transformSchema is to conduct parameter 
> validation, transformers should always invoke transformSchema in transform 
> and fit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13223) Add stratified sampling to ML feature engineering

2017-07-24 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang resolved SPARK-13223.

Resolution: Not A Problem

> Add stratified sampling to ML feature engineering
> -
>
> Key: SPARK-13223
> URL: https://issues.apache.org/jira/browse/SPARK-13223
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> I found it useful to add an sampling transformer during a case of fraud 
> detection. It can be used in resampling or overSampling, which in turn is 
> required by ensemble and unbalanced data processing.
> Internally, it invoke the sampleByKey in Pair RDD operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13223) Add stratified sampling to ML feature engineering

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098933#comment-16098933
 ] 

yuhao yang commented on SPARK-13223:


Closing it since it's been overlooked for some time and can be implemented easily 
with #17583.
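
For reference, a minimal sketch of how such per-label (stratified) sampling can be 
done today with the public DataFrame API; the label column, fractions, and seed are 
illustrative, and this is independent of the implementation in #17583:

{code}
import org.apache.spark.sql.DataFrame

// Per-label sampling, analogous to sampleByKey on a pair RDD: keep all positives
// and 10% of the negatives, e.g. to rebalance a fraud-detection dataset.
// Assumes a double-typed label column with values 0.0 / 1.0.
def rebalance(df: DataFrame, labelCol: String, seed: Long): DataFrame = {
  val fractions = Map(1.0 -> 1.0, 0.0 -> 0.1)
  df.stat.sampleBy(labelCol, fractions, seed)
}
{code}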

> Add stratified sampling to ML feature engineering
> -
>
> Key: SPARK-13223
> URL: https://issues.apache.org/jira/browse/SPARK-13223
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> I found it useful to add an sampling transformer during a case of fraud 
> detection. It can be used in resampling or overSampling, which in turn is 
> required by ensemble and unbalanced data processing.
> Internally, it invoke the sampleByKey in Pair RDD operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-07-21 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097062#comment-16097062
 ] 

yuhao yang commented on SPARK-21086:


Sure, indices sound fine.

Regarding driver memory, especially for CrossValidator, caching all the trained 
models would be impractical and unnecessary. Even though all the models are 
collected to the driver, it's a sequential process, and with the current 
implementation of CrossValidator, GC can kick in and clear the previous 
models, which is especially important for large models.

> CrossValidator, TrainValidationSplit should preserve all models after fitting
> -
>
> Key: SPARK-21086
> URL: https://issues.apache.org/jira/browse/SPARK-21086
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> I've heard multiple requests for having CrossValidatorModel and 
> TrainValidationSplitModel preserve the full list of fitted models.  This 
> sounds very valuable.
> One decision should be made before we do this: Should we save and load the 
> models in ML persistence?  That could blow up the size of a saved Pipeline if 
> the models are large.
> * I suggest *not* saving the models by default but allowing saving if 
> specified.  We could specify whether to save the model as an extra Param for 
> CrossValidatorModelWriter, but we would have to make sure to expose 
> CrossValidatorModelWriter as a public API and modify the return type of 
> CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not 
> be a breaking change).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18724) Add TuningSummary for TrainValidationSplit and CountVectorizer

2017-07-06 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-18724:
---
Summary: Add TuningSummary for TrainValidationSplit and CountVectorizer  
(was: Add TuningSummary for TrainValidationSplit)

> Add TuningSummary for TrainValidationSplit and CountVectorizer
> --
>
> Key: SPARK-18724
> URL: https://issues.apache.org/jira/browse/SPARK-18724
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Currently TrainValidationSplitModel only provides tuning metrics in the 
> format of Array[Double], which makes it harder for tying the metrics back to 
> the paramMap generating them and affects the usefulness for the tuning 
> framework.
> Add a Tuning Summary to provide better presentation for the tuning metrics, 
> for now the idea is to use a DataFrame listing all the params and 
> corresponding metrics.
> The Tuning Summary Class can be further extended for CrossValidator.
> Refer to https://issues.apache.org/jira/browse/SPARK-18704 for more related 
> discussion
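
For illustration, a rough user-side sketch of such a summary DataFrame, built from an 
already-fitted TrainValidationSplitModel; the SparkSession, the fitted model, and the 
output column names are assumptions for the example:

{code}
import org.apache.spark.ml.tuning.TrainValidationSplitModel
import org.apache.spark.sql.{DataFrame, SparkSession}

// Pair each ParamMap with its validation metric and render both as columns,
// so the metrics can be tied back to the settings that produced them.
def tuningSummary(spark: SparkSession, model: TrainValidationSplitModel): DataFrame = {
  val rows = model.getEstimatorParamMaps.zip(model.validationMetrics).map {
    case (paramMap, metric) =>
      val params = paramMap.toSeq.map(p => s"${p.param.name}=${p.value}").mkString(", ")
      (params, metric)
  }
  spark.createDataFrame(rows.toSeq).toDF("params", "metric")
}
{code}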



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2017-07-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073987#comment-16073987
 ] 

yuhao yang edited comment on SPARK-11069 at 7/4/17 6:32 PM:


[~levente.torok.ge] toLowercase is set to true by default since 1.6 to be 
consistent with Tokenizer and to accommodate the most common user scenarios. The 
change of behavior was documented in the 1.6 release notes: 
https://spark.apache.org/releases/spark-release-1-6-0.html

You can disable it by setting toLowercase to false:
 val regexTokenizer = new RegexTokenizer()
 *{color:red} .setToLowercase(false){color}*
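
A fuller, self-contained version of the snippet above; the input/output column 
names and the pattern are just examples:

{code}
import org.apache.spark.ml.feature.RegexTokenizer

val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W")          // split on non-word characters
  .setToLowercase(false)      // keep the original casing
{code}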



was (Author: yuhaoyan):
   [~levente.torok.ge] toLowerCase is set to true by default from 1.6+ to be 
consistent with Tokenizer and accommodate the general user scenarios. The 
change of behavior was documented in the release notes of 1.6. 
https://spark.apache.org/releases/spark-release-1-6-0.html

 val regexTokenizer = new RegexTokenizer()
 *{color:red} .setToLowercase(false){color}*


> Add RegexTokenizer option to convert to lowercase
> -
>
> Key: SPARK-11069
> URL: https://issues.apache.org/jira/browse/SPARK-11069
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Tokenizer converts strings to lowercase automatically, but RegexTokenizer 
> does not.  It would be nice to add an option to RegexTokenizer to convert to 
> lowercase.  Proposal:
> * call the Boolean Param "toLowercase"
> * set default to false (so behavior does not change)
> *Q*: Should conversion to lowercase happen before or after regex matching?
> * Before: This is simpler.
> * After: This gives the user full control since they can have the regex treat 
> upper/lower case differently.
> --> I'd vote for conversion before matching.  If a user needs full control, 
> they can convert to lowercase manually.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2017-07-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073987#comment-16073987
 ] 

yuhao yang edited comment on SPARK-11069 at 7/4/17 6:31 PM:


   [~levente.torok.ge] toLowerCase is set to true by default from 1.6+ to be 
consistent with Tokenizer and accommodate the general user scenarios. The 
change of behavior was documented in the release notes of 1.6. 
https://spark.apache.org/releases/spark-release-1-6-0.html

 val regexTokenizer = new RegexTokenizer()
 *{color:red} .setToLowercase(false){color}*



was (Author: yuhaoyan):
   [~levente.torok.ge] use

 val regexTokenizer = new RegexTokenizer()
 *{color:red} .setToLowercase(false){color}*


> Add RegexTokenizer option to convert to lowercase
> -
>
> Key: SPARK-11069
> URL: https://issues.apache.org/jira/browse/SPARK-11069
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Tokenizer converts strings to lowercase automatically, but RegexTokenizer 
> does not.  It would be nice to add an option to RegexTokenizer to convert to 
> lowercase.  Proposal:
> * call the Boolean Param "toLowercase"
> * set default to false (so behavior does not change)
> *Q*: Should conversion to lowercase happen before or after regex matching?
> * Before: This is simpler.
> * After: This gives the user full control since they can have the regex treat 
> upper/lower case differently.
> --> I'd vote for conversion before matching.  If a user needs full control, 
> they can convert to lowercase manually.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2017-07-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073987#comment-16073987
 ] 

yuhao yang commented on SPARK-11069:


   [~levente.torok.ge] use

 val regexTokenizer = new RegexTokenizer()
 *{color:red} .setToLowercase(false){color}*


> Add RegexTokenizer option to convert to lowercase
> -
>
> Key: SPARK-11069
> URL: https://issues.apache.org/jira/browse/SPARK-11069
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Tokenizer converts strings to lowercase automatically, but RegexTokenizer 
> does not.  It would be nice to add an option to RegexTokenizer to convert to 
> lowercase.  Proposal:
> * call the Boolean Param "toLowercase"
> * set default to false (so behavior does not change)
> *Q*: Should conversion to lowercase happen before or after regex matching?
> * Before: This is simpler.
> * After: This gives the user full control since they can have the regex treat 
> upper/lower case differently.
> --> I'd vote for conversion before matching.  If a user needs full control, 
> they can convert to lowercase manually.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-06-30 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070883#comment-16070883
 ] 

yuhao yang commented on SPARK-20082:


I'm OK with only supporting initialModel for Online LDA now. For EM LDA, an 
initial model is also possible, but we may need some extra checks depending on 
whether EM can fit on new documents.

I'll make a pass over the current implementation, but we still need the opinion 
and final sign-off from [~josephkb] or other committers.
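
For context, a rough sketch of how the proposed param might be used once it exists (setInitialModel is hypothetical here, and firstBatchDF/newBatchDF are assumed DataFrames with a features column):

{code}
import org.apache.spark.ml.clustering.{LDA, LDAModel}

// Fit an initial model on the first batch of documents.
val firstBatchModel: LDAModel = new LDA()
  .setK(20)
  .setOptimizer("online")
  .setMaxIter(50)
  .fit(firstBatchDF)

// Proposed usage: continue training from the previous model when new documents arrive.
val updatedModel = new LDA()
  .setK(20)
  .setOptimizer("online")
  .setMaxIter(50)
  .setInitialModel(firstBatchModel)   // proposed param, does not exist yet
  .fit(newBatchDF)
{code}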

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu DESPRIEE
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest to add an initialModel as a start point for LDA.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19053) Supporting multiple evaluation metrics in DataFrame-based API: discussion

2017-06-30 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070849#comment-16070849
 ] 

yuhao yang commented on SPARK-19053:


Not sure if this is still wanted. cc [~josephkb]
I'd also like to understand whether this jira is about a performance improvement 
or an API refinement. As I understand it, the Evaluator classes in ml basically 
invoke the mllib implementation, which computes the metrics in one pass.
Will this change the return type of the Evaluator.evaluate() method? Currently 
it's Double.
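
For reference, a minimal sketch of the kind of one-pass multi-metric computation the mllib side already offers, assuming predictions is a DataFrame produced by model.transform with prediction and label columns:

{code}
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Extract (prediction, label) pairs from the predictions DataFrame.
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(r => (r.getDouble(0), r.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)
// Several metrics become available from a single aggregation over the data.
println(s"accuracy = ${metrics.accuracy}")
println(s"weighted precision = ${metrics.weightedPrecision}")
println(s"weighted recall = ${metrics.weightedRecall}")
println(s"weighted F1 = ${metrics.weightedFMeasure}")
{code}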


> Supporting multiple evaluation metrics in DataFrame-based API: discussion
> -
>
> Key: SPARK-19053
> URL: https://issues.apache.org/jira/browse/SPARK-19053
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is to discuss supporting the computation of multiple evaluation 
> metrics efficiently in the DataFrame-based API for MLlib.
> In the RDD-based API, RegressionMetrics and other *Metrics classes support 
> efficient computation of multiple metrics.
> In the DataFrame-based API, there are a few options:
> * model/result summaries (e.g., LogisticRegressionSummary): These currently 
> provide the desired functionality, but they require a model and do not let 
> users compute metrics manually from DataFrames of predictions and true labels.
> * Evaluator classes (e.g., RegressionEvaluator): These only support computing 
> a single metric in one pass over the data, but they do not require a model.
> * new class analogous to Metrics: We could introduce a class analogous to 
> Metrics.  Model/result summaries could use this internally as a replacement 
> for spark.mllib Metrics classes, or they could (maybe) inherit from these 
> classes.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18441) Add Smote in spark mlib and ml

2017-06-28 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067494#comment-16067494
 ] 

yuhao yang commented on SPARK-18441:


Moved the SMOTE code to 
https://gist.github.com/hhbyyh/346467373014943a7f20df208caeb19b

> Add Smote in spark mlib and ml
> --
>
> Key: SPARK-18441
> URL: https://issues.apache.org/jira/browse/SPARK-18441
> Project: Spark
>  Issue Type: Wish
>  Components: ML, MLlib
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> PLZ Add Smote in spark mlib and ml in case of  the "not balance of train 
> data" for Classification



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator

2017-06-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065694#comment-16065694
 ] 

yuhao yang commented on SPARK-21152:


This is something that we should investigate anyway. 

By GEMM, do you mean you will treat the coefficients as a Matrix even though 
it's actually a vector? Before implementing this, I think it's necessary to 
check the GEMM speedup when multiplying a matrix by a vector, which could be 
quite different from normal GEMM. A rough way to check is sketched below.
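
A rough micro-benchmark sketch using Breeze, which Spark already depends on (the sizes are arbitrary and JIT warm-up is ignored, so treat the numbers only as a sanity check):

{code}
import breeze.linalg.{DenseMatrix, DenseVector}

val n = 2000
val blockSize = 256
val coef = DenseVector.rand(n)                 // coefficients as a vector
val coefMat = coef.toDenseMatrix.t             // the same coefficients as an n x 1 matrix
val block = DenseMatrix.rand(blockSize, n)     // one block of instances

def timeMs[T](body: => T): Long = {
  val start = System.nanoTime()
  body
  (System.nanoTime() - start) / 1000000
}

// GEMV: matrix-vector product, one margin per instance
val gemvMs = timeMs { block * coef }
// GEMM with a 1-column right operand: the same product expressed as matrix-matrix
val gemmMs = timeMs { block * coefMat }
println(s"gemv: $gemvMs ms, gemm with 1-column matrix: $gemmMs ms")
{code}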

> Use level 3 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-21152
> URL: https://issues.apache.org/jira/browse/SPARK-21152
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Seth Hendrickson
>
> In logistic regression gradient update, we currently compute by each 
> individual row. If we blocked the rows together, we can do a blocked gradient 
> update which leverages the BLAS GEMM operation.
> On high dimensional dense datasets, I've observed ~10x speedups. The problem 
> here, though, is that it likely won't improve the sparse case so we need to 
> keep both implementations around, and this blocked algorithm will require 
> caching a new dataset of type:
> {code}
> BlockInstance(label: Vector, weight: Vector, features: Matrix)
> {code}
> We have avoided caching anything beside the original dataset passed to train 
> in the past because it adds memory overhead if the user has cached this 
> original dataset for other reasons. Here, I'd like to discuss whether we 
> think this patch would be worth the investment, given that it only improves a 
> subset of the use cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21108) convert LinearSVC to aggregator framework

2017-06-15 Thread yuhao yang (JIRA)
yuhao yang created SPARK-21108:
--

 Summary: convert LinearSVC to aggregator framework
 Key: SPARK-21108
 URL: https://issues.apache.org/jira/browse/SPARK-21108
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21087) CrossValidator, TrainValidationSplit should preserve all models after fitting: Scala

2017-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048723#comment-16048723
 ] 

yuhao yang commented on SPARK-21087:


I'd like to work on this if my 
[comment|https://issues.apache.org/jira/browse/SPARK-21086?focusedCommentId=16048647=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16048647]
 looks reasonable.


> CrossValidator, TrainValidationSplit should preserve all models after 
> fitting: Scala
> 
>
> Key: SPARK-21087
> URL: https://issues.apache.org/jira/browse/SPARK-21087
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048647#comment-16048647
 ] 

yuhao yang edited comment on SPARK-21086 at 6/14/17 5:22 AM:
-

Sounds good. About the default path for saving the different models, how about 
we use the flattened params as the file name, e.g. 
LogisticRegressionModel-maxIter-100-regParam-0.1?

And I would not implement it with the ML Persistence Framework, simply because 
caching the models in memory would be expensive (especially impractical for 
driver memory) and would impact the existing usage of CrossValidator (slower, or 
OOM). I would recommend adding an expert param and saving the models during 
training.
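
As a concrete illustration of the flattened-param file name idea, a minimal sketch (purely illustrative, not an agreed API):

{code}
import org.apache.spark.ml.param.ParamMap

// Flatten a ParamMap into a directory-friendly suffix such as
// "maxIter-100-regParam-0.1" for saving the corresponding sub-model.
def paramMapToFileName(paramMap: ParamMap): String =
  paramMap.toSeq
    .sortBy(_.param.name)
    .map(pair => s"${pair.param.name}-${pair.value}")
    .mkString("-")
{code}

The sub-model path could then be built as, e.g., basePath + "/LogisticRegressionModel-" + paramMapToFileName(paramMap), where basePath is whatever the expert param points to.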


was (Author: yuhaoyan):
Sounds good. About the default path for saving different models, how about we 
use the flatten parameter as the file name. 
e.g. LogisticRegressionModel-maxIter-100-regParam-0.1

And I would not implement it with the ML Persistence Framework, simply because 
caching the models in memory would be expensive and would impact the existing 
usage of CrossValidator (Slower or OOM). I would recommend adding an expert 
param and save the models during training.

> CrossValidator, TrainValidationSplit should preserve all models after fitting
> -
>
> Key: SPARK-21086
> URL: https://issues.apache.org/jira/browse/SPARK-21086
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> I've heard multiple requests for having CrossValidatorModel and 
> TrainValidationSplitModel preserve the full list of fitted models.  This 
> sounds very valuable.
> One decision should be made before we do this: Should we save and load the 
> models in ML persistence?  That could blow up the size of a saved Pipeline if 
> the models are large.
> * I suggest *not* saving the models by default but allowing saving if 
> specified.  We could specify whether to save the model as an extra Param for 
> CrossValidatorModelWriter, but we would have to make sure to expose 
> CrossValidatorModelWriter as a public API and modify the return type of 
> CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not 
> be a breaking change).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048647#comment-16048647
 ] 

yuhao yang edited comment on SPARK-21086 at 6/14/17 5:12 AM:
-

Sounds good. About the default path for saving different models, how about we 
use the flatten parameter as the file name. 
e.g. LogisticRegressionModel-maxIter-100-regParam-0.1

And I would not implement it with the ML Persistence Framework, simply because 
caching the models in memory would be expensive and would impact the existing 
usage of CrossValidator (Slower or OOM). I would recommend adding an expert 
param and save the models during training.


was (Author: yuhaoyan):
Sounds good. About the default path for saving different models, how about we 
use the flatten parameter as the file name. 
e.g. LogisticRegressionModel-maxIter-100-regParam-0.1

> CrossValidator, TrainValidationSplit should preserve all models after fitting
> -
>
> Key: SPARK-21086
> URL: https://issues.apache.org/jira/browse/SPARK-21086
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> I've heard multiple requests for having CrossValidatorModel and 
> TrainValidationSplitModel preserve the full list of fitted models.  This 
> sounds very valuable.
> One decision should be made before we do this: Should we save and load the 
> models in ML persistence?  That could blow up the size of a saved Pipeline if 
> the models are large.
> * I suggest *not* saving the models by default but allowing saving if 
> specified.  We could specify whether to save the model as an extra Param for 
> CrossValidatorModelWriter, but we would have to make sure to expose 
> CrossValidatorModelWriter as a public API and modify the return type of 
> CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not 
> be a breaking change).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048698#comment-16048698
 ] 

yuhao yang commented on SPARK-20988:


Eh.. I was trying to add the squared_hinge loss to LinearSVC and have already 
converted LinearSVC to use the aggregator framework in SPARK-20602: 
https://github.com/apache/spark/pull/17862.

cc [~VinceXie]

> Convert logistic regression to new aggregator framework
> ---
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC

2017-06-13 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang resolved SPARK-20348.

Resolution: Duplicate

Combined it with SPARK-20602 and resolved this as a duplicate.

> Support squared hinge loss (L2 loss) for LinearSVC
> --
>
> Key: SPARK-20348
> URL: https://issues.apache.org/jira/browse/SPARK-20348
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> While Hinge loss is the standard loss function for linear SVM, Squared hinge 
> loss (a.k.a. L2 loss) is also popular in practice. L2-SVM is differentiable 
> and imposes a bigger (quadratic vs. linear) loss for points which violate the 
> margin. Some introduction can be found from 
> http://mccormickml.com/2015/01/06/what-is-an-l2-svm/
> Liblinear and [scikit 
> learn|http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html]
>  both offer squared hinge loss as the default loss function for linear SVM. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20602) Adding LBFGS optimizer and Squared_hinge loss for LinearSVC

2017-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048663#comment-16048663
 ] 

yuhao yang commented on SPARK-20602:


Combining this with SPARK-20348 (Support squared hinge loss (L2 loss) for 
LinearSVC) and closing SPARK-20348.

> Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
> ---
>
> Key: SPARK-20602
> URL: https://issues.apache.org/jira/browse/SPARK-20602
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Currently LinearSVC in Spark only supports OWLQN as the optimizer ( check 
> https://issues.apache.org/jira/browse/SPARK-14709). I made comparison between 
> LBFGS and OWLQN on several public dataset and found LBFGS converges much 
> faster for LinearSVC in most cases.
> The following table presents the number of training iterations and f1 score 
> of both optimizers until convergence
> ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge||
> |news20.binary| 31 (0.99) | 413(0.99) |  185 (0.99) |
> |mushroom| 28(1.0) | 170(1.0)| 24(1.0) |
> |madelon|143(0.75) | 8129(0.70)| 823(0.74) |
> |breast-cancer-scale| 15(1.0) | 16(1.0)| 15 (1.0) |
> |phishing | 329(0.94) | 231(0.94) | 67 (0.94) |
> |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) |
> |a7a | 237 (0.84) | 372(0.84) | 69(0.84) |
> data source: 
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
> training code: new LinearSVC().setMaxIter(1).setTol(1e-6)
> LBFGS requires less iterations in most cases (except for a1a) and probably is 
> a better default optimizer. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20602) Adding LBFGS optimizer and Squared_hinge loss for LinearSVC

2017-06-13 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-20602:
---
Summary: Adding LBFGS optimizer and Squared_hinge loss for LinearSVC  (was: 
Adding LBFGS as optimizer for LinearSVC)

> Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
> ---
>
> Key: SPARK-20602
> URL: https://issues.apache.org/jira/browse/SPARK-20602
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Currently LinearSVC in Spark only supports OWLQN as the optimizer ( check 
> https://issues.apache.org/jira/browse/SPARK-14709). I made comparison between 
> LBFGS and OWLQN on several public dataset and found LBFGS converges much 
> faster for LinearSVC in most cases.
> The following table presents the number of training iterations and f1 score 
> of both optimizers until convergence
> ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge||
> |news20.binary| 31 (0.99) | 413(0.99) |  185 (0.99) |
> |mushroom| 28(1.0) | 170(1.0)| 24(1.0) |
> |madelon|143(0.75) | 8129(0.70)| 823(0.74) |
> |breast-cancer-scale| 15(1.0) | 16(1.0)| 15 (1.0) |
> |phishing | 329(0.94) | 231(0.94) | 67 (0.94) |
> |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) |
> |a7a | 237 (0.84) | 372(0.84) | 69(0.84) |
> data source: 
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
> training code: new LinearSVC().setMaxIter(1).setTol(1e-6)
> LBFGS requires less iterations in most cases (except for a1a) and probably is 
> a better default optimizer. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048647#comment-16048647
 ] 

yuhao yang commented on SPARK-21086:


Sounds good. About the default path for saving different models, how about we 
use the flatten parameter as the file name. 
e.g. LogisticRegressionModel-maxIter-100-regParam-0.1

> CrossValidator, TrainValidationSplit should preserve all models after fitting
> -
>
> Key: SPARK-21086
> URL: https://issues.apache.org/jira/browse/SPARK-21086
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> I've heard multiple requests for having CrossValidatorModel and 
> TrainValidationSplitModel preserve the full list of fitted models.  This 
> sounds very valuable.
> One decision should be made before we do this: Should we save and load the 
> models in ML persistence?  That could blow up the size of a saved Pipeline if 
> the models are large.
> * I suggest *not* saving the models by default but allowing saving if 
> specified.  We could specify whether to save the model as an extra Param for 
> CrossValidatorModelWriter, but we would have to make sure to expose 
> CrossValidatorModelWriter as a public API and modify the return type of 
> CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not 
> be a breaking change).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20602) Adding LBFGS as optimizer for LinearSVC

2017-06-13 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-20602:
---
Description: 
Currently LinearSVC in Spark only supports OWLQN as the optimizer ( check 
https://issues.apache.org/jira/browse/SPARK-14709). I made comparison between 
LBFGS and OWLQN on several public dataset and found LBFGS converges much faster 
for LinearSVC in most cases.

The following table presents the number of training iterations and f1 score of 
both optimizers until convergence

||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge||
|news20.binary| 31 (0.99) | 413(0.99) |  185 (0.99) |
|mushroom| 28(1.0) | 170(1.0)| 24(1.0) |
|madelon|143(0.75) | 8129(0.70)| 823(0.74) |
|breast-cancer-scale| 15(1.0) | 16(1.0)| 15 (1.0) |
|phishing | 329(0.94) | 231(0.94) | 67 (0.94) |
|a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) |
|a7a | 237 (0.84) | 372(0.84) | 69(0.84) |

data source: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
training code: new LinearSVC().setMaxIter(1).setTol(1e-6)

LBFGS requires less iterations in most cases (except for a1a) and probably is a 
better default optimizer. 



  was:
Currently LinearSVC in Spark only supports OWLQN as the optimizer ( check 
https://issues.apache.org/jira/browse/SPARK-14709). I made comparison between 
LBFGS and OWLQN on several public dataset and found LBFGS converges much faster 
for LinearSVC in most cases.

The following table presents the number of training iterations and f1 score of 
both optimizers until convergence

||Dataset||LBFGS||OWLQN||
|news20.binary| 31 (0.99) | 413(0.99) |
|mushroom| 28(1.0) | 170(1.0)|
|madelon|143(0.75) | 8129(0.70)|
|breast-cancer-scale| 15(1.0) | 16(1.0)|
|phishing | 329(0.94) | 231(0.94) |
|a1a(adult) | 466 (0.87) | 282 (0.87) |
|a7a | 237 (0.84) | 372(0.84) |

data source: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
training code: new LinearSVC().setMaxIter(1).setTol(1e-6)

LBFGS requires less iterations in most cases (except for a1a) and probably is a 
better default optimizer. 




> Adding LBFGS as optimizer for LinearSVC
> ---
>
> Key: SPARK-20602
> URL: https://issues.apache.org/jira/browse/SPARK-20602
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Currently LinearSVC in Spark only supports OWLQN as the optimizer ( check 
> https://issues.apache.org/jira/browse/SPARK-14709). I made comparison between 
> LBFGS and OWLQN on several public dataset and found LBFGS converges much 
> faster for LinearSVC in most cases.
> The following table presents the number of training iterations and f1 score 
> of both optimizers until convergence
> ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge||
> |news20.binary| 31 (0.99) | 413(0.99) |  185 (0.99) |
> |mushroom| 28(1.0) | 170(1.0)| 24(1.0) |
> |madelon|143(0.75) | 8129(0.70)| 823(0.74) |
> |breast-cancer-scale| 15(1.0) | 16(1.0)| 15 (1.0) |
> |phishing | 329(0.94) | 231(0.94) | 67 (0.94) |
> |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) |
> |a7a | 237 (0.84) | 372(0.84) | 69(0.84) |
> data source: 
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
> training code: new LinearSVC().setMaxIter(1).setTol(1e-6)
> LBFGS requires less iterations in most cases (except for a1a) and probably is 
> a better default optimizer. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-05-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022379#comment-16022379
 ] 

yuhao yang commented on SPARK-20082:


refer to https://issues.apache.org/jira/browse/SPARK-20767 for some insights 
shared by [~cezden]
{quote}
Technical aspects:
1. The implementation of LDA fitting does not currently allow the coefficients 
pre-setting (private setter), as noted by a comment in the source code of 
OnlineLDAOptimizer.setLambda: "This is only used for testing now. In the 
future, it can help support training stop/resume".
2. The lambda matrix is always randomly initialized by the optimizer, which 
needs fixing for preset lambda matrix.
{quote}

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest to add an initialModel as a start point for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20767) The training continuation for saved LDA model

2017-05-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022375#comment-16022375
 ] 

yuhao yang commented on SPARK-20767:


Note there's already an issue about setInitialModel: 
https://issues.apache.org/jira/browse/SPARK-20082. [~cezden] Thanks for sharing 
your insights on online LDA. I'd appreciate it if you could help review or contribute.

> The training continuation for saved LDA model
> -
>
> Key: SPARK-20767
> URL: https://issues.apache.org/jira/browse/SPARK-20767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cezary Dendek
>Priority: Minor
>
> Current online implementation of the LDA model fit (OnlineLDAOptimizer) does 
> not support the model update (ie. to account for the population/covariates 
> drift) nor the continuation of model fitting in case of the insufficient 
> number of iterations.
> Technical aspects:
> 1. The implementation of LDA fitting does not currently allow the 
> coefficients pre-setting (private setter), as noted by a comment in the 
> source code of OnlineLDAOptimizer.setLambda: "This is only used for testing 
> now. In the future, it can help support training stop/resume".
> 2. The lambda matrix is always randomly initialized by the optimizer, which 
> needs fixing for preset lambda matrix.
> The adaptation of the classes by the user is not possible due to protected 
> setters & sealed / final classes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20864) I tried to run spark mllib PIC algorithm, but got error

2017-05-23 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022345#comment-16022345
 ] 

yuhao yang commented on SPARK-20864:


[~yuanjie] Could you please provide more code to help the investigation? From 
the exception it looks like the issue is not caused by the algorithm, but by 
something in the data processing.
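
For comparison, a minimal sketch of how triples like the sample data above are normally fed to the mllib PowerIterationClustering implementation (assuming the three columns are srcId, dstId and similarity, and sc is an existing SparkContext):

{code}
import org.apache.spark.mllib.clustering.PowerIterationClustering

// (srcId, dstId, similarity) triples mirroring the sample data above
val similarities = sc.parallelize(Seq(
  (1L, 2L, 3.0), (2L, 1L, 3.0), (3L, 1L, 3.0),
  (4L, 5L, 2.0), (4L, 6L, 2.0), (5L, 6L, 2.0)
))

val model = new PowerIterationClustering()
  .setK(2)
  .setMaxIterations(20)
  .run(similarities)

model.assignments.collect().foreach { a =>
  println(s"${a.id} -> cluster ${a.cluster}")
}
{code}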

> I tried to run spark mllib PIC algorithm, but got error
> ---
>
> Key: SPARK-20864
> URL: https://issues.apache.org/jira/browse/SPARK-20864
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: yuanjie
>Priority: Blocker
>
> I use a very simple data:
> 1 2 3
> 2 1 3
> 3 1 3
> 4 5 2
> 4 6 2
> 5 6 2
> but when running I got:
> Exception in thread "main" : java.io.IOException: 
> com.google.protobuf.ServiceException: java.lang.UnsupportedOperationException 
> :This is supposed to be overridden by subclasses
> why?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param

2017-05-18 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016116#comment-16016116
 ] 

yuhao yang commented on SPARK-20768:


Thanks for the ping, [~mlnick]. We should just treat it as an expert param. 
In my impression, it would normally still be exposed as a Param on the Python side.

> PySpark FPGrowth does not expose numPartitions (expert)  param
> --
>
> Key: SPARK-20768
> URL: https://issues.apache.org/jira/browse/SPARK-20768
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. 
> While it is an "expert" param, the general approach elsewhere is to expose 
> these on the Python side (e.g. {{aggregationDepth}} and intermediate storage 
> params in {{ALS}})



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-05-18 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016061#comment-16016061
 ] 

yuhao yang commented on SPARK-20797:


[~d0evi1] Thanks for reporting the issue and proposing a fix. Would you 
send a PR for it?

> mllib lda's LocalLDAModel's save: out of memory. 
> -
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> When I train an online LDA model on large text data (nearly 1 billion Chinese 
> news abstracts), the training step goes well, but the save step fails. 
> Something like the following happens (e.g. on 1.6.1):
> Problem 1: the serialized data is bigger than spark.kryoserializer.buffer.max 
> (raising the param works around problem 1, but then leads to problem 2).
> Problem 2: the message exceeds spark.akka.frameSize (raising this param too far 
> fails with out of memory; on versions > 2.0.0 the error is "exceeds max 
> allowed: spark.rpc.message.maxSize").
> The problem appears when the number of topics is large (k=200 is OK, but k=300 
> fails) and the vocab size is also large (nearly 1,000,000).
> I found that Word2Vec's save function is similar to LocalLDAModel's save 
> function:
> Word2Vec's problem (using repartition(1) to save) has been fixed in 
> [https://github.com/apache/spark/pull/9989], but LocalLDAModel still uses 
> repartition(1), i.e. a single partition, when saving.
> word2vec's  save method from latest code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   
> spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> but the code in mllib.clustering.LDAModel's LocalLDAModel's save:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
> you'll see:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
> Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), 
> topicInd)
>   }
>   
> spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> Following Word2Vec's save (repartition(nPartitions)), I replaced numWords with 
> the topic count k and used repartition(nPartitions) in LocalLDAModel's save 
> method, recompiled the code, and deployed the new LDA build with large data on 
> our cluster; it works.
> I hope this will be fixed in the next version.
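
For reference, a rough sketch of the proposed change inside LocalLDAModel's save, mirroring the Word2Vec approach (this is only a sketch; the bufferSize lookup and the size estimate are assumptions, not the actual patch, and the snippet relies on sc, k, topicsMatrix, topics, Loader and path already being in scope in that method):

{code}
// Inside LocalLDAModel's save, replace repartition(1) with a partition count
// derived from the matrix size, as Word2Vec's save does.
val bufferSize = sc.getConf.getSizeAsBytes("spark.kryoserializer.buffer.max", "64m")
// topicsMatrix is vocabSize x k, stored as 8-byte doubles
val approxSize = (8L * topicsMatrix.numRows + 15) * k
val nPartitions = ((approxSize / bufferSize) + 1).toInt

spark.createDataFrame(topics)
  .repartition(nPartitions)
  .write.parquet(Loader.dataPath(path))
{code}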



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20670) Simplify FPGrowth transform

2017-05-08 Thread yuhao yang (JIRA)
yuhao yang created SPARK-20670:
--

 Summary: Simplify FPGrowth transform
 Key: SPARK-20670
 URL: https://issues.apache.org/jira/browse/SPARK-20670
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


As suggested by [~srowen] in https://github.com/apache/spark/pull/17130, the 
transform code in FPGrowthModel can be simplified. 





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20602) Adding LBFGS as optimizer for LinearSVC

2017-05-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997314#comment-15997314
 ] 

yuhao yang commented on SPARK-20602:


cc [~josephkb]

> Adding LBFGS as optimizer for LinearSVC
> ---
>
> Key: SPARK-20602
> URL: https://issues.apache.org/jira/browse/SPARK-20602
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Currently LinearSVC in Spark only supports OWLQN as the optimizer ( check 
> https://issues.apache.org/jira/browse/SPARK-14709). I made comparison between 
> LBFGS and OWLQN on several public dataset and found LBFGS converges much 
> faster for LinearSVC in most cases.
> The following table presents the number of training iterations and f1 score 
> of both optimizers until convergence
> ||Dataset||LBFGS||OWLQN||
> |news20.binary| 31 (0.99) | 413(0.99) |
> |mushroom| 28(1.0) | 170(1.0)|
> |madelon|143(0.75) | 8129(0.70)|
> |breast-cancer-scale| 15(1.0) | 16(1.0)|
> |phishing | 329(0.94) | 231(0.94) |
> |a1a(adult) | 466 (0.87) | 282 (0.87) |
> |a7a | 237 (0.84) | 372(0.84) |
> data source: 
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
> training code: new LinearSVC().setMaxIter(1).setTol(1e-6)
> LBFGS requires less iterations in most cases (except for a1a) and probably is 
> a better default optimizer. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20602) Adding LBFGS as optimizer for LinearSVC

2017-05-04 Thread yuhao yang (JIRA)
yuhao yang created SPARK-20602:
--

 Summary: Adding LBFGS as optimizer for LinearSVC
 Key: SPARK-20602
 URL: https://issues.apache.org/jira/browse/SPARK-20602
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang


Currently LinearSVC in Spark only supports OWLQN as the optimizer (see 
https://issues.apache.org/jira/browse/SPARK-14709). I made a comparison between 
LBFGS and OWLQN on several public datasets and found that LBFGS converges much 
faster for LinearSVC in most cases.

The following table presents the number of training iterations and f1 score of 
both optimizers until convergence

||Dataset||LBFGS||OWLQN||
|news20.binary| 31 (0.99) | 413(0.99) |
|mushroom| 28(1.0) | 170(1.0)|
|madelon|143(0.75) | 8129(0.70)|
|breast-cancer-scale| 15(1.0) | 16(1.0)|
|phishing | 329(0.94) | 231(0.94) |
|a1a(adult) | 466 (0.87) | 282 (0.87) |
|a7a | 237 (0.84) | 372(0.84) |

data source: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
training code: new LinearSVC().setMaxIter(1).setTol(1e-6)

LBFGS requires fewer iterations in most cases (except for a1a) and is probably a 
better default optimizer. 
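
For reference, a minimal sketch of the kind of training run used for the comparison above (the dataset path and the maxIter cap are illustrative, labels are assumed to be 0/1, and spark is an existing SparkSession):

{code}
import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val data = spark.read.format("libsvm").load("path/to/binary_classification_data")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

val model = new LinearSVC()
  .setMaxIter(10000)   // generous cap so the tolerance decides when to stop (illustrative value)
  .setTol(1e-6)
  .fit(train)

val f1 = new MulticlassClassificationEvaluator()
  .setMetricName("f1")
  .evaluate(model.transform(test))
println(s"f1 = $f1")
{code}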





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20526) Load doesn't work in PCAModel

2017-04-28 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989591#comment-15989591
 ] 

yuhao yang commented on SPARK-20526:


Can you please provide more context, like which versions of Spark you used 
for saving and loading, respectively? Perhaps also share the save/load code. You 
can also check explainedVariance in the PCAModel to see if it's null.

> Load doesn't work in PCAModel 
> --
>
> Key: SPARK-20526
> URL: https://issues.apache.org/jira/browse/SPARK-20526
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
> Environment: Windows
>Reporter: Hayri Volkan Agun
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Error occurs during loading PCAModel. Saved model doesn't load.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20502) ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final, sealed audit

2017-04-28 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989317#comment-15989317
 ] 

yuhao yang commented on SPARK-20502:


Check https://issues.apache.org/jira/browse/SPARK-18319 for the previous 
discussion. I updated the list according to the changes we made last release. So 
far I don't think we need to make any changes to the sealed and Experimental 
APIs, but I listed some final classes we have in ml which may be ready to be 
unmarked. 

sealed: 
org.apache.spark.ml.attribute.Attribute
org.apache.spark.ml.attribute.AttributeType
org.apache.spark.ml.classification.LogisticRegressionTrainingSummary
org.apache.spark.ml.classification.LogisticRegressionSummary
org.apache.spark.ml.feature.Term
org.apache.spark.ml.feature.InteractableTerm
org.apache.spark.ml.optim.WeightedLeastSquares.Solver
org.apache.spark.ml.optim.NormalEquationSolver
org.apache.spark.ml.tree.Node
org.apache.spark.ml.tree.Split
org.apache.spark.ml.util.BaseReadWrite
org.apache.spark.ml.linalg.Matrix
org.apache.spark.ml.linalg.Vector
org.apache.spark.mllib.stat.test.StreamingTestMethod
org.apache.spark.mllib.tree.model.TreeEnsembleModel

Experimental:
org.apache.spark.ml.classification.LinearSVC
org.apache.spark.ml.classification.LinearSVCModel
org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary
org.apache.spark.ml.classification.BinaryLogisticRegressionSummary
org.apache.spark.ml.clustering.ClusteringSummary
org.apache.spark.ml.clustering.BisectingKMeansSummary
org.apache.spark.ml.clustering.GaussianMixtureSummary
org.apache.spark.ml.clustering.KMeansSummary
org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
org.apache.spark.ml.evaluation.RegressionEvaluator
org.apache.spark.ml.feature.BucketedRandomProjectionLSH(Model)
org.apache.spark.ml.feature.Imputer(Model)
org.apache.spark.ml.feature.MinHash(Model)
org.apache.spark.ml.feature.RFormula(Model)
org.apache.spark.ml.fpm.FPGrowth(Model)
org.apache.spark.ml.regression.AFTSurvivalRegression(Model)
org.apache.spark.ml.regression.GeneralizedLinearRegression(Model) and summary
org.apache.spark.ml.regression.LinearRegressionTrainingSummary
org.apache.spark.ml.stat.ChiSquareTest
org.apache.spark.ml.stat.ChiSquareTest

Developer API:
Most DeveloperApi items are the basic components of the ML pipeline, such as 
Transformer, Estimator, PipelineStage, Params and Attributes, which I don't see 
any need to change.

final class:
org.apache.spark.ml.classification.OneVsRest
org.apache.spark.ml.evaluation.RegressionEvaluator
org.apache.spark.ml.feature.Binarizer
org.apache.spark.ml.feature.Bucketizer
org.apache.spark.ml.feature.ChiSqSelector
org.apache.spark.ml.feature.IDF
org.apache.spark.ml.feature.QuantileDiscretizer
org.apache.spark.ml.feature.VectorSlicer
org.apache.spark.ml.feature.Word2Vec
org.apache.spark.ml.param.ParamMap

Most of the final classes here should be ready to be unmarked. I also checked 
final methods and fields (mostly params), which can be kept the same for now.





> ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-20502
> URL: https://issues.apache.org/jira/browse/SPARK-20502
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20351) Add trait hasTrainingSummary to replace the duplicate code

2017-04-16 Thread yuhao yang (JIRA)
yuhao yang created SPARK-20351:
--

 Summary: Add trait hasTrainingSummary to replace the duplicate code
 Key: SPARK-20351
 URL: https://issues.apache.org/jira/browse/SPARK-20351
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


Add a trait HasTrainingSummary to avoid code duplication related to training 
summaries. 
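
A minimal sketch of what such a trait could look like (the names and details are illustrative, not the final design):

{code}
import org.apache.spark.SparkException

// Illustrative shape only: models would mix this in instead of each
// re-implementing the summary / hasSummary / setSummary boilerplate.
trait HasTrainingSummary[S] {
  protected var trainingSummary: Option[S] = None

  /** True iff a training summary is available for this model instance. */
  def hasSummary: Boolean = trainingSummary.isDefined

  /** The training summary; fails if the model was loaded rather than freshly fit. */
  def summary: S = trainingSummary.getOrElse {
    throw new SparkException(s"No training summary available for ${this.getClass.getSimpleName}")
  }

  protected def setSummary(summary: Option[S]): this.type = {
    this.trainingSummary = summary
    this
  }
}
{code}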



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC

2017-04-15 Thread yuhao yang (JIRA)
yuhao yang created SPARK-20348:
--

 Summary: Support squared hinge loss (L2 loss) for LinearSVC
 Key: SPARK-20348
 URL: https://issues.apache.org/jira/browse/SPARK-20348
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


While Hinge loss is the standard loss function for linear SVM, Squared hinge 
loss (a.k.a. L2 loss) is also popular in practice. L2-SVM is differentiable and 
imposes a bigger (quadratic vs. linear) loss for points which violate the 
margin. Some introduction can be found from 
http://mccormickml.com/2015/01/06/what-is-an-l2-svm/

Liblinear and [scikit 
learn|http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html]
 both offer squared hinge loss as the default loss function for linear SVM. 
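
For intuition, a tiny sketch comparing the two losses as a function of the margin:

{code}
// margin = y * (w dot x + b), with y in {-1, +1}
def hingeLoss(margin: Double): Double = math.max(0.0, 1.0 - margin)
def squaredHingeLoss(margin: Double): Double = {
  val h = math.max(0.0, 1.0 - margin)
  h * h
}

// Points that violate the margin are penalized quadratically instead of linearly.
Seq(2.0, 1.0, 0.5, 0.0, -1.0).foreach { m =>
  println(f"margin=$m%5.1f  hinge=${hingeLoss(m)}%4.1f  squared hinge=${squaredHingeLoss(m)}%4.1f")
}
{code}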



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7128) Add generic bagging algorithm to spark.ml

2017-04-11 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965121#comment-15965121
 ] 

yuhao yang commented on SPARK-7128:
---

I would vote for adding this now. 

This is quite helpful in practical applications like fraud detection, and 
feynmanliang has started with a solid prototype. I can help finish it if this 
is on the roadmap.

> Add generic bagging algorithm to spark.ml
> -
>
> Key: SPARK-7128
> URL: https://issues.apache.org/jira/browse/SPARK-7128
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Bagging algorithm 
> which can work with any Classifier or Regressor.  Creating this feature will 
> require researching the possible variants and extensions of bagging which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation

2017-04-09 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-20271:
---
Description: 
Just to share some code I implemented to help easily create a custom 
Transformer in one line of code w.
{code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 
else 0) {code}

This was used in many of my projects and is pretty helpful (Maybe I'm lazy..). 
The transformer can be saved/loaded as other transformer and can be integrated 
into a pipeline normally.  It can be used widely in many use cases like 
conditional conversion(if...else...), , type conversion, to/from Array, to/from 
Vector and many string ops..
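
A hypothetical end-to-end usage sketch, assuming the proposed FuncTransformer also exposes the usual inputCol/outputCol params (which the pipeline integration above implies); training is an assumed DataFrame with "rating" and "features" columns:

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression

// Hypothetical usage: binarize a "rating" column into the "label" column,
// then feed the result to an ordinary estimator inside one pipeline.
val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1.0 else 0.0)
  .setInputCol("rating")      // assumed param, following the shared-param convention
  .setOutputCol("label")

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(labelConverter, lr))
val model = pipeline.fit(training)
{code}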




  was:
Just to share some code I implemented to help easily create a custom 
Transformer in one line of code w.
{code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 
else 0) {code}

This was used in many of my projects and is pretty helpful (Maybe I'm lazy..). 
The transformer can be saved/loaded as other transformer and can be integrated 
into a pipeline normally. It can be used widely in many use cases and you can 
find some examples in the PR.





> Add FuncTransformer to simplify custom transformer creation
> ---
>
> Key: SPARK-20271
> URL: https://issues.apache.org/jira/browse/SPARK-20271
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Just to share some code I implemented to help easily create a custom 
> Transformer in one line of code:
> {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 
> else 0) {code}
> This was used in many of my projects and is pretty helpful (maybe I'm 
> lazy..). The transformer can be saved/loaded like any other transformer and can 
> be integrated into a pipeline normally. It can be used widely in many use 
> cases, such as conditional conversion (if...else...), type conversion, to/from 
> Array, to/from Vector and many string ops.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation

2017-04-09 Thread yuhao yang (JIRA)
yuhao yang created SPARK-20271:
--

 Summary: Add FuncTransformer to simplify custom transformer 
creation
 Key: SPARK-20271
 URL: https://issues.apache.org/jira/browse/SPARK-20271
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang


Just to share some code I implemented to help easily create a custom 
Transformer in one line of code:
{code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 
else 0) {code}

This was used in many of my projects and is pretty helpful (maybe I'm lazy..). 
The transformer can be saved/loaded like any other transformer and can be 
integrated into a pipeline normally. It can be used widely in many use cases and 
you can find some examples in the PR.






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-04-06 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959368#comment-15959368
 ] 

yuhao yang commented on SPARK-20082:


Sorry I'm occupied by some internal project this week. I'll find some time to 
look into it this weekend or early next week.

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest to add an initialModel as a start point for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan

2017-04-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955705#comment-15955705
 ] 

yuhao yang commented on SPARK-20203:


[~Syrux] Since you have some experience using PrefixSpan, I'd like to have 
your input (or better, a contribution) on 
https://issues.apache.org/jira/browse/SPARK-20114 . 

> Change default maxPatternLength value to Int.MaxValue in PrefixSpan
> ---
>
> Key: SPARK-20203
> URL: https://issues.apache.org/jira/browse/SPARK-20203
> Project: Spark
>  Issue Type: Wish
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Cyril de Vogelaere
>Priority: Trivial
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> I think changing the default value to Int.MaxValue would be more user 
> friendly. At least for new users.
> Personally, when I run an algorithm, I expect it to find all solution by 
> default. And a limited number of them, when I set the parameters to do so.
> The current implementation limit the length of solution patterns to 10.
> Thus preventing all solution to be printed when running slightly large 
> datasets.
> I feel like that should be changed, but since this would change the default 
> behavior of PrefixSpan. I think asking for the communities opinion should 
> come first. So, what do you think ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20180) Unlimited max pattern length in Prefix span

2017-04-01 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15952377#comment-15952377
 ] 

yuhao yang edited comment on SPARK-20180 at 4/1/17 8:14 PM:


I assume users can achieve the same effect by setting maxPatternLength to a 
larger value, so the jira is really about changing the default behavior of 
PrefixSpan. 
Is there more background or context available, like why the current default 
length (10) is not good in practice? Thanks. We also need to consider the 
performance on larger datasets (in count and dimension).
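
For reference, a minimal sketch of setting the limit explicitly with the current mllib API (the sequence data is illustrative and sc is an existing SparkContext); note that a very large limit can be expensive on big or high-dimensional datasets:

{code}
import org.apache.spark.mllib.fpm.PrefixSpan

// Each sequence is an Array of itemsets; each itemset is an Array of items.
val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2))
), 2)

val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(Int.MaxValue)   // effectively unlimited, instead of the default 10
val model = prefixSpan.run(sequences)

model.freqSequences.collect().foreach { fs =>
  println(s"${fs.sequence.map(_.mkString("[", ",", "]")).mkString(" ")} : ${fs.freq}")
}
{code}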


was (Author: yuhaoyan):
I assume users can achieve the same effect by setting maxPatternLength to a 
larger value, so the jira is really about changing the default behavior of 
PrefixSpan. 
Is there more background or context available, like why the current default 
length (10) is not good in practice? Thanks.

> Unlimited max pattern length in Prefix span
> ---
>
> Key: SPARK-20180
> URL: https://issues.apache.org/jira/browse/SPARK-20180
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Cyril de Vogelaere
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Right now, we need to use the .setMaxPatternLength() method to
> specify the maximum pattern length of a sequence. Any pattern longer than 
> that won't be output.
> The current default maxPatternLength value is 10.
> This should be changed so that with input 0, patterns of any length would 
> be output. Additionally, the default value should be changed to 0, so that 
> a new user could find all patterns in their dataset without looking at this 
> parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20180) Unlimited max pattern length in Prefix span

2017-04-01 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15952377#comment-15952377
 ] 

yuhao yang commented on SPARK-20180:


I assume users can achieve the same effect by setting maxPatternLength to a 
larger value, so the jira is really about changing the default behavior of 
PrefixSpan. 
Is there more background or context available, like why the current default 
length (10) is not good in practice? Thanks.

> Unlimited max pattern length in Prefix span
> ---
>
> Key: SPARK-20180
> URL: https://issues.apache.org/jira/browse/SPARK-20180
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Cyril de Vogelaere
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Right now, we need to use the .setMaxPatternLength() method to
> specify the maximum pattern length of a sequence. Any pattern longer than 
> that won't be output.
> The current default maxPatternLength value is 10.
> This should be changed so that with input 0, patterns of any length would 
> be output. Additionally, the default value should be changed to 0, so that 
> a new user could find all patterns in their dataset without looking at this 
> parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-03-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944239#comment-15944239
 ] 

yuhao yang edited comment on SPARK-20114 at 3/27/17 11:42 PM:
--

Currently I prefer to implement the dummy PrefixSpanModel, as the sequential 
rules extracted won't be very useful. If needed, we can implement other 
algorithms to extract sequential rules for prediction.


was (Author: yuhaoyan):
Currently I prefer to implement the dummy PrefixSpanModel, as the sequential 
rules extracted won't be very useful. 

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First, list a few design issues to be discussed; then subtasks like the Scala, 
> Python and R APIs will be created.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited to predicting on new records directly. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, the 
> options are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input DataSet. The PrefixSpanModel is only used to provide 
> access to frequent sequential patterns.
>  #*  Add the feature to extract sequential rules from sequential 
> patterns, then use the sequential rules in transform() as FPGrowthModel does. 
> The rules extracted are of the form X –> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Unlike association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth and ERMiner. The rules are X –> Y where X is unordered and Y is 
> unordered, but X must appear before Y, which is more general and can work 
> better in practice for prediction. 
> I'd like to hear more from users to see which kind of sequential rules 
> is more practical. 
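
To make the dummy-transform option discussed here concrete, a rough sketch of what the spark.ml shape could look like, mirroring FPGrowth. Every class and method name below is hypothetical; nothing like this exists in org.apache.spark.ml.fpm yet, so treat it purely as a discussion aid.

{code:scala}
// Hypothetical spark.ml API for discussion only; these classes do not exist yet.
import org.apache.spark.ml.fpm.{PrefixSpan, PrefixSpanModel}

val prefixSpan = new PrefixSpan()
  .setSequenceCol("sequence")      // column holding Array[Array[_]] sequences
  .setMinSupport(0.5)
  .setMaxPatternLength(10)

val model: PrefixSpanModel = prefixSpan.fit(df)

// Option 1 (dummy transform): the model only exposes the mined patterns as a
// DataFrame, and transform() returns the input unchanged (no prediction column).
model.freqSequences.show()          // columns: sequence, freq
val unchanged = model.transform(df) // same schema as df
{code}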



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-03-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944239#comment-15944239
 ] 

yuhao yang commented on SPARK-20114:


Currently I prefer to implement the dummy PrefixSpanModel as the sequential 
rules extracted won't be quite useful. 

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First, list a few design issues to be discussed; then subtasks like the Scala, 
> Python and R APIs will be created.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited to predicting on new records directly. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, the 
> options are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input DataSet. The PrefixSpanModel is only used to provide 
> access to frequent sequential patterns.
>  #*  Add the feature to extract sequential rules from sequential 
> patterns, then use the sequential rules in transform() as FPGrowthModel does. 
> The rules extracted are of the form X –> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Unlike association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth and ERMiner. The rules are X –> Y where X is unordered and Y is 
> unordered, but X must appear before Y, which is more general and can work 
> better in practice for prediction. 
> I'd like to hear more from users to see which kind of sequential rules 
> is more practical. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-03-27 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-20114:
---
Description: 
Creating this jira to track the feature parity for PrefixSpan and sequential 
pattern mining in Spark ml with DataFrame API. 

First, list a few design issues to be discussed; then subtasks like the Scala, 
Python and R APIs will be created.

# Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which are not well suited to predicting on new records directly. Please 
read  
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, the 
options are:
 #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
new column to the input DataSet. The PrefixSpanModel is only used to provide 
access to frequent sequential patterns.
 #*  Add the feature to extract sequential rules from sequential 
patterns, then use the sequential rules in transform() as FPGrowthModel does. The 
rules extracted are of the form X –> Y where X and Y are sequential patterns. 
But in practice, these rules are not very good as they are too precise and thus 
not noise tolerant.
#  Unlike association rules and frequent itemsets, sequential rules can 
be extracted from the original dataset more efficiently using algorithms like 
RuleGrowth and ERMiner. The rules are X –> Y where X is unordered and Y is 
unordered, but X must appear before Y, which is more general and can work 
better in practice for prediction. 

I'd like to hear more from users to see which kind of sequential rules is 
more practical. 


  was:
Creating this jira to track the feature parity for PrefixSpan and sequential 
pattern mining in Spark ml with DataFrame API. 

First, list a few design issues to be discussed; then subtasks like the Scala, 
Python and R APIs will be created.

# Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which are not well suited to predicting on new records directly. Please 
read  
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, the 
options are:
 #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
new column to the input DataSet. 
 #*  Add the feature to extract sequential rules from sequential 
patterns, then use the sequential rules in transform() as FPGrowthModel does. The 
rules extracted are of the form X –> Y where X and Y are sequential patterns. 
But in practice, these rules are not very good as they are too precise and thus 
not noise tolerant.
#  Unlike association rules and frequent itemsets, sequential rules can 
be extracted from the original dataset more efficiently using algorithms like 
RuleGrowth and ERMiner. The rules are X –> Y where X is unordered and Y is 
unordered, but X must appear before Y, which is more general and can work 
better in practice for prediction. 

I'd like to hear more from users to see which kind of sequential rules is 
more practical. 



> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First, list a few design issues to be discussed; then subtasks like the Scala, 
> Python and R APIs will be created.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited to predicting on new records directly. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, the 
> options are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input DataSet. The PrefixSpanModel is only used to provide 
> access to frequent sequential patterns.
>  #*  Add the feature to extract sequential rules from sequential 
> patterns, then use the sequential rules in transform() as FPGrowthModel does.  

[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-03-27 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-20114:
---
Description: 
Creating this jira to track the feature parity for PrefixSpan and sequential 
pattern mining in Spark ml with DataFrame API. 

First, list a few design issues to be discussed; then subtasks like the Scala, 
Python and R APIs will be created.

# Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which are not well suited to predicting on new records directly. Please 
read  
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, the 
options are:
 #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
new column to the input DataSet. 
 #*  Add the feature to extract sequential rules from sequential 
patterns, then use the sequential rules in transform() as FPGrowthModel does. The 
rules extracted are of the form X –> Y where X and Y are sequential patterns. 
But in practice, these rules are not very good as they are too precise and thus 
not noise tolerant.
#  Unlike association rules and frequent itemsets, sequential rules can 
be extracted from the original dataset more efficiently using algorithms like 
RuleGrowth and ERMiner. The rules are X –> Y where X is unordered and Y is 
unordered, but X must appear before Y, which is more general and can work 
better in practice for prediction. 

I'd like to hear more from users to see which kind of sequential rules is 
more practical. 


  was:
Creating this jira to track the feature parity for PrefixSpan and sequential 
pattern mining in Spark ml with DataFrame API. 

First, list a few design issues to be discussed; then subtasks like Scala, 
Python and R will be created.

# Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which are not well suited to predicting on new records directly. Please 
read  
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, the 
options are:
 #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
new column to the input DataSet. 
 #*  Add the feature to extract sequential rules from sequential 
patterns, then use the sequential rules in transform() as FPGrowthModel does. The 
rules extracted are of the form X –> Y where X and Y are sequential patterns. 
But in practice, these rules are not very good as they are too precise and thus 
not noise tolerant.
#  Unlike association rules and frequent itemsets, sequential rules can 
be extracted from the original dataset more efficiently using algorithms like 
RuleGrowth and ERMiner. The rules are X –> Y where X is unordered and Y is 
unordered, but X must appear before Y, which is more general and can work 
better in practice for prediction. 

I'd like to hear more from users to see which kind of sequential rules is 
more practical. 



> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First, list a few design issues to be discussed; then subtasks like the Scala, 
> Python and R APIs will be created.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited to predicting on new records directly. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, the 
> options are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input DataSet. 
>  #*  Add the feature to extract sequential rules from sequential 
> patterns, then use the sequential rules in transform() as FPGrowthModel does. 
> The rules extracted are of the form X –> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not 

[jira] [Created] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-03-27 Thread yuhao yang (JIRA)
yuhao yang created SPARK-20114:
--

 Summary: spark.ml parity for sequential pattern mining - PrefixSpan
 Key: SPARK-20114
 URL: https://issues.apache.org/jira/browse/SPARK-20114
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang


Creating this jira to track the feature parity for PrefixSpan and sequential 
pattern mining in Spark ml with DataFrame API. 

First, list a few design issues to be discussed; then subtasks like Scala, 
Python and R will be created.

# Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which are not well suited to predicting on new records directly. Please 
read  
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, the 
options are:
 #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
new column to the input DataSet. 
 #*  Add the feature to extract sequential rules from sequential 
patterns, then use the sequential rules in transform() as FPGrowthModel does. The 
rules extracted are of the form X –> Y where X and Y are sequential patterns. 
But in practice, these rules are not very good as they are too precise and thus 
not noise tolerant.
#  Unlike association rules and frequent itemsets, sequential rules can 
be extracted from the original dataset more efficiently using algorithms like 
RuleGrowth and ERMiner. The rules are X –> Y where X is unordered and Y is 
unordered, but X must appear before Y, which is more general and can work 
better in practice for prediction. 

I'd like to hear more from users to see which kind of sequential rules is 
more practical. 




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major

2017-03-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943857#comment-15943857
 ] 

yuhao yang commented on SPARK-20083:


So the resulting array would let users mutate the matrix values directly. Is 
that intentional?

> Change matrix toArray to not create a new array when matrix is already column 
> major
> ---
>
> Key: SPARK-20083
> URL: https://issues.apache.org/jira/browse/SPARK-20083
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> {{toArray}} always creates a new array in column major format, even when the 
> resulting array is the same as the backing values. We should change this to 
> just return a reference to the values array when it is already column major.
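
A minimal sketch of the proposed short-circuit, written as a standalone helper rather than the actual change inside Matrix.toArray (the helper name is made up; the DenseMatrix fields used are part of the public ml.linalg API). It also illustrates the mutability concern raised in the comment above:

{code:scala}
import org.apache.spark.ml.linalg.DenseMatrix

// Sketch (not the actual Spark source): a DenseMatrix with isTransposed == false
// already stores its values in column-major order, so toArray could return the
// backing array directly instead of copying it.
def toArrayNoCopy(m: DenseMatrix): Array[Double] =
  if (!m.isTransposed) m.values   // already column major: return the backing array
  else m.toArray                  // row major: keep the existing copying path

// The trade-off: callers can now mutate the matrix through the returned reference.
val m = new DenseMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))
val a = toArrayNoCopy(m)
a(0) = 99.0                       // also changes m(0, 0)
println(m(0, 0))                  // prints 99.0
{code}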



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


