[jira] [Commented] (SPARK-22974) CountVectorModel does not attach attributes to output column
[ https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832691#comment-16832691 ]

yuhao yang commented on SPARK-22974:

On a business trip from April 29th to May 3rd. Please expect delayed email response. Contact +1 669 243 8273 for anything urgent. Thanks, Yuhao

> CountVectorModel does not attach attributes to output column
>
> Key: SPARK-22974
> URL: https://issues.apache.org/jira/browse/SPARK-22974
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.1
> Reporter: William Zhang
> Assignee: Liang-Chi Hsieh
> Priority: Major
> Fix For: 2.4.0
>
> If CountVectorizerModel transforms columns, the output column will not have
> attributes attached to it. If those output columns are later used in the
> Interaction transformer, an exception is thrown:
> {quote}"org.apache.spark.SparkException: Vector attributes must be defined
> for interaction."{quote}
> To reproduce it:
> {code}
> import org.apache.spark.ml.feature._
> import org.apache.spark.sql.functions._
>
> val df = spark.createDataFrame(Seq(
>   (0, Array("a", "b", "c"), Array("1", "2")),
>   (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3"))
> )).toDF("id", "words", "nums")
>
> val cvModel: CountVectorizerModel = new CountVectorizer()
>   .setInputCol("nums")
>   .setOutputCol("features2")
>   .setVocabSize(4)
>   .setMinDF(0)
>   .fit(df)
>
> val cvm = new CountVectorizerModel(Array("a", "b", "c"))
>   .setInputCol("words")
>   .setOutputCol("features1")
>
> val df1 = cvm.transform(df)
> val df2 = cvModel.transform(df1)
>
> val interaction = new Interaction()
>   .setInputCols(Array("features1", "features2"))
>   .setOutputCol("features")
> val df3 = interaction.transform(df2)
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791938#comment-16791938 ]

yuhao yang commented on SPARK-20082:

Yuhao is taking family bonding leave from March 7th to Apr 19th. Please expect delayed email response. Contact +86 13738085700 for anything urgent. Thanks, Yuhao

> Incremental update of LDA model, by adding initialModel as start point
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Mathieu DESPRIEE
> Priority: Major
>
> Some mllib models support an initialModel to start from and update it
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally
> update an existing model with batches of new documents.
> I suggest adding an initialModel as a start point for LDA.
[jira] [Updated] (SPARK-25011) Add PrefixSpan to __all__ in fpm.py
[ https://issues.apache.org/jira/browse/SPARK-25011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yuhao yang updated SPARK-25011:
-------------------------------
Summary: Add PrefixSpan to __all__ in fpm.py (was: Add PrefixSpan to __all__)

> Add PrefixSpan to __all__ in fpm.py
>
> Key: SPARK-25011
> URL: https://issues.apache.org/jira/browse/SPARK-25011
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.4.0
> Reporter: yuhao yang
> Assignee: yuhao yang
> Priority: Minor
> Fix For: 2.4.0
>
> Add PrefixSpan to __all__ in fpm.py
[jira] [Created] (SPARK-25011) Add PrefixSpan to __all__
yuhao yang created SPARK-25011:
-------------------------------
Summary: Add PrefixSpan to __all__
Key: SPARK-25011
URL: https://issues.apache.org/jira/browse/SPARK-25011
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 2.4.0
Reporter: yuhao yang

Add PrefixSpan to __all__ in fpm.py
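The fix matters because __all__ controls what a wildcard import exports. A minimal, self-contained illustration of the mechanism (a hypothetical module built in-memory, not Spark's actual fpm.py):

```python
# A name left out of __all__ is invisible to "from mod import *",
# which is exactly the symptom the JIRA describes for PrefixSpan.
import sys
import types

mod = types.ModuleType("fpm_demo")
exec(
    "__all__ = ['FPGrowth']\n"   # PrefixSpan omitted, as in the bug
    "class FPGrowth: pass\n"
    "class PrefixSpan: pass\n",
    mod.__dict__,
)
sys.modules["fpm_demo"] = mod

ns = {}
exec("from fpm_demo import *", ns)
print("FPGrowth" in ns)    # True
print("PrefixSpan" in ns)  # False: missing from __all__
```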
[jira] [Commented] (SPARK-23742) Filter out redundant AssociationRules
[ https://issues.apache.org/jira/browse/SPARK-23742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566326#comment-16566326 ]

yuhao yang commented on SPARK-23742:

[~maropu] Can you be more specific about the suggestion? E.g., how would it work with the example in the description? Thanks

> Filter out redundant AssociationRules
>
> Key: SPARK-23742
> URL: https://issues.apache.org/jira/browse/SPARK-23742
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Joseph K. Bradley
> Priority: Major
>
> AssociationRules can generate redundant rules such as:
> * (A) => C
> * (A,B) => C (redundant)
> It should optionally filter out redundant rules. It'd be nice to have it
> optional (but maybe defaulting to filtering) so that users could compare the
> confidences of more general vs. more specific rules.
[jira] [Commented] (SPARK-23742) Filter out redundant AssociationRules
[ https://issues.apache.org/jira/browse/SPARK-23742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564858#comment-16564858 ]

yuhao yang commented on SPARK-23742:

The redundant rule may have different confidence and support.

> Filter out redundant AssociationRules
>
> Key: SPARK-23742
> URL: https://issues.apache.org/jira/browse/SPARK-23742
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Joseph K. Bradley
> Priority: Major
>
> AssociationRules can generate redundant rules such as:
> * (A) => C
> * (A,B) => C (redundant)
> It should optionally filter out redundant rules. It'd be nice to have it
> optional (but maybe defaulting to filtering) so that users could compare the
> confidences of more general vs. more specific rules.
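Since the more specific rule can carry a different confidence and support, the filter has to stay optional. A hedged sketch of the proposed redundancy filter, using an illustrative (antecedent, consequent, confidence) tuple format rather than Spark's actual AssociationRules API:

```python
# A rule X => c is redundant when a more general rule Y => c exists
# with Y a proper subset of X, as in the (A) => C vs (A,B) => C example.
def filter_redundant(rules):
    kept = []
    for ante, cons, conf in rules:
        more_general = [
            (a, c, cf) for a, c, cf in rules
            if c == cons and set(a) < set(ante)
        ]
        if not more_general:
            kept.append((ante, cons, conf))
    return kept

rules = [
    (("A",), "C", 0.8),
    (("A", "B"), "C", 0.9),   # redundant: (A) => C is more general
    (("B",), "D", 0.7),
]
print(filter_redundant(rules))  # [(('A',), 'C', 0.8), (('B',), 'D', 0.7)]
```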
[jira] [Commented] (SPARK-15064) Locale support in StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-15064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502929#comment-16502929 ]

yuhao yang commented on SPARK-15064:

Yuhao will be OOF from May 29th to June 6th (annual leave and conference). Please expect delayed email response. Contact 669 243 8273 for anything urgent. Regards, Yuhao

> Locale support in StopWordsRemover
>
> Key: SPARK-15064
> URL: https://issues.apache.org/jira/browse/SPARK-15064
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
> Priority: Major
>
> We support case-insensitive filtering (default) in StopWordsRemover. However,
> case-insensitive matching depends on the locale and region, which cannot be
> explicitly set in StopWordsRemover. We should consider adding this support in
> MLlib.
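A small Python illustration of why naive lowercasing is not enough for case-insensitive matching: Unicode full case folding already differs from plain .lower() (German sharp s below), and genuinely locale-sensitive rules such as the Turkish dotted/dotless i require locale data (e.g. ICU) beyond this sketch:

```python
# "Straße".lower() keeps the sharp s, so it never matches a stop word
# spelled "strasse"; casefold() applies Unicode full case folding.
word = "Straße"
print(word.lower())     # straße
print(word.casefold())  # strasse

stopwords = {"strasse"}
print(word.lower() in stopwords)     # False
print(word.casefold() in stopwords)  # True
```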
[jira] [Commented] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes
[ https://issues.apache.org/jira/browse/SPARK-22943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328310#comment-16328310 ]

yuhao yang commented on SPARK-22943:

Thanks for the reply, yet I cannot see how a user can specify the output dimension right now.

> OneHotEncoder supports manual specification of categorySizes
>
> Key: SPARK-22943
> URL: https://issues.apache.org/jira/browse/SPARK-22943
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.2.0
> Reporter: yuhao yang
> Priority: Minor
>
> OHE should support configurable categorySizes, as n_values in
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html,
> which allows consistent and foreseeable conversion.
[jira] [Commented] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes
[ https://issues.apache.org/jira/browse/SPARK-22943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314412#comment-16314412 ]

yuhao yang commented on SPARK-22943:

Feel free to work on this, but I would suggest getting a green light from a committer first.

> OneHotEncoder supports manual specification of categorySizes
>
> Key: SPARK-22943
> URL: https://issues.apache.org/jira/browse/SPARK-22943
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.2.0
> Reporter: yuhao yang
> Priority: Minor
>
> OHE should support configurable categorySizes, as n_values in
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html,
> which allows consistent and foreseeable conversion.
[jira] [Created] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes
yuhao yang created SPARK-22943:
-------------------------------
Summary: OneHotEncoder supports manual specification of categorySizes
Key: SPARK-22943
URL: https://issues.apache.org/jira/browse/SPARK-22943
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor

OHE should support configurable categorySizes, as n_values in
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html,
which allows consistent and foreseeable conversion.
[jira] [Commented] (SPARK-19053) Supporting multiple evaluation metrics in DataFrame-based API: discussion
[ https://issues.apache.org/jira/browse/SPARK-19053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297887#comment-16297887 ]

yuhao yang commented on SPARK-19053:

Plan for further development:
1. Initial API and function parity with ML Evaluators. (This PR)
2. Python API.
3. Function parity with MLlib Metrics.
4. Add requested enhancements, such as weight support, per-row metrics, and ranking metrics.
5. Reorganize the classification Metrics hierarchy, so that BinaryClassificationMetrics can support the metrics in MulticlassMetrics (accuracy, recall, etc.).
6. Possibly use it in training summaries.

> Supporting multiple evaluation metrics in DataFrame-based API: discussion
>
> Key: SPARK-19053
> URL: https://issues.apache.org/jira/browse/SPARK-19053
> Project: Spark
> Issue Type: Brainstorming
> Components: ML
> Reporter: Joseph K. Bradley
>
> This JIRA is to discuss supporting the computation of multiple evaluation
> metrics efficiently in the DataFrame-based API for MLlib.
> In the RDD-based API, RegressionMetrics and other *Metrics classes support
> efficient computation of multiple metrics.
> In the DataFrame-based API, there are a few options:
> * model/result summaries (e.g., LogisticRegressionSummary): These currently
> provide the desired functionality, but they require a model and do not let
> users compute metrics manually from DataFrames of predictions and true labels.
> * Evaluator classes (e.g., RegressionEvaluator): These only support computing
> a single metric in one pass over the data, but they do not require a model.
> * new class analogous to Metrics: We could introduce a class analogous to
> Metrics. Model/result summaries could use this internally as a replacement
> for spark.mllib Metrics classes, or they could (maybe) inherit from these
> classes.
> Thoughts?
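For context on item 3 (parity with MLlib Metrics): the RDD-based *Metrics classes derive several metrics from sufficient statistics gathered in one pass over the data. A hedged, pure-Python sketch of that pattern (illustrative only, not Spark's API):

```python
import math

def regression_metrics(pairs):
    # One pass over (prediction, label) pairs accumulates sufficient
    # statistics; several metrics are then derived without rescanning.
    n = sse = sae = label_sum = label_sq_sum = 0.0
    for pred, label in pairs:
        err = pred - label
        n += 1
        sse += err * err
        sae += abs(err)
        label_sum += label
        label_sq_sum += label * label
    mean = label_sum / n
    ss_tot = label_sq_sum - n * mean * mean
    return {
        "rmse": math.sqrt(sse / n),
        "mae": sae / n,         # mean absolute error
        "r2": 1.0 - sse / ss_tot,
    }

m = regression_metrics([(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)])
print(m["mae"])  # 0.5
```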
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275723#comment-16275723 ]

yuhao yang commented on SPARK-8418:

Seconding Nick's comments.

> Add single- and multi-value support to ML Transformers
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication
[jira] [Commented] (SPARK-22331) Make MLlib string params case-insensitive
[ https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269169#comment-16269169 ]

yuhao yang commented on SPARK-22331:

Thanks for the interest, [~smurakozi]. I tried to support this with StringParams (see the related JIRA), but it has not received any feedback, so feel free to start with other options.

> Make MLlib string params case-insensitive
>
> Key: SPARK-22331
> URL: https://issues.apache.org/jira/browse/SPARK-22331
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.2.0
> Reporter: yuhao yang
> Priority: Minor
>
> Some String params in ML are still case-sensitive, as they are checked by
> ParamValidators.inArray.
> For consistency in user experience, there should be some general guideline on
> whether String params in Spark MLlib are case-insensitive or not.
> I'm leaning towards making all String params case-insensitive where possible.
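One of the simpler alternative options is a case-insensitive variant of the inArray-style check. A hedged sketch of the idea (illustrative, not Spark's actual ParamValidators API):

```python
# Returns a validator that accepts a value if its lowercased form is in
# the lowercased allowed set, making the string param case-insensitive.
def in_array_case_insensitive(allowed):
    lowered = {v.lower() for v in allowed}
    return lambda value: value.lower() in lowered

is_valid = in_array_case_insensitive(["gini", "entropy"])
print(is_valid("Gini"))     # True
print(is_valid("ENTROPY"))  # True
print(is_valid("mse"))      # False
```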
[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259587#comment-16259587 ]

yuhao yang commented on SPARK-22427:

I tried with larger-scale data but did not repro the issue. [~lyt] Can you please provide a reference for your dataset, or some size info? Thanks.

> StackOverFlowError when using FPGrowth
>
> Key: SPARK-22427
> URL: https://issues.apache.org/jira/browse/SPARK-22427
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.2.0
> Environment: Centos Linux 3.10.0-327.el7.x86_64
> java 1.8.0.111
> spark 2.2.0
> Reporter: lyt
>
> code part:
> {code}
> val path = jobConfig.getString("hdfspath")
> val vectordata = sc.sparkContext.textFile(path)
> val finaldata = sc.createDataset(vectordata.map(obj => {
>   obj.split(" ")
> }).filter(arr => arr.length > 0)).toDF("items")
> val fpg = new FPGrowth()
> fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence)
> val train = fpg.fit(finaldata)
> print(train.freqItemsets.count())
> print(train.associationRules.count())
> train.save("/tmp/FPGModel")
> {code}
> And encountered the following exception:
> {code}
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
> at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
> at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
> at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
> at DataMining.FPGrowth$.runJob(FPGrowth.scala:116)
> at DataMining.testFPG$.main(FPGrowth.scala:36)
> at DataMining.testFPG.main(FPGrowth.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at
> {code}
[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249017#comment-16249017 ]

yuhao yang commented on SPARK-22427:

Hi [~lyt], does increasing the stack size resolve your issue? If not, I will look into it.

> StackOverFlowError when using FPGrowth
>
> Key: SPARK-22427
> URL: https://issues.apache.org/jira/browse/SPARK-22427
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.2.0
> Environment: Centos Linux 3.10.0-327.el7.x86_64
> java 1.8.0.111
> spark 2.2.0
> Reporter: lyt
[jira] [Created] (SPARK-22502) OnlineLDAOptimizer variationalTopicInference might be able to handle empty documents
yuhao yang created SPARK-22502:
-------------------------------
Summary: OnlineLDAOptimizer variationalTopicInference might be able to handle empty documents
Key: SPARK-22502
URL: https://issues.apache.org/jira/browse/SPARK-22502
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Trivial

Currently we assume OnlineLDAOptimizer.variationalTopicInference cannot take empty documents, and a few checks were added during training and inference. Yet I tested, and in my local environment, sending empty vectors to OnlineLDAOptimizer.variationalTopicInference does not trigger any error. If this is true, maybe we can remove the extra checks. Please be cautious: compared with the gain (some code cleanup and a small performance improvement), we do want to avoid a regression.
[jira] [Commented] (SPARK-18755) Add Randomized Grid Search to Spark ML
[ https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247870#comment-16247870 ]

yuhao yang commented on SPARK-18755:

Thanks for all the interest. For anyone who wants to contribute on this item, IMO we need to support the randomized grid search function as in sklearn or other popular libraries. The initial PR can start with a basic prototype, but should contain plans for supporting future extension toward function parity. Also, since we add the randomized search primarily to speed up the tuning process, it's best if we can present some benchmarks on a public dataset to demonstrate the effectiveness. Also cc [~srowen] [~mlnick] [~yanboliang] [~holdenk] to see if anyone has bandwidth to shepherd this. I can help review. Thanks.

> Add Randomized Grid Search to Spark ML
>
> Key: SPARK-18755
> URL: https://issues.apache.org/jira/browse/SPARK-18755
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: yuhao yang
>
> Randomized grid search implements a randomized search over parameters, where
> each setting is sampled from a distribution over possible parameter values.
> This has two main benefits over an exhaustive search:
> 1. A budget can be chosen independent of the number of parameters and
> possible values.
> 2. Adding parameters that do not influence the performance does not decrease
> efficiency.
> Randomized grid search usually gives a similar result to exhaustive search,
> while the run time for randomized search is drastically lower.
> For more background, please refer to:
> sklearn: http://scikit-learn.org/stable/modules/grid_search.html
> http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
> http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
> https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/
> There are two ways to implement this in Spark as I see it:
> 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during
> build. Only 1 new public function is required.
> 2. Add trait RandomizedSearch and create new classes RandomizedCrossValidator
> and RandomizedTrainValidationSplit, which can be complicated since we need to
> deal with the models.
> I'd prefer option 1 as it's much simpler and more straightforward. We can
> support randomized grid search via a minimal change.
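Option 1 boils down to building the full cartesian grid and keeping a random fraction of it. A hedged Python sketch of that idea (the searchRatio name comes from the proposal; this is not Spark's actual ParamGridBuilder API):

```python
import itertools
import random

def random_grid(param_grid, search_ratio, seed=None):
    # Build the full cartesian product of parameter values, then keep a
    # random sample sized by search_ratio, mirroring the proposed
    # searchRatio on ParamGridBuilder.
    combos = [
        dict(zip(param_grid, values))
        for values in itertools.product(*param_grid.values())
    ]
    k = max(1, int(len(combos) * search_ratio))
    return random.Random(seed).sample(combos, k)

grid = {"regParam": [0.01, 0.1, 1.0], "maxIter": [10, 50, 100]}
subset = random_grid(grid, search_ratio=0.5, seed=42)
print(len(subset))  # 4 of the 9 full-grid settings
```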
[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237174#comment-16237174 ] yuhao yang commented on SPARK-22427: Could you please try to increase the stack size, E.g. with -Xss10m ? > StackOverFlowError when using FPGrowth > -- > > Key: SPARK-22427 > URL: https://issues.apache.org/jira/browse/SPARK-22427 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.2.0 > Environment: Centos Linux 3.10.0-327.el7.x86_64 > java 1.8.0.111 > spark 2.2.0 >Reporter: lyt >Priority: Normal > > code part: > val path = jobConfig.getString("hdfspath") > val vectordata = sc.sparkContext.textFile(path) > val finaldata = sc.createDataset(vectordata.map(obj => { > obj.split(" ") > }).filter(arr => arr.length > 0)).toDF("items") > val fpg = new FPGrowth() > > fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence) > val train = fpg.fit(finaldata) > print(train.freqItemsets.count()) > print(train.associationRules.count()) > train.save("/tmp/FPGModel") > And encountered following exception: > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278) > at > org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430) > at > org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429) > at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836) > at org.apache.spark.sql.Dataset.count(Dataset.scala:2429) > at DataMining.FPGrowth$.runJob(FPGrowth.scala:116) > at DataMining.testFPG$.main(FPGrowth.scala:36) > at DataMining.testFPG.main(FPGrowth.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119) > at
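The workaround suggested in the comment above (raising the JVM thread stack size with -Xss) can be applied through Spark's extraJavaOptions settings. A configuration sketch, assuming spark-submit is used; the 10m value is illustrative and should be tuned to the recursion depth the data triggers:

```scala
// Sketch of applying the -Xss workaround. The configuration keys are Spark's
// standard extraJavaOptions; "-Xss10m" is an illustrative value only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fpgrowth-with-larger-stack")
  // Executors launch after this point, so the flag takes effect on them.
  .config("spark.executor.extraJavaOptions", "-Xss10m")
  .getOrCreate()

// The driver JVM is already running by the time this code executes, so its
// stack size must instead be set at launch, e.g.:
//   spark-submit --conf "spark.driver.extraJavaOptions=-Xss10m" ...
```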
[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227094#comment-16227094 ] yuhao yang commented on SPARK-13030: I see. Thanks for the response [~mlnick]. The Estimator is necessary if we want to automatically infer the size. Then for adding the extra param size or not, I guess it will be useful in the case that automatic inference should not be used (E.g. Sampling before training). I would vote for adding. > Change OneHotEncoder to Estimator > - > > Key: SPARK-13030 > URL: https://issues.apache.org/jira/browse/SPARK-13030 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.0 >Reporter: Wojciech Jurczyk > > OneHotEncoder should be an Estimator, just like in scikit-learn > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). > In its current form, it is impossible to use when number of categories is > different between training dataset and test dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226307#comment-16226307 ] yuhao yang commented on SPARK-13030: Sorry for jumping in so late. I can see there has been a lot of effort here. As far as I understand, making OneHotEncoder an Estimator is essentially meant to guarantee a consistent dimension and category mapping between training and prediction. To achieve the same goal, could we instead add an optional numCategory: IntParam (or call it size) as a parameter for OneHotEncoder? If set, all output vectors would have size numCategory, and any index out of the bounds of numCategory could be resolved by handleInvalid. IMO this is a much simpler and more robust solution, and it is fully backwards compatible. > Change OneHotEncoder to Estimator > - > > Key: SPARK-13030 > URL: https://issues.apache.org/jira/browse/SPARK-13030 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 1.6.0 > Reporter: Wojciech Jurczyk
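The fixed-size behavior proposed in the comment above can be illustrated with a small sketch. This is not Spark's actual OneHotEncoder API: oneHot, numCategory, and the handleInvalid policy values here are assumptions for illustration only.

```scala
// Hypothetical sketch of the proposal: encode a category index into a vector
// of fixed length `numCategory`, resolving out-of-range indices with a
// handleInvalid-style policy ("keep" maps unseen categories to a zero vector,
// anything else errors out).
def oneHot(index: Int, numCategory: Int, handleInvalid: String = "error"): Array[Double] = {
  if (index < 0 || index >= numCategory) {
    handleInvalid match {
      case "keep" => Array.fill(numCategory)(0.0) // unseen category -> all-zero vector
      case _      => throw new IllegalArgumentException(
        s"index $index out of range [0, $numCategory)")
    }
  } else {
    val v = Array.fill(numCategory)(0.0)
    v(index) = 1.0
    v
  }
}
```

With numCategory fixed up front, training and prediction always produce vectors of the same dimension regardless of which categories actually appear in each dataset, which is the consistency requirement the Estimator change targets.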
[jira] [Created] (SPARK-22381) Add StringParam that supports valid options
yuhao yang created SPARK-22381: -- Summary: Add StringParam that supports valid options Key: SPARK-22381 URL: https://issues.apache.org/jira/browse/SPARK-22381 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor While testing https://issues.apache.org/jira/browse/SPARK-22331, I found it might be a good idea to include the possible options in a StringParam. A StringParam extends Param[String] and allows users to specify the valid options as an Array[String] (case-insensitive). So far it can help achieve three goals: 1. Make the StringParam aware of its possible options and support native validation. 2. A StringParam can list the supported options when a user inputs a wrong value. 3. Allow automatic unit test coverage for case-insensitive String params; IMO it also decreases code redundancy. The StringParam is designed to be fully compatible with the existing Param[String], only adding the extra logic for supporting options, which means we don't need to convert every Param[String] to StringParam until we feel comfortable doing so.
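A minimal sketch of the idea, assuming a hypothetical StringOption class rather than Spark's actual Param hierarchy: the param carries its valid options, validates case-insensitively, and can list the supported options on a bad input.

```scala
// Hypothetical sketch (not Spark's Param API): a string parameter that knows
// its valid options and validates case-insensitively, per the three goals above.
class StringOption(val name: String, val options: Array[String]) {
  // Goal 1/3: native, case-insensitive validation.
  def validate(value: String): Boolean =
    options.exists(_.equalsIgnoreCase(value))

  // Goal 2: list the supported options when the user input is wrong.
  def errorMessage(value: String): String =
    s"$name given invalid value '$value'. Supported options: ${options.mkString(", ")}"
}
```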
[jira] [Commented] (SPARK-18755) Add Randomized Grid Search to Spark ML
[ https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221800#comment-16221800 ] yuhao yang commented on SPARK-18755: Thanks for sending the update here. Feel free to send a PR as you wish. I'm interested in the topic and can help with review. But since none of the committers has stopped by here, I expect the review process will be very long. > Add Randomized Grid Search to Spark ML > -- > > Key: SPARK-18755 > URL: https://issues.apache.org/jira/browse/SPARK-18755 > Project: Spark > Issue Type: Improvement > Components: ML > Reporter: yuhao yang > > Randomized grid search implements a randomized search over parameters, where > each setting is sampled from a distribution over possible parameter values. > This has two main benefits over an exhaustive search: > 1. A budget can be chosen independently of the number of parameters and > possible values. > 2. Adding parameters that do not influence the performance does not decrease > efficiency. > Randomized grid search usually gives results similar to an exhaustive search, > while its run time is drastically lower. > For more background, please refer to: > sklearn: http://scikit-learn.org/stable/modules/grid_search.html > http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/ > http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf > https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/ > There are two ways to implement this in Spark as I see it: > 1. Add a searchRatio to ParamGridBuilder and conduct sampling directly during > build. Only one new public function is required. > 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator > and RandomizedTrainValidationSplit, which can be complicated since we need to > deal with the models. > I'd prefer option 1 as it's much simpler and more straightforward: we can support > randomized grid search with a minimal change.
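Option 1 above can be sketched in a few lines. sampleGrid and searchRatio are illustrative names, not an existing ParamGridBuilder API; the generic grid elements stand in for the ParamMaps a real builder would produce.

```scala
// Sketch of option 1: build the full parameter grid, then sample a fraction
// of it. A fixed seed makes the sampled subset reproducible across runs.
def sampleGrid[T](grid: Seq[T], searchRatio: Double, seed: Long): Seq[T] = {
  require(searchRatio > 0.0 && searchRatio <= 1.0, "searchRatio must be in (0, 1]")
  val rng = new scala.util.Random(seed)
  // Keep at least one candidate so the search never degenerates to nothing.
  val n = math.max(1, math.round(grid.size * searchRatio).toInt)
  rng.shuffle(grid).take(n)
}
```

A budget can then be chosen by picking searchRatio (or an absolute count) independently of how many parameters and values the full grid contains.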
[jira] [Commented] (SPARK-22331) Make MLlib string params case-insensitive
[ https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215489#comment-16215489 ] yuhao yang commented on SPARK-22331: Yes, I don't see how the change would break any existing code. > Make MLlib string params case-insensitive > - > > Key: SPARK-22331 > URL: https://issues.apache.org/jira/browse/SPARK-22331 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.2.0 > Reporter: yuhao yang > Priority: Minor > > Some String params in ML are still case-sensitive, as they are checked by > ParamValidators.inArray. > For consistency in user experience, there should be a general guideline on > whether String params in Spark MLlib are case-insensitive or not. > I'm leaning towards making all String params case-insensitive where possible.
[jira] [Commented] (SPARK-22331) Strength consistency for supporting string params: case-insensitive or not
[ https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214667#comment-16214667 ] yuhao yang commented on SPARK-22331: cc [~WeichenXu123] > Strength consistency for supporting string params: case-insensitive or not > -- > > Key: SPARK-22331 > URL: https://issues.apache.org/jira/browse/SPARK-22331 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.2.0 > Reporter: yuhao yang > Priority: Minor
[jira] [Created] (SPARK-22331) Strength consistency for supporting string params: case-insensitive or not
yuhao yang created SPARK-22331: -- Summary: Strength consistency for supporting string params: case-insensitive or not Key: SPARK-22331 URL: https://issues.apache.org/jira/browse/SPARK-22331 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor Some String params in ML are still case-sensitive, as they are checked by ParamValidators.inArray. For consistency in user experience, there should be a general guideline on whether String params in Spark MLlib are case-insensitive or not. I'm leaning towards making all String params case-insensitive where possible.
[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208614#comment-16208614 ] yuhao yang commented on SPARK-22289: Thanks for the reply. I'll start composing a PR. > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML > Affects Versions: 2.2.0 > Reporter: Nic Eggert > > I think this was introduced in SPARK-20047. > Trying to call save on a logistic regression model trained with bounds on its > parameters throws an error. This seems to be because Spark doesn't know how > to serialize the Matrix parameter. > Model is set up like this: > {code} > val calibrator = new LogisticRegression() > .setFeaturesCol("uncalibrated_probability") > .setLabelCol("label") > .setWeightCol("weight") > .setStandardization(false) > .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0))) > .setFamily("binomial") > .setProbabilityCol("probability") > .setPredictionCol("logistic_prediction") > .setRawPredictionCol("logistic_raw_prediction") > {code} > {code} > 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. 
> at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277) > at > org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253) > at > org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > -snip- > {code} -- This message was sent by 
Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207115#comment-16207115 ] yuhao yang commented on SPARK-22289: cc [~yanboliang] [~dbtsai] > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML > Affects Versions: 2.2.0 > Reporter: Nic Eggert
[jira] [Comment Edited] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207063#comment-16207063 ] yuhao yang edited comment on SPARK-22289 at 10/17/17 6:43 AM: -- Thanks for reporting the issue. This should be a straightforward fix, though maybe we should cover it better in release QA. There are two ways to support this as I see it: 1. Support save/load on LogisticRegressionParams, and also adjust the save/load in LogisticRegression and LogisticRegressionModel. 2. Directly support Matrix in Param.jsonEncode, similar to what we have done for Vector. IMO we need to collect opinions before sending a fix; feel free to suggest other options. I'm leaning towards 2, for simplicity and convenience for other classes. was (Author: yuhaoyan): Thanks for reporting the issue. Should be a straight-forward fix. Yet we should not miss this in the Release QA. Please send response if anyone has already started working on this. Otherwise I'll send a fix. > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML > Affects Versions: 2.2.0 > Reporter: Nic Eggert > > I think this was introduced in SPARK-20047. > Trying to call save on a logistic regression model trained with bounds on its > parameters throws an error. This seems to be because Spark doesn't know how > to serialize the Matrix parameter.
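Option 2 (supporting Matrix directly in Param.jsonEncode) could look roughly like the following standalone sketch. The JSON field names are assumptions for illustration, not Spark's actual serialization format.

```scala
// Hypothetical sketch of a jsonEncode-style serialization for a dense matrix,
// mirroring what Param.jsonEncode already does for vectors: flatten the
// matrix to (numRows, numCols, values). Field names are illustrative only.
def matrixToJson(numRows: Int, numCols: Int, values: Array[Double]): String = {
  require(values.length == numRows * numCols,
    "values length must equal numRows * numCols")
  s"""{"type":"dense","numRows":$numRows,"numCols":$numCols,"values":[${values.mkString(",")}]}"""
}
```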
[jira] [Comment Edited] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207063#comment-16207063 ] yuhao yang edited comment on SPARK-22289 at 10/17/17 6:28 AM: -- Thanks for reporting the issue. Should be a straight-forward fix. Yet we should not miss this in the Release QA. Please send response if anyone has already started working on this. Otherwise I'll send a fix. was (Author: yuhaoyan): Thanks for reporting the issue. Should be a straight-forward fix. Yet we should not miss this in the Release QA. Let send response if anyone has already started working on this. Otherwise I'll send a fix. > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML > Affects Versions: 2.2.0 > Reporter: Nic Eggert
[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207063#comment-16207063 ] yuhao yang commented on SPARK-22289: Thanks for reporting the issue. Should be a straight-forward fix. Yet we should not miss this in the Release QA. Let send response if anyone has already started working on this. Otherwise I'll send a fix. > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML > Affects Versions: 2.2.0 > Reporter: Nic Eggert
[jira] [Comment Edited] (SPARK-22195) Add cosine similarity to org.apache.spark.ml.linalg.Vectors
[ https://issues.apache.org/jira/browse/SPARK-22195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193844#comment-16193844 ] yuhao yang edited comment on SPARK-22195 at 10/6/17 7:33 AM: - Thanks for the feedback. I don't see how the existing implementations (RowMatrix or Word2Vec) fulfill these two scenarios: 1. Compute cosine similarity between two arbitrary vectors. 2. Compute cosine similarity between one vector and a group of other vectors (usually candidates). And I'm afraid that not everyone using Spark ML knows how to implement cosine similarity. was (Author: yuhaoyan): Thanks for the feedback. I don't see the existing implementation (RowMatrix or in Word2Vec) can fulfill the two scenarios: 1. Compute cosine similarity between two arbitrary vectors. 2. Compute cosine similarity between one vector and a group of other Vectors (usually candidates). And again, not everyone using Spark ML know how to implement cosine similarity. > Add cosine similarity to org.apache.spark.ml.linalg.Vectors > --- > > Key: SPARK-22195 > URL: https://issues.apache.org/jira/browse/SPARK-22195 > Project: Spark > Issue Type: New Feature > Components: ML > Affects Versions: 2.2.0 > Reporter: yuhao yang > Priority: Minor > > https://en.wikipedia.org/wiki/Cosine_similarity: > As the most important measure of similarity, I found it quite useful in some > image and NLP applications according to personal experience. > Suggest to add a function for cosine similarity in > org.apache.spark.ml.linalg.Vectors. > Interface: > def cosineSimilarity(v1: Vector, v2: Vector): Double = ... > def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): > Double = ... > Appreciate suggestions; need a green light from committers.
[jira] [Created] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior
yuhao yang created SPARK-22210: -- Summary: Online LDA variationalTopicInference should use random seed to have stable behavior Key: SPARK-22210 URL: https://issues.apache.org/jira/browse/SPARK-22210 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582 The Gamma distribution should use a random seed to have consistent behavior. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
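A possible shape of the fix, sketched with Breeze (which LDAOptimizer uses internally for its Gamma draws). This is illustrative only; the exact wiring of `randomSeed` into the optimizer is up to the actual patch:

```scala
import breeze.stats.distributions.{Gamma, RandBasis}

// Sketch: draw the Gamma samples used by variationalTopicInference from a
// seeded RandBasis, so repeated runs produce identical gammad initializations.
// `randomSeed` is assumed to come from the optimizer's configuration.
def seededGammaSamples(shape: Double, scale: Double, n: Int, randomSeed: Long): IndexedSeq[Double] = {
  implicit val basis: RandBasis = RandBasis.withSeed(randomSeed.toInt)
  Gamma(shape, scale).sample(n)
}
```

Two calls with the same seed would then return identical samples, which is what "stable behavior" requires here.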
[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator
[ https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192217#comment-16192217 ] yuhao yang commented on SPARK-3181: --- Regarding whether to separate Huber loss into an independent Estimator, I don't see a direct conflict. IMO, LinearRegression should act as an all-in-one Estimator that allows users to combine whichever loss function, optimizer and regularization they choose. It should target flexibility and also provide some fundamental infrastructure for regression algorithms. In the meantime, we may also support HuberRegression, RidgeRegression and others as independent Estimators, which are more convenient but less flexible (and can expose algorithm-specific parameters). As mentioned by Seth, this would require better code abstraction and a plugin interface. Besides loss/prediction/optimizer, we also need to provide infrastructure for model summary and serialization. This should only happen after we can compose an Estimator like HuberRegression without noticeable code duplication. > Add Robust Regression Algorithm with Huber Estimator > > > Key: SPARK-3181 > URL: https://issues.apache.org/jira/browse/SPARK-3181 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Fan Jiang >Assignee: Yanbo Liang > Labels: features > Original Estimate: 0h > Remaining Estimate: 0h > > Linear least square estimates assume the error has normal distribution and > can behave badly when the errors are heavy-tailed. In practical we get > various types of data. We need to include Robust Regression to employ a > fitting criterion that is not as vulnerable as least square. > In 1973, Huber introduced M-estimation for regression which stands for > "maximum likelihood type". The method is resistant to outliers in the > response variable and has been widely used. 
> The new feature for MLlib will contain 3 new files > /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala > /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala > /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala > and one new class HuberRobustGradient in > /main/scala/org/apache/spark/mllib/optimization/Gradient.scala -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22195) Add cosine similarity to org.apache.spark.ml.linalg.Vectors
[ https://issues.apache.org/jira/browse/SPARK-22195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190884#comment-16190884 ] yuhao yang commented on SPARK-22195: Exactly, the implementation is straightforward, but I guess not everyone knows about it. I was asked several times whether Spark supports cosine similarity computation and I had to do the explanation. Just want to see if this is a common requirement. > Add cosine similarity to org.apache.spark.ml.linalg.Vectors > --- > > Key: SPARK-22195 > URL: https://issues.apache.org/jira/browse/SPARK-22195 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > https://en.wikipedia.org/wiki/Cosine_similarity: > As the most important measure of similarity, I found it quite useful in some > image and NLP applications according to personal experience. > Suggest to add function for cosine similarity in > org.apache.spark.ml.linalg.Vectors. > Interface: > def cosineSimilarity(v1: Vector, v2: Vector): Double = ... > def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): > Double = ... > Appreciate suggestions and need green light from committers. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22195) Add cosine similarity to org.apache.spark.ml.linalg.Vectors
yuhao yang created SPARK-22195: -- Summary: Add cosine similarity to org.apache.spark.ml.linalg.Vectors Key: SPARK-22195 URL: https://issues.apache.org/jira/browse/SPARK-22195 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor https://en.wikipedia.org/wiki/Cosine_similarity: As one of the most important similarity measures, I have found it quite useful in some image and NLP applications, based on personal experience. I suggest adding functions for cosine similarity to org.apache.spark.ml.linalg.Vectors. Interface: def cosineSimilarity(v1: Vector, v2: Vector): Double = ... def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): Double = ... I would appreciate suggestions, and this needs a green light from committers. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
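A minimal sketch of the proposed functions (a hypothetical implementation, not a committed Spark API; `Vectors.norm` exists in `org.apache.spark.ml.linalg`, but the dense `toArray` dot product below is for clarity rather than sparse-vector efficiency):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical implementation of the proposed interface; not part of Spark.
object CosineSimilarity {

  // Overload for when the L2 norms are already known (e.g. precomputed and cached).
  def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): Double = {
    require(v1.size == v2.size, "Vectors must have the same dimension.")
    val dot = v1.toArray.zip(v2.toArray).map { case (a, b) => a * b }.sum
    dot / (norm1 * norm2)
  }

  // Convenience overload that computes the norms itself.
  def cosineSimilarity(v1: Vector, v2: Vector): Double =
    cosineSimilarity(v1, v2, Vectors.norm(v1, 2.0), Vectors.norm(v2, 2.0))
}
```

The two-overload design matches the proposal: the norm-accepting variant covers scenario 2 (one query vector against many candidates), where recomputing the candidates' norms on every call would be wasteful.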
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190239#comment-16190239 ] yuhao yang commented on SPARK-21866: My two cents: 1. In most scenarios, deep learning applications use rescaled/cropped images (typically 256, 224 or smaller). I would add an extra parameter "smallSideSize" to the readImages method, which is more convenient for users and avoids caching the original-size image (which could be 100 times larger than the scaled one). 2. Not sure about the reason to include path info in the image data. In my experience, path info serves better as a separate column in the DataFrame. 3. After augmentation and normalization, the image data will be floating-point numbers rather than bytes. That's fine if the current format is only for reading image data, but not if it is meant as the standard image feature exchange format in Spark. 4. I don't see the parameter "recursive" as necessary; existing wildcard matching provides more functionality. Part of the image pre-processing code I used (a little stale) is available from https://github.com/hhbyyh/SparkDL, just for reference. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. 
A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. 
Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general
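The "smallSideSize" parameter suggested in the comment above would rescale each image so its shorter side hits a target length while preserving the aspect ratio, as is typical before a 224x224 center crop. A hypothetical helper (the name and behavior are the commenter's suggestion, not part of the SPIP):

```scala
// Compute the rescaled (width, height) for a given smallSideSize:
// the shorter side becomes exactly smallSideSize, the longer side is
// scaled proportionally and rounded to the nearest pixel.
def scaledDimensions(width: Int, height: Int, smallSideSize: Int): (Int, Int) = {
  require(width > 0 && height > 0 && smallSideSize > 0)
  if (width <= height) {
    val scale = smallSideSize.toDouble / width
    (smallSideSize, math.round(height * scale).toInt)
  } else {
    val scale = smallSideSize.toDouble / height
    (math.round(width * scale).toInt, smallSideSize)
  }
}

// e.g. a 640x480 image rescaled with smallSideSize = 256 becomes 341x256
```

Reading images pre-scaled this way keeps the cached DataFrame far smaller than caching originals, which is the memory argument made in point 1.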
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139023#comment-16139023 ] yuhao yang commented on SPARK-21535: Thanks for the comments. > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, current implementation > consumes extra driver memory for holding the trained models, which is not > necessary and often leads to memory exception for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation, > thus that used model can be collected by GC, and avoid the unnecessary OOM > exceptions. > E.g. when grid search space is 12, old implementation needs to hold all 12 > trained models in the driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang resolved SPARK-21535. Resolution: Not A Problem The new implementation would load the evaluation dataset while training the model, and may not always yield better performance. Please refer to the discussion in the PR. > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, current implementation > consumes extra driver memory for holding the trained models, which is not > necessary and often leads to memory exception for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation, > thus that used model can be collected by GC, and avoid the unnecessary OOM > exceptions. > E.g. when grid search space is 12, old implementation needs to hold all 12 > trained models in the driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103547#comment-16103547 ] yuhao yang commented on SPARK-21535: It's not, in my opinion. https://issues.apache.org/jira/browse/SPARK-21086 is trying to store all the trained models in the TrainValidationSplitModel or CrossValidatorModel, according to the discussion, with a control parameter that is turned off by default. Anyway, changing the training process hardly has an impact on that. > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, current implementation > consumes extra driver memory for holding the trained models, which is not > necessary and often leads to memory exception for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation, > thus that used model can be collected by GC, and avoid the unnecessary OOM > exceptions. > E.g. when grid search space is 12, old implementation needs to hold all 12 > trained models in the driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100860#comment-16100860 ] yuhao yang edited comment on SPARK-21535 at 7/26/17 6:30 PM: - https://github.com/apache/spark/pull/18733 was (Author: yuhaoyan): https://github.com/apache/spark/pulls > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, current implementation > consumes extra driver memory for holding the trained models, which is not > necessary and often leads to memory exception for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation, > thus that used model can be collected by GC, and avoid the unnecessary OOM > exceptions. > E.g. when grid search space is 12, old implementation needs to hold all 12 > trained models in the driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101870#comment-16101870 ] yuhao yang commented on SPARK-21535: The basic idea is that we should release the driver memory as soon as a trained model is evaluated. I don't think there's any conflict, but let me know if there is and I'll revert the jira. I'm not a big fan of the parallel CV idea; personally I cannot see how it improves the overall performance or ease of use, but maybe I just never met the appropriate scenarios. > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, current implementation > consumes extra driver memory for holding the trained models, which is not > necessary and often leads to memory exception for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation, > thus that used model can be collected by GC, and avoid the unnecessary OOM > exceptions. > E.g. when grid search space is 12, old implementation needs to hold all 12 > trained models in the driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100860#comment-16100860 ] yuhao yang commented on SPARK-21535: https://github.com/apache/spark/pulls > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, current implementation > consumes extra driver memory for holding the trained models, which is not > necessary and often leads to memory exception for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation, > thus that used model can be collected by GC, and avoid the unnecessary OOM > exceptions. > E.g. when grid search space is 12, old implementation needs to hold all 12 > trained models in the driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-21535: --- Description: CrossValidator and TrainValidationSplit both use {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where epm is Array[ParamMap]. Even though the training process is sequential, current implementation consumes extra driver memory for holding the trained models, which is not necessary and often leads to memory exception for both CrossValidator and TrainValidationSplit. My proposal is to optimize the training implementation so that each used model can be collected by the GC, avoiding the unnecessary OOM exceptions. E.g. when grid search space is 12, old implementation needs to hold all 12 trained models in the driver memory at the same time, while the new implementation only needs to hold 1 trained model at a time, and previous model can be cleared by GC. was: CrossValidator and TrainValidationSplit both use {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where epm is Array[ParamMap]. Even though the training process is sequential, current implementation consumes extra driver memory for holding the trained models, which is not necessary and often leads to memory exception for both CrossValidator and TrainValidationSplit. My proposal is to changing the training implementation to train one model at a time, thus that used local model can be collected by GC, and avoid the unnecessary OOM exceptions. > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. 
> Even though the training process is sequential, current implementation > consumes extra driver memory for holding the trained models, which is not > necessary and often leads to memory exception for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation > so that each used model can be collected by the GC, avoiding the unnecessary OOM > exceptions. > E.g. when grid search space is 12, old implementation needs to hold all 12 > trained models in the driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
yuhao yang created SPARK-21535: -- Summary: Reduce memory requirement for CrossValidator and TrainValidationSplit Key: SPARK-21535 URL: https://issues.apache.org/jira/browse/SPARK-21535 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang CrossValidator and TrainValidationSplit both use {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where epm is Array[ParamMap]. Even though the training process is sequential, current implementation consumes extra driver memory for holding the trained models, which is not necessary and often leads to memory exception for both CrossValidator and TrainValidationSplit. My proposal is to change the training implementation to train one model at a time, so that each used local model can be collected by the GC, avoiding the unnecessary OOM exceptions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
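The proposal amounts to replacing the bulk `est.fit(trainingDataset, epm)` call with a sequential fit-then-evaluate loop. A rough sketch, using `est`, `eval`, `epm` and the dataset names as they appear in the CrossValidator internals (this fragment is illustrative, not the actual patch):

```scala
// Before: all models are materialized at once and held on the driver.
// val models = est.fit(trainingDataset, epm)

// After: fit and evaluate one ParamMap at a time. Each model becomes
// unreachable at the end of its iteration, so the driver only ever
// holds one trained model and the previous one can be reclaimed by GC.
val metrics = epm.map { paramMap =>
  val model = est.fit(trainingDataset, paramMap)
  eval.evaluate(model.transform(validationDataset, paramMap))
}
```

With a grid of 12 ParamMaps, driver memory for models drops from 12 concurrent models to 1, which is exactly the example given in the description.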
[jira] [Commented] (SPARK-21087) CrossValidator, TrainValidationSplit should preserve all models after fitting: Scala
[ https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100447#comment-16100447 ] yuhao yang commented on SPARK-21087: Withdrawing my PR; anyone interested, please go ahead and work on this. > CrossValidator, TrainValidationSplit should preserve all models after > fitting: Scala > > > Key: SPARK-21087 > URL: https://issues.apache.org/jira/browse/SPARK-21087 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > See parent JIRA -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21524) ValidatorParamsSuiteHelpers generates wrong temp files
[ https://issues.apache.org/jira/browse/SPARK-21524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099313#comment-16099313 ] yuhao yang commented on SPARK-21524: https://github.com/apache/spark/pull/18728 > ValidatorParamsSuiteHelpers generates wrong temp files > -- > > Key: SPARK-21524 > URL: https://issues.apache.org/jira/browse/SPARK-21524 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > ValidatorParamsSuiteHelpers.testFileMove() is generating temp dir in the > wrong place and does not delete them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21524) ValidatorParamsSuiteHelpers generates wrong temp files
yuhao yang created SPARK-21524: -- Summary: ValidatorParamsSuiteHelpers generates wrong temp files Key: SPARK-21524 URL: https://issues.apache.org/jira/browse/SPARK-21524 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang ValidatorParamsSuiteHelpers.testFileMove() generates temp dirs in the wrong place and does not delete them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14239) Add load for LDAModel that supports both local and distributedModel
[ https://issues.apache.org/jira/browse/SPARK-14239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098948#comment-16098948 ] yuhao yang commented on SPARK-14239: Close overlooked stale jira. > Add load for LDAModel that supports both local and distributedModel > --- > > Key: SPARK-14239 > URL: https://issues.apache.org/jira/browse/SPARK-14239 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > Add load for LDAModel that supports loading both local and distributedModel, > as discussed in https://github.com/apache/spark/pull/9894. So that users > don't have to know the details. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14239) Add load for LDAModel that supports both local and distributedModel
[ https://issues.apache.org/jira/browse/SPARK-14239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang resolved SPARK-14239. Resolution: Won't Do > Add load for LDAModel that supports both local and distributedModel > --- > > Key: SPARK-14239 > URL: https://issues.apache.org/jira/browse/SPARK-14239 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > Add load for LDAModel that supports loading both local and distributedModel, > as discussed in https://github.com/apache/spark/pull/9894. So that users > don't have to know the details. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12875) Add Weight of Evidence and Information value to Spark.ml as a feature transformer
[ https://issues.apache.org/jira/browse/SPARK-12875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098946#comment-16098946 ] yuhao yang commented on SPARK-12875: Close stale jira. > Add Weight of Evidence and Information value to Spark.ml as a feature > transformer > - > > Key: SPARK-12875 > URL: https://issues.apache.org/jira/browse/SPARK-12875 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Priority: Minor > > As a feature transformer, WOE and IV enable one to: > Consider each variable’s independent contribution to the outcome. > Detect linear and non-linear relationships. > Rank variables in terms of "univariate" predictive strength. > Visualize the correlations between the predictive variables and the binary > outcome. > http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ gives > a good introduction to WoE and IV. > The Weight of Evidence or WoE value provides a measure of how well a > grouping of feature is able to distinguish between a binary response (e.g. > "good" versus "bad"), which is widely used in grouping continuous feature or > mapping categorical features to continuous values. It is computed from the > basic odds ratio: > (Distribution of positive Outcomes) / (Distribution of negative Outcomes) > where Distr refers to the proportion of positive or negative in the > respective group, relative to the column totals. > The WoE recoding of features is particularly well suited for subsequent > modeling using Logistic Regression or MLP. > In addition, the information value or IV can be computed based on WoE, which > is a popular technique to select variables in a predictive model. > TODO: Currently we support only calculation for categorical features. Add an > estimator to estimate the proper grouping for continuous feature. 
[jira] [Resolved] (SPARK-12875) Add Weight of Evidence and Information value to Spark.ml as a feature transformer
[ https://issues.apache.org/jira/browse/SPARK-12875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang resolved SPARK-12875. Resolution: Won't Do > Add Weight of Evidence and Information value to Spark.ml as a feature > transformer > - > > Key: SPARK-12875 > URL: https://issues.apache.org/jira/browse/SPARK-12875 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Priority: Minor > > As a feature transformer, WoE and IV enable one to: > Consider each variable’s independent contribution to the outcome. > Detect linear and non-linear relationships. > Rank variables in terms of "univariate" predictive strength. > Visualize the correlations between the predictive variables and the binary > outcome. > http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ gives > a good introduction to WoE and IV. > The Weight of Evidence or WoE value provides a measure of how well a > grouping of a feature is able to distinguish between a binary response (e.g. > "good" versus "bad"), and is widely used for grouping continuous features or > mapping categorical features to continuous values. It is computed from the > basic odds ratio: > (Distribution of positive Outcomes) / (Distribution of negative Outcomes) > where Distr refers to the proportion of positive or negative outcomes in the > respective group, relative to the column totals. > The WoE recoding of features is particularly well suited for subsequent > modeling with Logistic Regression or MLP. > In addition, the Information Value or IV can be computed from WoE; it > is a popular technique for selecting variables in a predictive model. > TODO: Currently we support only the calculation for categorical features. Add an > estimator to estimate the proper grouping for continuous features.
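For reference, the WoE/IV computation described in this issue can be sketched as follows. This is illustrative only (Python, not Spark code); the function name and count-pair input format are assumptions, and real data would need smoothing to avoid zero counts:

```python
import math

# Illustrative sketch only (not part of Spark ML): Weight of Evidence and
# Information Value for one categorical feature, from per-category counts.
# Assumes every category has at least one positive and one negative example;
# in practice a smoothing term would be added to avoid log(0).
def woe_iv(categories):
    """categories: list of (positive_count, negative_count), one per category."""
    total_pos = sum(p for p, _ in categories)
    total_neg = sum(n for _, n in categories)
    woes, iv = [], 0.0
    for p, n in categories:
        dist_pos = p / total_pos           # share of all positives in this group
        dist_neg = n / total_neg           # share of all negatives in this group
        w = math.log(dist_pos / dist_neg)  # WoE = ln(DistrPos / DistrNeg)
        woes.append(w)
        iv += (dist_pos - dist_neg) * w    # IV accumulates over all categories
    return woes, iv
```

A category over-represented among positives gets a positive WoE, and the IV sums each category's contribution, which is what makes it usable for variable ranking.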
[jira] [Comment Edited] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit
[ https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098940#comment-16098940 ] yuhao yang edited comment on SPARK-14760 at 7/24/17 6:23 PM: - Closing this stale JIRA since it's been overlooked for some time. Thanks for the review and comments. was (Author: yuhaoyan): Close it since it's been overlooked for some time. Thanks for the review and comments. > Feature transformers should always invoke transformSchema in transform or fit > - > > Key: SPARK-14760 > URL: https://issues.apache.org/jira/browse/SPARK-14760 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > Since one of the primary functions of transformSchema is to conduct parameter > validation, transformers should always invoke transformSchema in transform > and fit.
[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit
[ https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098940#comment-16098940 ] yuhao yang commented on SPARK-14760: Closing it since it's been overlooked for some time. Thanks for the review and comments. > Feature transformers should always invoke transformSchema in transform or fit > - > > Key: SPARK-14760 > URL: https://issues.apache.org/jira/browse/SPARK-14760 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > Since one of the primary functions of transformSchema is to conduct parameter > validation, transformers should always invoke transformSchema in transform > and fit.
[jira] [Resolved] (SPARK-13223) Add stratified sampling to ML feature engineering
[ https://issues.apache.org/jira/browse/SPARK-13223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang resolved SPARK-13223. Resolution: Not A Problem > Add stratified sampling to ML feature engineering > - > > Key: SPARK-13223 > URL: https://issues.apache.org/jira/browse/SPARK-13223 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Priority: Minor > > I found it useful to add a sampling transformer during a fraud-detection > case. It can be used for resampling or oversampling, which in turn is > required for ensembles and unbalanced-data processing. > Internally, it invokes sampleByKey from the Pair RDD operations.
[jira] [Commented] (SPARK-13223) Add stratified sampling to ML feature engineering
[ https://issues.apache.org/jira/browse/SPARK-13223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098933#comment-16098933 ] yuhao yang commented on SPARK-13223: Closing it since it's been overlooked for some time and can be implemented easily with #17583. > Add stratified sampling to ML feature engineering > - > > Key: SPARK-13223 > URL: https://issues.apache.org/jira/browse/SPARK-13223 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Priority: Minor > > I found it useful to add a sampling transformer during a fraud-detection > case. It can be used for resampling or oversampling, which in turn is > required for ensembles and unbalanced-data processing. > Internally, it invokes sampleByKey from the Pair RDD operations.
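The per-key sampling the issue describes (Spark's `sampleByKey` on a pair RDD) can be sketched without Spark as follows. The function name and signature here are hypothetical, chosen to mirror the RDD API:

```python
import random

# Illustrative sketch (no Spark dependency): keep each record with the
# probability configured for its key, mirroring what PairRDD.sampleByKey
# does for stratified sampling / oversampling by class label.
def sample_by_key(data, fractions, seed=0):
    """data: iterable of (key, value); fractions: key -> keep probability."""
    rng = random.Random(seed)
    # an independent Bernoulli trial per record, with a per-key probability
    return [(k, v) for k, v in data if rng.random() < fractions.get(k, 0.0)]
```

For oversampling a rare class, one would instead draw with replacement (e.g. a Poisson-distributed copy count per record), which is what `withReplacement=true` provides in the real API.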
[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting
[ https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097062#comment-16097062 ] yuhao yang commented on SPARK-21086: Sure, indices sound fine. As for driver memory, especially for CrossValidator, caching all the trained models would be impractical and unnecessary. Even though all the models are collected to the driver, it is a sequential process, and with the current implementation of CrossValidator, GC can kick in and clear all the previous models, which is especially practical for large models. > CrossValidator, TrainValidationSplit should preserve all models after fitting > - > > Key: SPARK-21086 > URL: https://issues.apache.org/jira/browse/SPARK-21086 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > I've heard multiple requests for having CrossValidatorModel and > TrainValidationSplitModel preserve the full list of fitted models. This > sounds very valuable. > One decision should be made before we do this: Should we save and load the > models in ML persistence? That could blow up the size of a saved Pipeline if > the models are large. > * I suggest *not* saving the models by default but allowing saving if > specified. We could specify whether to save the model as an extra Param for > CrossValidatorModelWriter, but we would have to make sure to expose > CrossValidatorModelWriter as a public API and modify the return type of > CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not > be a breaking change).
[jira] [Updated] (SPARK-18724) Add TuningSummary for TrainValidationSplit and CountVectorizer
[ https://issues.apache.org/jira/browse/SPARK-18724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-18724: --- Summary: Add TuningSummary for TrainValidationSplit and CountVectorizer (was: Add TuningSummary for TrainValidationSplit) > Add TuningSummary for TrainValidationSplit and CountVectorizer > -- > > Key: SPARK-18724 > URL: https://issues.apache.org/jira/browse/SPARK-18724 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > Currently TrainValidationSplitModel only provides tuning metrics in the > format of Array[Double], which makes it harder to tie the metrics back to > the paramMaps that generated them and limits the usefulness of the tuning > framework. > Add a Tuning Summary to provide a better presentation of the tuning metrics; > for now the idea is to use a DataFrame listing all the params and > corresponding metrics. > The Tuning Summary class can be further extended for CrossValidator. > Refer to https://issues.apache.org/jira/browse/SPARK-18704 for more related > discussion
[jira] [Comment Edited] (SPARK-11069) Add RegexTokenizer option to convert to lowercase
[ https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073987#comment-16073987 ] yuhao yang edited comment on SPARK-11069 at 7/4/17 6:32 PM: [~levente.torok.ge] toLowerCase is set to true by default from 1.6+ to be consistent with Tokenizer and accommodate the general user scenarios. The change of behavior was documented in the release notes of 1.6. https://spark.apache.org/releases/spark-release-1-6-0.html You can disable it by setting toLowerCase to false. val regexTokenizer = new RegexTokenizer() *{color:red} .setToLowercase(false){color}* was (Author: yuhaoyan): [~levente.torok.ge] toLowerCase is set to true by default from 1.6+ to be consistent with Tokenizer and accommodate the general user scenarios. The change of behavior was documented in the release notes of 1.6. https://spark.apache.org/releases/spark-release-1-6-0.html val regexTokenizer = new RegexTokenizer() *{color:red} .setToLowercase(false){color}* > Add RegexTokenizer option to convert to lowercase > - > > Key: SPARK-11069 > URL: https://issues.apache.org/jira/browse/SPARK-11069 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Minor > Fix For: 1.6.0 > > > Tokenizer converts strings to lowercase automatically, but RegexTokenizer > does not. It would be nice to add an option to RegexTokenizer to convert to > lowercase. Proposal: > * call the Boolean Param "toLowercase" > * set default to false (so behavior does not change) > *Q*: Should conversion to lowercase happen before or after regex matching? > * Before: This is simpler. > * After: This gives the user full control since they can have the regex treat > upper/lower case differently. > --> I'd vote for conversion before matching. If a user needs full control, > they can convert to lowercase manually. 
[jira] [Comment Edited] (SPARK-11069) Add RegexTokenizer option to convert to lowercase
[ https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073987#comment-16073987 ] yuhao yang edited comment on SPARK-11069 at 7/4/17 6:31 PM: [~levente.torok.ge] toLowerCase is set to true by default from 1.6+ to be consistent with Tokenizer and accommodate the general user scenarios. The change of behavior was documented in the release notes of 1.6. https://spark.apache.org/releases/spark-release-1-6-0.html val regexTokenizer = new RegexTokenizer() *{color:red} .setToLowercase(false){color}* was (Author: yuhaoyan): [~levente.torok.ge] use val regexTokenizer = new RegexTokenizer() *{color:red} .setToLowercase(false){color}* > Add RegexTokenizer option to convert to lowercase > - > > Key: SPARK-11069 > URL: https://issues.apache.org/jira/browse/SPARK-11069 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Minor > Fix For: 1.6.0 > > > Tokenizer converts strings to lowercase automatically, but RegexTokenizer > does not. It would be nice to add an option to RegexTokenizer to convert to > lowercase. Proposal: > * call the Boolean Param "toLowercase" > * set default to false (so behavior does not change) > *Q*: Should conversion to lowercase happen before or after regex matching? > * Before: This is simpler. > * After: This gives the user full control since they can have the regex treat > upper/lower case differently. > --> I'd vote for conversion before matching. If a user needs full control, > they can convert to lowercase manually. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11069) Add RegexTokenizer option to convert to lowercase
[ https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16073987#comment-16073987 ] yuhao yang commented on SPARK-11069: [~levente.torok.ge] use val regexTokenizer = new RegexTokenizer() *{color:red} .setToLowercase(false){color}* > Add RegexTokenizer option to convert to lowercase > - > > Key: SPARK-11069 > URL: https://issues.apache.org/jira/browse/SPARK-11069 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Minor > Fix For: 1.6.0 > > > Tokenizer converts strings to lowercase automatically, but RegexTokenizer > does not. It would be nice to add an option to RegexTokenizer to convert to > lowercase. Proposal: > * call the Boolean Param "toLowercase" > * set default to false (so behavior does not change) > *Q*: Should conversion to lowercase happen before or after regex matching? > * Before: This is simpler. > * After: This gives the user full control since they can have the regex treat > upper/lower case differently. > --> I'd vote for conversion before matching. If a user needs full control, > they can convert to lowercase manually. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
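The behavior under discussion, lowercasing before the regex is applied, can be sketched outside Spark as follows. This is not the Spark source; the function name and defaults are illustrative (Spark's RegexTokenizer defaults `toLowercase` to true from 1.6 on):

```python
import re

# Minimal sketch of RegexTokenizer's toLowercase option: when enabled, the
# input is lowercased *before* the pattern is matched, which is why a regex
# cannot treat upper/lower case differently unless the option is disabled.
def regex_tokenize(text, pattern=r"\s+", to_lowercase=True):
    if to_lowercase:
        text = text.lower()
    return [t for t in re.split(pattern, text) if t]
```

Disabling the option preserves the original casing, matching the effect of `.setToLowercase(false)` in the comment above.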
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070883#comment-16070883 ] yuhao yang commented on SPARK-20082: I'm OK with only supporting initialModel for Online LDA for now. For EM LDA, an initial model is also possible, but we may need some extra checks depending on whether EM can fit on new documents. I'll make a pass on the current implementation. But we still need the opinion and final check from [~josephkb] or other committers. > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu DESPRIEE > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA.
[jira] [Commented] (SPARK-19053) Supporting multiple evaluation metrics in DataFrame-based API: discussion
[ https://issues.apache.org/jira/browse/SPARK-19053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070849#comment-16070849 ] yuhao yang commented on SPARK-19053: Not sure if this is still wanted. cc [~josephkb] And I'd like to understand whether this JIRA is about performance improvement or API refinement. Evaluator classes in ml basically invoke the mllib implementation and compute the metrics in one pass, as I understand it. Will this change the return type of the Evaluator.evaluate() method? Currently it's Double. > Supporting multiple evaluation metrics in DataFrame-based API: discussion > - > > Key: SPARK-19053 > URL: https://issues.apache.org/jira/browse/SPARK-19053 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is to discuss supporting the computation of multiple evaluation > metrics efficiently in the DataFrame-based API for MLlib. > In the RDD-based API, RegressionMetrics and other *Metrics classes support > efficient computation of multiple metrics. > In the DataFrame-based API, there are a few options: > * model/result summaries (e.g., LogisticRegressionSummary): These currently > provide the desired functionality, but they require a model and do not let > users compute metrics manually from DataFrames of predictions and true labels. > * Evaluator classes (e.g., RegressionEvaluator): These only support computing > a single metric in one pass over the data, but they do not require a model. > * new class analogous to Metrics: We could introduce a class analogous to > Metrics. Model/result summaries could use this internally as a replacement > for spark.mllib Metrics classes, or they could (maybe) inherit from these > classes. > Thoughts?
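The "multiple metrics in one pass" idea from the issue can be sketched as a single fold over (prediction, label) pairs; the RDD-based RegressionMetrics does something similar with running sums. The function and result names here are hypothetical:

```python
# Sketch: accumulate several regression metrics in one pass over
# (prediction, label) pairs, rather than one pass per metric as the
# single-metric Evaluator API implies.
def regression_metrics(pairs):
    n, se, ae = 0, 0.0, 0.0
    for pred, label in pairs:
        err = pred - label
        se += err * err    # running sum of squared errors -> MSE
        ae += abs(err)     # running sum of absolute errors -> MAE
        n += 1
    return {"mse": se / n, "mae": ae / n, "count": n}
```

Returning a map (or a Row/DataFrame in Spark terms) rather than a single Double is exactly the API question the comment raises about `Evaluator.evaluate()`.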
[jira] [Commented] (SPARK-18441) Add Smote in spark mlib and ml
[ https://issues.apache.org/jira/browse/SPARK-18441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067494#comment-16067494 ] yuhao yang commented on SPARK-18441: Move the Smote code to https://gist.github.com/hhbyyh/346467373014943a7f20df208caeb19b > Add Smote in spark mlib and ml > -- > > Key: SPARK-18441 > URL: https://issues.apache.org/jira/browse/SPARK-18441 > Project: Spark > Issue Type: Wish > Components: ML, MLlib >Affects Versions: 2.0.1 >Reporter: lichenglin > > PLZ Add Smote in spark mlib and ml in case of the "not balance of train > data" for Classification -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-21152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065694#comment-16065694 ] yuhao yang commented on SPARK-21152: This is something that we should investigate anyway. By GEMM, do you mean you will treat the coefficients as a Matrix even though it's actually a vector? Before the implementation, I think it's necessary to check the GEMM speedup when multiplying a matrix by a vector, which could be quite different from normal GEMM. > Use level 3 BLAS operations in LogisticAggregator > - > > Key: SPARK-21152 > URL: https://issues.apache.org/jira/browse/SPARK-21152 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.1 >Reporter: Seth Hendrickson > > In logistic regression gradient update, we currently compute by each > individual row. If we blocked the rows together, we can do a blocked gradient > update which leverages the BLAS GEMM operation. > On high dimensional dense datasets, I've observed ~10x speedups. The problem > here, though, is that it likely won't improve the sparse case so we need to > keep both implementations around, and this blocked algorithm will require > caching a new dataset of type: > {code} > BlockInstance(label: Vector, weight: Vector, features: Matrix) > {code} > We have avoided caching anything beside the original dataset passed to train > in the past because it adds memory overhead if the user has cached this > original dataset for other reasons. Here, I'd like to discuss whether we > think this patch would be worth the investment, given that it only improves a > subset of the use cases. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
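The row-wise vs blocked computation being discussed can be sketched in plain loops (illustrative Python, not the Spark aggregator; in the real patch the inner loop would be a BLAS GEMV/GEMM call instead):

```python
# Sketch: computing per-row margins m_i = x_i . w. Row-wise does k separate
# dot products; blocking k rows into a k x d matrix turns them into a single
# matrix-vector product, the shape a BLAS call can execute in one shot.
def row_margins(rows, coef):
    return [sum(x * w for x, w in zip(row, coef)) for row in rows]

def block_margins(block, coef):
    # block is a k x d matrix (list of rows); this computes block @ coef
    out = []
    for row in block:
        s = 0.0
        for j, w in enumerate(coef):
            s += row[j] * w
        out.append(s)
    return out
```

Both produce the same margins; the payoff of blocking is purely that the batched form maps onto optimized BLAS routines, which is also why the speedup for a matrix-times-vector shape (GEMV) can differ from true matrix-matrix GEMM, as the comment notes.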
[jira] [Created] (SPARK-21108) convert LinearSVC to aggregator framework
yuhao yang created SPARK-21108: -- Summary: convert LinearSVC to aggregator framework Key: SPARK-21108 URL: https://issues.apache.org/jira/browse/SPARK-21108 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21087) CrossValidator, TrainValidationSplit should preserve all models after fitting: Scala
[ https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048723#comment-16048723 ] yuhao yang commented on SPARK-21087: I'd like to work on this if my [comment|https://issues.apache.org/jira/browse/SPARK-21086?focusedCommentId=16048647=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16048647] looks reasonable. > CrossValidator, TrainValidationSplit should preserve all models after > fitting: Scala > > > Key: SPARK-21087 > URL: https://issues.apache.org/jira/browse/SPARK-21087 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > See parent JIRA -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting
[ https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048647#comment-16048647 ] yuhao yang edited comment on SPARK-21086 at 6/14/17 5:22 AM: - Sounds good. About the default path for saving different models, how about we use the flatten parameter as the file name. e.g. LogisticRegressionModel-maxIter-100-regParam-0.1 And I would not implement it with the ML Persistence Framework, simply because caching the models in memory would be expensive (especially impractical for driver memory) and would impact the existing usage of CrossValidator (Slower or OOM). I would recommend adding an expert param and save the models during training. was (Author: yuhaoyan): Sounds good. About the default path for saving different models, how about we use the flatten parameter as the file name. e.g. LogisticRegressionModel-maxIter-100-regParam-0.1 And I would not implement it with the ML Persistence Framework, simply because caching the models in memory would be expensive and would impact the existing usage of CrossValidator (Slower or OOM). I would recommend adding an expert param and save the models during training. > CrossValidator, TrainValidationSplit should preserve all models after fitting > - > > Key: SPARK-21086 > URL: https://issues.apache.org/jira/browse/SPARK-21086 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > I've heard multiple requests for having CrossValidatorModel and > TrainValidationSplitModel preserve the full list of fitted models. This > sounds very valuable. > One decision should be made before we do this: Should we save and load the > models in ML persistence? That could blow up the size of a saved Pipeline if > the models are large. > * I suggest *not* saving the models by default but allowing saving if > specified. 
We could specify whether to save the model as an extra Param for > CrossValidatorModelWriter, but we would have to make sure to expose > CrossValidatorModelWriter as a public API and modify the return type of > CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not > be a breaking change). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting
[ https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048647#comment-16048647 ] yuhao yang edited comment on SPARK-21086 at 6/14/17 5:12 AM: - Sounds good. About the default path for saving different models, how about we use the flatten parameter as the file name. e.g. LogisticRegressionModel-maxIter-100-regParam-0.1 And I would not implement it with the ML Persistence Framework, simply because caching the models in memory would be expensive and would impact the existing usage of CrossValidator (Slower or OOM). I would recommend adding an expert param and save the models during training. was (Author: yuhaoyan): Sounds good. About the default path for saving different models, how about we use the flatten parameter as the file name. e.g. LogisticRegressionModel-maxIter-100-regParam-0.1 > CrossValidator, TrainValidationSplit should preserve all models after fitting > - > > Key: SPARK-21086 > URL: https://issues.apache.org/jira/browse/SPARK-21086 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > I've heard multiple requests for having CrossValidatorModel and > TrainValidationSplitModel preserve the full list of fitted models. This > sounds very valuable. > One decision should be made before we do this: Should we save and load the > models in ML persistence? That could blow up the size of a saved Pipeline if > the models are large. > * I suggest *not* saving the models by default but allowing saving if > specified. We could specify whether to save the model as an extra Param for > CrossValidatorModelWriter, but we would have to make sure to expose > CrossValidatorModelWriter as a public API and modify the return type of > CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not > be a breaking change). 
[jira] [Commented] (SPARK-20988) Convert logistic regression to new aggregator framework
[ https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048698#comment-16048698 ] yuhao yang commented on SPARK-20988: Eh.. I was trying to add the squared_hinge loss to LinearSVC and already converted LinearSVC to use the aggregator framework in SPARK-20602 https://github.com/apache/spark/pull/17862. cc [~VinceXie] > Convert logistic regression to new aggregator framework > --- > > Key: SPARK-20988 > URL: https://issues.apache.org/jira/browse/SPARK-20988 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Priority: Minor > > Use the hierarchy from SPARK-19762 for logistic regression optimization -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang resolved SPARK-20348. Resolution: Duplicate Combine it with SPARK-20602 and resolve this as duplicate. > Support squared hinge loss (L2 loss) for LinearSVC > -- > > Key: SPARK-20348 > URL: https://issues.apache.org/jira/browse/SPARK-20348 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > While Hinge loss is the standard loss function for linear SVM, Squared hinge > loss (a.k.a. L2 loss) is also popular in practice. L2-SVM is differentiable > and imposes a bigger (quadratic vs. linear) loss for points which violate the > margin. Some introduction can be found from > http://mccormickml.com/2015/01/06/what-is-an-l2-svm/ > Liblinear and [scikit > learn|http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html] > both offer squared hinge loss as the default loss function for linear SVM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20602) Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048663#comment-16048663 ] yuhao yang commented on SPARK-20602: Combining this with SPARK-20348 (Support squared hinge loss (L2 loss) for LinearSVC) and closing SPARK-20348. > Adding LBFGS optimizer and Squared_hinge loss for LinearSVC > --- > > Key: SPARK-20602 > URL: https://issues.apache.org/jira/browse/SPARK-20602 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Currently LinearSVC in Spark only supports OWLQN as the optimizer (check > https://issues.apache.org/jira/browse/SPARK-14709). I made a comparison between > LBFGS and OWLQN on several public datasets and found LBFGS converges much > faster for LinearSVC in most cases. > The following table presents the number of training iterations and f1 score > of both optimizers until convergence > ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge|| > |news20.binary| 31 (0.99) | 413(0.99) | 185 (0.99) | > |mushroom| 28(1.0) | 170(1.0)| 24(1.0) | > |madelon|143(0.75) | 8129(0.70)| 823(0.74) | > |breast-cancer-scale| 15(1.0) | 16(1.0)| 15 (1.0) | > |phishing | 329(0.94) | 231(0.94) | 67 (0.94) | > |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) | > |a7a | 237 (0.84) | 372(0.84) | 69(0.84) | > data source: > https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html > training code: new LinearSVC().setMaxIter(1).setTol(1e-6) > LBFGS requires fewer iterations in most cases (except for a1a) and probably is > a better default optimizer.
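The two losses being combined here, for a single example with label y in {-1, +1} and raw margin m = w · x, can be written as a short sketch (illustrative Python, not the Spark aggregator code):

```python
# hinge: the standard linear-SVM loss; zero once the example clears the
# margin, linear in the violation otherwise.
def hinge(y, m):
    return max(0.0, 1.0 - y * m)

# squared hinge (L2 loss): quadratic penalty for margin violations. It is
# differentiable everywhere, which is friendlier to L-BFGS-style optimizers
# than the kink in the plain hinge at y*m = 1.
def squared_hinge(y, m):
    h = hinge(y, m)
    return h * h
```

The quadratic penalty imposes a bigger cost on points that violate the margin badly, matching the description in SPARK-20348, and its smoothness is one reason liblinear and scikit-learn default to it.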
[jira] [Updated] (SPARK-20602) Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-20602: --- Summary: Adding LBFGS optimizer and Squared_hinge loss for LinearSVC (was: Adding LBFGS as optimizer for LinearSVC) > Adding LBFGS optimizer and Squared_hinge loss for LinearSVC > --- > > Key: SPARK-20602 > URL: https://issues.apache.org/jira/browse/SPARK-20602 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Currently LinearSVC in Spark only supports OWLQN as the optimizer (check > https://issues.apache.org/jira/browse/SPARK-14709). I made a comparison between > LBFGS and OWLQN on several public datasets and found LBFGS converges much > faster for LinearSVC in most cases. > The following table presents the number of training iterations and f1 score > of both optimizers until convergence > ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge|| > |news20.binary| 31 (0.99) | 413(0.99) | 185 (0.99) | > |mushroom| 28(1.0) | 170(1.0)| 24(1.0) | > |madelon|143(0.75) | 8129(0.70)| 823(0.74) | > |breast-cancer-scale| 15(1.0) | 16(1.0)| 15 (1.0) | > |phishing | 329(0.94) | 231(0.94) | 67 (0.94) | > |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) | > |a7a | 237 (0.84) | 372(0.84) | 69(0.84) | > data source: > https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html > training code: new LinearSVC().setMaxIter(1).setTol(1e-6) > LBFGS requires fewer iterations in most cases (except for a1a) and probably is > a better default optimizer.
[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting
[ https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048647#comment-16048647 ] yuhao yang commented on SPARK-21086: Sounds good. About the default path for saving different models, how about we use the flatten parameter as the file name. e.g. LogisticRegressionModel-maxIter-100-regParam-0.1 > CrossValidator, TrainValidationSplit should preserve all models after fitting > - > > Key: SPARK-21086 > URL: https://issues.apache.org/jira/browse/SPARK-21086 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > I've heard multiple requests for having CrossValidatorModel and > TrainValidationSplitModel preserve the full list of fitted models. This > sounds very valuable. > One decision should be made before we do this: Should we save and load the > models in ML persistence? That could blow up the size of a saved Pipeline if > the models are large. > * I suggest *not* saving the models by default but allowing saving if > specified. We could specify whether to save the model as an extra Param for > CrossValidatorModelWriter, but we would have to make sure to expose > CrossValidatorModelWriter as a public API and modify the return type of > CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not > be a breaking change). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
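The comment above suggests flattening a model's params into a default save path such as LogisticRegressionModel-maxIter-100-regParam-0.1. A minimal Python sketch of such a naming helper is below; the function name and dict-based param map are my own illustration, not Spark's actual persistence API:

```python
def model_param_path(model_name, params):
    """Flatten a param map into a file name, e.g.
    LogisticRegressionModel-maxIter-100-regParam-0.1.
    Params are sorted by name so the path is deterministic."""
    parts = [model_name]
    for name in sorted(params):
        parts.append(f"{name}-{params[name]}")
    return "-".join(parts)
```

A real implementation would also need to escape characters that are illegal in file names and handle very long param lists.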
[jira] [Updated] (SPARK-20602) Adding LBFGS as optimizer for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-20602: --- Description: Currently LinearSVC in Spark only supports OWLQN as the optimizer ( check https://issues.apache.org/jira/browse/SPARK-14709). I made comparison between LBFGS and OWLQN on several public dataset and found LBFGS converges much faster for LinearSVC in most cases. The following table presents the number of training iterations and f1 score of both optimizers until convergence ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge|| |news20.binary| 31 (0.99) | 413(0.99) | 185 (0.99) | |mushroom| 28(1.0) | 170(1.0)| 24(1.0) | |madelon|143(0.75) | 8129(0.70)| 823(0.74) | |breast-cancer-scale| 15(1.0) | 16(1.0)| 15 (1.0) | |phishing | 329(0.94) | 231(0.94) | 67 (0.94) | |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) | |a7a | 237 (0.84) | 372(0.84) | 69(0.84) | data source: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html training code: new LinearSVC().setMaxIter(1).setTol(1e-6) LBFGS requires less iterations in most cases (except for a1a) and probably is a better default optimizer. was: Currently LinearSVC in Spark only supports OWLQN as the optimizer ( check https://issues.apache.org/jira/browse/SPARK-14709). I made comparison between LBFGS and OWLQN on several public dataset and found LBFGS converges much faster for LinearSVC in most cases. 
The following table presents the number of training iterations and f1 score of both optimizers until convergence ||Dataset||LBFGS||OWLQN|| |news20.binary| 31 (0.99) | 413(0.99) | |mushroom| 28(1.0) | 170(1.0)| |madelon|143(0.75) | 8129(0.70)| |breast-cancer-scale| 15(1.0) | 16(1.0)| |phishing | 329(0.94) | 231(0.94) | |a1a(adult) | 466 (0.87) | 282 (0.87) | |a7a | 237 (0.84) | 372(0.84) | data source: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html training code: new LinearSVC().setMaxIter(1).setTol(1e-6) LBFGS requires less iterations in most cases (except for a1a) and probably is a better default optimizer. > Adding LBFGS as optimizer for LinearSVC > --- > > Key: SPARK-20602 > URL: https://issues.apache.org/jira/browse/SPARK-20602 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Currently LinearSVC in Spark only supports OWLQN as the optimizer ( check > https://issues.apache.org/jira/browse/SPARK-14709). I made comparison between > LBFGS and OWLQN on several public dataset and found LBFGS converges much > faster for LinearSVC in most cases. > The following table presents the number of training iterations and f1 score > of both optimizers until convergence > ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge|| > |news20.binary| 31 (0.99) | 413(0.99) | 185 (0.99) | > |mushroom| 28(1.0) | 170(1.0)| 24(1.0) | > |madelon|143(0.75) | 8129(0.70)| 823(0.74) | > |breast-cancer-scale| 15(1.0) | 16(1.0)| 15 (1.0) | > |phishing | 329(0.94) | 231(0.94) | 67 (0.94) | > |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) | > |a7a | 237 (0.84) | 372(0.84) | 69(0.84) | > data source: > https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html > training code: new LinearSVC().setMaxIter(1).setTol(1e-6) > LBFGS requires less iterations in most cases (except for a1a) and probably is > a better default optimizer. 
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022379#comment-16022379 ] yuhao yang commented on SPARK-20082: refer to https://issues.apache.org/jira/browse/SPARK-20767 for some insights shared by [~cezden] {quote} Technical aspects: 1. The implementation of LDA fitting does not currently allow the coefficients pre-setting (private setter), as noted by a comment in the source code of OnlineLDAOptimizer.setLambda: "This is only used for testing now. In the future, it can help support training stop/resume". 2. The lambda matrix is always randomly initialized by the optimizer, which needs fixing for preset lambda matrix. {quote} > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20767) The training continuation for saved LDA model
[ https://issues.apache.org/jira/browse/SPARK-20767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022375#comment-16022375 ] yuhao yang commented on SPARK-20767: Note there's already an issue about setInitialModel in https://issues.apache.org/jira/browse/SPARK-20082. [~cezden] Thanks for sharing your insight for onlineLDA. Appreciate if you can help review or contribute. > The training continuation for saved LDA model > - > > Key: SPARK-20767 > URL: https://issues.apache.org/jira/browse/SPARK-20767 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Cezary Dendek >Priority: Minor > > Current online implementation of the LDA model fit (OnlineLDAOptimizer) does > not support the model update (ie. to account for the population/covariates > drift) nor the continuation of model fitting in case of the insufficient > number of iterations. > Technical aspects: > 1. The implementation of LDA fitting does not currently allow the > coefficients pre-setting (private setter), as noted by a comment in the > source code of OnlineLDAOptimizer.setLambda: "This is only used for testing > now. In the future, it can help support training stop/resume". > 2. The lambda matrix is always randomly initialized by the optimizer, which > needs fixing for preset lambda matrix. > The adaptation of the classes by the user is not possible due to protected > setters & sealed / final classes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20864) I tried to run spark mllib PIC algorithm, but got error
[ https://issues.apache.org/jira/browse/SPARK-20864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022345#comment-16022345 ] yuhao yang commented on SPARK-20864: [~yuanjie] Could you please provide more code to help the investigation? From the exception it looks like the issue is not caused by the algorithm, but something in the data processing. > I tried to run spark mllib PIC algorithm, but got error > --- > > Key: SPARK-20864 > URL: https://issues.apache.org/jira/browse/SPARK-20864 > Project: Spark > Issue Type: Question > Components: MLlib >Affects Versions: 2.1.1 >Reporter: yuanjie >Priority: Blocker > > I use a very simple data: > 1 2 3 > 2 1 3 > 3 1 3 > 4 5 2 > 4 6 2 > 5 6 2 > but when running I got: > Exception in thread "main" : java.io.IOException: > com.google.protobuf.ServiceException: java.lang.UnsupportedOperationException > :This is supposed to be overridden by subclasses > why? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param
[ https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016116#comment-16016116 ] yuhao yang commented on SPARK-20768: Thanks for the ping. [~mlnick] We should just treat it as an expert param. Normally in python it should be exposed as a Param in my impression. > PySpark FPGrowth does not expose numPartitions (expert) param > -- > > Key: SPARK-20768 > URL: https://issues.apache.org/jira/browse/SPARK-20768 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. > While it is an "expert" param, the general approach elsewhere is to expose > these on the Python side (e.g. {{aggregationDepth}} and intermediate storage > params in {{ALS}}) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016061#comment-16016061 ] yuhao yang commented on SPARK-20797: [~d0evi1] Thanks for reporting the issue and proposal for the fix. Would you send a PR for the fix? > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. 
> word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
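The Word2Vec sizing heuristic quoted above (approxSize = (4L * vectorSize + 15) * numWords, then one partition per serializer-buffer-sized chunk) is what the reporter proposes to reuse for LDA, with the topic count k in place of numWords. The following is an illustrative Python translation of that arithmetic only (names are mine); the actual fix would live in the Scala save methods:

```python
def num_save_partitions(vector_size, num_rows, buffer_size):
    # Approximate serialized size: ~4 bytes per Double reference plus
    # per-row overhead, mirroring the Word2Vec heuristic quoted above.
    approx_size = (4 * vector_size + 15) * num_rows
    # One partition per buffer-sized chunk, always at least one.
    return approx_size // buffer_size + 1
```

For the failing LDA case in the report (vocabulary ~1,000,000, k = 300) and a 64 MB kryo buffer, this yields 18 partitions instead of the single partition produced by repartition(1).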
[jira] [Created] (SPARK-20670) Simplify FPGrowth transform
yuhao yang created SPARK-20670: -- Summary: Simplify FPGrowth transform Key: SPARK-20670 URL: https://issues.apache.org/jira/browse/SPARK-20670 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor As suggested by [~srowen] in https://github.com/apache/spark/pull/17130, the transform code in FPGrowthModel can be simplified. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20602) Adding LBFGS as optimizer for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997314#comment-15997314 ] yuhao yang commented on SPARK-20602: cc [~josephkb] > Adding LBFGS as optimizer for LinearSVC > --- > > Key: SPARK-20602 > URL: https://issues.apache.org/jira/browse/SPARK-20602 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Currently LinearSVC in Spark only supports OWLQN as the optimizer ( check > https://issues.apache.org/jira/browse/SPARK-14709). I made comparison between > LBFGS and OWLQN on several public dataset and found LBFGS converges much > faster for LinearSVC in most cases. > The following table presents the number of training iterations and f1 score > of both optimizers until convergence > ||Dataset||LBFGS||OWLQN|| > |news20.binary| 31 (0.99) | 413(0.99) | > |mushroom| 28(1.0) | 170(1.0)| > |madelon|143(0.75) | 8129(0.70)| > |breast-cancer-scale| 15(1.0) | 16(1.0)| > |phishing | 329(0.94) | 231(0.94) | > |a1a(adult) | 466 (0.87) | 282 (0.87) | > |a7a | 237 (0.84) | 372(0.84) | > data source: > https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html > training code: new LinearSVC().setMaxIter(1).setTol(1e-6) > LBFGS requires less iterations in most cases (except for a1a) and probably is > a better default optimizer. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20602) Adding LBFGS as optimizer for LinearSVC
yuhao yang created SPARK-20602: -- Summary: Adding LBFGS as optimizer for LinearSVC Key: SPARK-20602 URL: https://issues.apache.org/jira/browse/SPARK-20602 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Currently LinearSVC in Spark only supports OWLQN as the optimizer (see https://issues.apache.org/jira/browse/SPARK-14709). I compared LBFGS and OWLQN on several public datasets and found that LBFGS converges much faster for LinearSVC in most cases. The following table presents the number of training iterations and the f1 score of both optimizers at convergence:
||Dataset||LBFGS||OWLQN||
|news20.binary| 31 (0.99) | 413 (0.99) |
|mushroom| 28 (1.0) | 170 (1.0) |
|madelon| 143 (0.75) | 8129 (0.70) |
|breast-cancer-scale| 15 (1.0) | 16 (1.0) |
|phishing| 329 (0.94) | 231 (0.94) |
|a1a (adult)| 466 (0.87) | 282 (0.87) |
|a7a| 237 (0.84) | 372 (0.84) |
data source: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
training code: new LinearSVC().setMaxIter(1).setTol(1e-6)
LBFGS requires fewer iterations in most cases (except for a1a) and is probably a better default optimizer.
[jira] [Commented] (SPARK-20526) Load doesn't work in PCAModel
[ https://issues.apache.org/jira/browse/SPARK-20526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989591#comment-15989591 ] yuhao yang commented on SPARK-20526: Can you please provide more context? like which version of Spark did you use for saving and loading respectively. And perhaps share the save/load code. You can also check the explainedVariance in PCAModel to see if it's null. > Load doesn't work in PCAModel > -- > > Key: SPARK-20526 > URL: https://issues.apache.org/jira/browse/SPARK-20526 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Windows >Reporter: Hayri Volkan Agun > Original Estimate: 336h > Remaining Estimate: 336h > > Error occurs during loading PCAModel. Saved model doesn't load. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20502) ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-20502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989317#comment-15989317 ] yuhao yang commented on SPARK-20502: See https://issues.apache.org/jira/browse/SPARK-18319 for the previous discussion. I updated the list according to the changes we made last release. So far I don't think we need to change anything about the sealed and experimental APIs, but I listed some final classes we have in ml which may be ready to be unmarked.
sealed:
org.apache.spark.ml.attribute.Attribute
org.apache.spark.ml.attribute.AttributeType
org.apache.spark.ml.classification.LogisticRegressionTrainingSummary
org.apache.spark.ml.classification.LogisticRegressionSummary
org.apache.spark.ml.feature.Term
org.apache.spark.ml.feature.InteractableTerm
org.apache.spark.ml.optim.WeightedLeastSquares.Solver
org.apache.spark.ml.optim.NormalEquationSolver
org.apache.spark.ml.tree.Node
org.apache.spark.ml.tree.Split
org.apache.spark.ml.util.BaseReadWrite
org.apache.spark.ml.linalg.Matrix
org.apache.spark.ml.linalg.Vector
org.apache.spark.mllib.stat.test.StreamingTestMethod
org.apache.spark.mllib.tree.model.TreeEnsembleModel
Experimental:
org.apache.spark.ml.classification.LinearSVC
org.apache.spark.ml.classification.LinearSVCModel
org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary
org.apache.spark.ml.classification.BinaryLogisticRegressionSummary
org.apache.spark.ml.clustering.ClusteringSummary
org.apache.spark.ml.clustering.BisectingKMeansSummary
org.apache.spark.ml.clustering.GaussianMixtureSummary
org.apache.spark.ml.clustering.KMeansSummary
org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
org.apache.spark.ml.evaluation.RegressionEvaluator
org.apache.spark.ml.feature.BucketedRandomProjectionLSH(Model)
org.apache.spark.ml.feature.Imputer(Model)
org.apache.spark.ml.feature.MinHash(Model)
org.apache.spark.ml.feature.RFormula(Model)
org.apache.spark.ml.fpm.FPGrowth(Model)
org.apache.spark.ml.regression.AFTSurvivalRegression(Model)
org.apache.spark.ml.regression.GeneralizedLinearRegression(Model) and summary
org.apache.spark.ml.regression.LinearRegressionTrainingSummary
org.apache.spark.ml.stat.ChiSquareTest
Developer API:
Most developer APIs are the basic components of the ML pipeline, such as Transformer, Estimator, PipelineStage, Params and Attributes, which I don't see a need to change.
final class:
org.apache.spark.ml.classification.OneVsRest
org.apache.spark.ml.evaluation.RegressionEvaluator
org.apache.spark.ml.feature.Binarizer
org.apache.spark.ml.feature.Bucketizer
org.apache.spark.ml.feature.ChiSqSelector
org.apache.spark.ml.feature.IDF
org.apache.spark.ml.feature.QuantileDiscretizer
org.apache.spark.ml.feature.VectorSlicer
org.apache.spark.ml.feature.Word2Vec
org.apache.spark.ml.param.ParamMap
Most of the final classes here should be ready to be unmarked. I also checked final methods and fields (most params), which can be kept the same for now.
> ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-20502
> URL: https://issues.apache.org/jira/browse/SPARK-20502
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, GraphX, ML, MLlib
> Reporter: Joseph K. Bradley
> Priority: Blocker
>
> We should make a pass through the items marked as Experimental or
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are
> stable enough to be opened up as APIs.
[jira] [Created] (SPARK-20351) Add trait hasTrainingSummary to replace the duplicate code
yuhao yang created SPARK-20351: -- Summary: Add trait hasTrainingSummary to replace the duplicate code Key: SPARK-20351 URL: https://issues.apache.org/jira/browse/SPARK-20351 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor Add a trait HasTrainingSummary to avoid code duplication related to the training summary.
[jira] [Created] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC
yuhao yang created SPARK-20348: -- Summary: Support squared hinge loss (L2 loss) for LinearSVC Key: SPARK-20348 URL: https://issues.apache.org/jira/browse/SPARK-20348 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor While hinge loss is the standard loss function for linear SVM, squared hinge loss (a.k.a. L2 loss) is also popular in practice. L2-SVM is differentiable and imposes a larger (quadratic vs. linear) penalty on points that violate the margin. An introduction can be found at http://mccormickml.com/2015/01/06/what-is-an-l2-svm/ Liblinear and [scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html] both offer squared hinge loss as the default loss function for linear SVM.
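The difference between the two losses discussed above can be stated in a few lines. This is a plain-Python sketch of the textbook definitions (not Spark code): both losses take the signed margin y * w.x, the hinge penalty grows linearly past the margin while the squared hinge penalty grows quadratically and is differentiable everywhere:

```python
def hinge(margin):
    # Standard SVM hinge loss: linear penalty for margin violations,
    # non-differentiable at margin == 1.
    return max(0.0, 1.0 - margin)

def squared_hinge(margin):
    # L2-SVM loss: quadratic penalty, smooth everywhere, which suits
    # quasi-Newton optimizers such as LBFGS.
    return max(0.0, 1.0 - margin) ** 2
```

Note that for strong violations (margin well below 1) the squared hinge penalty dominates the linear one, which is the "bigger loss for points which violate the margin" mentioned in the issue.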
[jira] [Commented] (SPARK-7128) Add generic bagging algorithm to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965121#comment-15965121 ] yuhao yang commented on SPARK-7128: --- I would vote for adding this now. This is quite helpful in practical applications like fraud detection, and feynmanliang has started with a solid prototype. I can help finish it if this is on the roadmap. > Add generic bagging algorithm to spark.ml > - > > Key: SPARK-7128 > URL: https://issues.apache.org/jira/browse/SPARK-7128 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > The Pipelines API will make it easier to create a generic Bagging algorithm > which can work with any Classifier or Regressor. Creating this feature will > require researching the possible variants and extensions of bagging which we > may want to support now and/or in the future, and planning an API which will > be properly extensible. > Note: This may interact some with the existing tree ensemble methods, but it > should be largely separate since the tree ensemble APIs and implementations > are specialized for trees. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
[ https://issues.apache.org/jira/browse/SPARK-20271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-20271: --- Description: Just to share some code I implemented to help easily create a custom Transformer in one line of code w. {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 else 0) {code} This was used in many of my projects and is pretty helpful (Maybe I'm lazy..). The transformer can be saved/loaded as other transformer and can be integrated into a pipeline normally. It can be used widely in many use cases like conditional conversion(if...else...), , type conversion, to/from Array, to/from Vector and many string ops.. was: Just to share some code I implemented to help easily create a custom Transformer in one line of code w. {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 else 0) {code} This was used in many of my projects and is pretty helpful (Maybe I'm lazy..). The transformer can be saved/loaded as other transformer and can be integrated into a pipeline normally. It can be used widely in many use cases and you can find some examples in the PR. > Add FuncTransformer to simplify custom transformer creation > --- > > Key: SPARK-20271 > URL: https://issues.apache.org/jira/browse/SPARK-20271 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Just to share some code I implemented to help easily create a custom > Transformer in one line of code w. > {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 > else 0) {code} > This was used in many of my projects and is pretty helpful (Maybe I'm > lazy..). The transformer can be saved/loaded as other transformer and can be > integrated into a pipeline normally. It can be used widely in many use cases > like conditional conversion(if...else...), , type conversion, to/from Array, > to/from Vector and many string ops.. 
[jira] [Created] (SPARK-20271) Add FuncTransformer to simplify custom transformer creation
yuhao yang created SPARK-20271: -- Summary: Add FuncTransformer to simplify custom transformer creation Key: SPARK-20271 URL: https://issues.apache.org/jira/browse/SPARK-20271 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Just to share some code I implemented to help easily create a custom Transformer in one line of code: {code} val labelConverter = new FuncTransformer((i: Double) => if (i >= 1) 1 else 0) {code} This was used in many of my projects and is pretty helpful (maybe I'm lazy..). The transformer can be saved/loaded like other transformers and can be integrated into a pipeline normally. It can be used in many use cases; you can find some examples in the PR.
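The idea behind the proposed FuncTransformer is simply to wrap a user function as a reusable column transform. A minimal Python sketch of the concept is below; the real proposal targets Spark ML DataFrames with persistence support, whereas this toy version just maps a function over a list:

```python
class FuncTransformer:
    """Toy sketch of the proposed one-line custom transformer:
    wraps a plain function and applies it element-wise."""

    def __init__(self, func):
        self.func = func

    def transform(self, values):
        # The Spark version would add an output column to a DataFrame;
        # here we simply map over an input sequence.
        return [self.func(v) for v in values]
```

Mirroring the Scala example in the issue, FuncTransformer(lambda i: 1 if i >= 1 else 0) turns scores into binary labels.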
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959368#comment-15959368 ] yuhao yang commented on SPARK-20082: Sorry I'm occupied by some internal project this week. I'll find some time to look into it this weekend or early next week. > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955705#comment-15955705 ] yuhao yang commented on SPARK-20203: [~Syrux] Since you got some experiences using the PrefixSpan, I'd like to have your input (or better contribution) in https://issues.apache.org/jira/browse/SPARK-20114 . > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20180) Unlimited max pattern length in Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15952377#comment-15952377 ] yuhao yang edited comment on SPARK-20180 at 4/1/17 8:14 PM: I assume user can achieve the same effect by setting maxPatternlength to a larger value. So the jira is really about changing the default behavior of PrefixSpan. Is there more background or context available, like why the current default length(10) is not good in practice? Thanks. We need to also consider the performance for larger dataset (in count and dimension). was (Author: yuhaoyan): I assume user can achieve the same effect by setting maxPatternlength to a larger value. So the jira is really about changing the default behavior of PrefixSpan. Is there more background or context available, like why the current default length(10) is not good in practice? Thanks. > Unlimited max pattern length in Prefix span > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use .setMaxPatternLength() method to > specify is the maximum pattern length of a sequence. Any pattern longer than > that won't be outputted. > The current default maxPatternlength value being 10. > This should be changed so that with input 0, all pattern of any length would > be outputted. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in his dataset without looking at this > parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944239#comment-15944239 ] yuhao yang edited comment on SPARK-20114 at 3/27/17 11:42 PM: -- Currently I prefer to implement the dummy PrefixSpanModel, since the sequential rules extracted won't be very useful. And if needed, we can implement other algorithms to extract sequential rules for prediction. was (Author: yuhaoyan): Currently I prefer to implement the dummy PrefixSpanModel, since the sequential rules extracted won't be very useful. > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Creating this JIRA to track the feature parity for PrefixSpan and sequential > pattern mining in spark.ml with the DataFrame API. > First, list a few design issues to be discussed; then subtasks like the Scala, > Python and R APIs will be created. > # Wrapping the MLlib PrefixSpan and providing a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which are not well suited for direct prediction on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks to Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, the options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add a > new column to the input DataSet. The PrefixSpanModel is only used to provide > access to frequent sequential patterns. > #* Add the feature to extract sequential rules from sequential > patterns, then use the sequential rules in the transform as FPGrowthModel does. > The rules extracted are of the form X -> Y, where X and Y are sequential > patterns. But in practice, these rules are not very good, as they are too > precise and thus not noise tolerant. > # Different from association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth and ERMiner. The rules are X -> Y, where X is unordered and Y is > unordered, but X must appear before Y, which is more general and can work > better in practice for prediction. > I'd like to hear more from the users to see which kind of sequential rules > is more practical. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
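The second option discussed, deriving rules X -> Y by splitting frequent sequential patterns into a prefix and a suffix, can be sketched as follows. This is an illustrative toy (the function name, inputs, and confidence formula are my assumptions, not a Spark API):

```python
# Hypothetical sketch of sequential-rule extraction from frequent patterns:
# split each pattern into prefix X and suffix Y, with
# confidence(X -> Y) = support(X + Y) / support(X).

def rules_from_patterns(patterns, min_confidence):
    """patterns: dict mapping a pattern tuple to its support count."""
    rules = []
    for pattern, support in patterns.items():
        for cut in range(1, len(pattern)):
            x, y = pattern[:cut], pattern[cut:]
            if x in patterns:  # prefix must itself be frequent
                confidence = support / patterns[x]
                if confidence >= min_confidence:
                    rules.append((x, y, confidence))
    return rules

# Toy frequent patterns with supports (made-up numbers for illustration).
freq = {("a",): 3, ("a", "b"): 3, ("a", "b", "c"): 2, ("b",): 3}
rules = rules_from_patterns(freq, min_confidence=0.5)
# e.g. ("a",) -> ("b",) with confidence 1.0
```

Because both X and Y here are exact ordered subsequences, the rules fire only on near-exact matches, which is the "too precise, not noise tolerant" drawback the comment describes; RuleGrowth/ERMiner-style rules with unordered X and Y relax this.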
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-20114: --- Description: Creating this JIRA to track the feature parity for PrefixSpan and sequential pattern mining in spark.ml with the DataFrame API. First, list a few design issues to be discussed; then subtasks like the Scala, Python and R APIs will be created. # Wrapping the MLlib PrefixSpan and providing a generic fit() should be straightforward. Yet PrefixSpan only extracts frequent sequential patterns, which are not well suited for direct prediction on new records. Please read http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ for some background knowledge. Thanks to Philippe Fournier-Viger for providing insights. If we want to keep using the Estimator/Transformer pattern, the options are: #* Implement a dummy transform for PrefixSpanModel, which will not add a new column to the input DataSet. The PrefixSpanModel is only used to provide access to frequent sequential patterns. #* Add the feature to extract sequential rules from sequential patterns, then use the sequential rules in the transform as FPGrowthModel does. The rules extracted are of the form X -> Y, where X and Y are sequential patterns. But in practice, these rules are not very good, as they are too precise and thus not noise tolerant. # Different from association rules and frequent itemsets, sequential rules can be extracted from the original dataset more efficiently using algorithms like RuleGrowth and ERMiner. The rules are X -> Y, where X is unordered and Y is unordered, but X must appear before Y, which is more general and can work better in practice for prediction. I'd like to hear more from the users to see which kind of sequential rules is more practical. was: Creating this JIRA to track the feature parity for PrefixSpan and sequential pattern mining in spark.ml with the DataFrame API. First, list a few design issues to be discussed; then subtasks like the Scala, Python and R APIs will be created. # Wrapping the MLlib PrefixSpan and providing a generic fit() should be straightforward. Yet PrefixSpan only extracts frequent sequential patterns, which are not well suited for direct prediction on new records. Please read http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ for some background knowledge. Thanks to Philippe Fournier-Viger for providing insights. If we want to keep using the Estimator/Transformer pattern, the options are: #* Implement a dummy transform for PrefixSpanModel, which will not add a new column to the input DataSet. #* Add the feature to extract sequential rules from sequential patterns, then use the sequential rules in the transform as FPGrowthModel does. The rules extracted are of the form X -> Y, where X and Y are sequential patterns. But in practice, these rules are not very good, as they are too precise and thus not noise tolerant. # Different from association rules and frequent itemsets, sequential rules can be extracted from the original dataset more efficiently using algorithms like RuleGrowth, ERMiner. The rules are X -> Y, where X is unordered and Y is unordered, but X must appear before Y, which is more general and can work better in practice for prediction. I'd like to hear more from the users to see which kind of sequential rules is more practical. > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Creating this JIRA to track the feature parity for PrefixSpan and sequential > pattern mining in spark.ml with the DataFrame API. > First, list a few design issues to be discussed; then subtasks like the Scala, > Python and R APIs will be created. > # Wrapping the MLlib PrefixSpan and providing a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which are not well suited for direct prediction on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks to Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, the options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add a > new column to the input DataSet. The PrefixSpanModel is only used to provide > access to frequent sequential patterns. > #* Add the feature to extract sequential rules from sequential > patterns, then use the sequential rules in the transform as FPGrowthModel does.
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-20114: --- Description: Creating this JIRA to track the feature parity for PrefixSpan and sequential pattern mining in spark.ml with the DataFrame API. First, list a few design issues to be discussed; then subtasks like the Scala, Python and R APIs will be created. # Wrapping the MLlib PrefixSpan and providing a generic fit() should be straightforward. Yet PrefixSpan only extracts frequent sequential patterns, which are not well suited for direct prediction on new records. Please read http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ for some background knowledge. Thanks to Philippe Fournier-Viger for providing insights. If we want to keep using the Estimator/Transformer pattern, the options are: #* Implement a dummy transform for PrefixSpanModel, which will not add a new column to the input DataSet. #* Add the feature to extract sequential rules from sequential patterns, then use the sequential rules in the transform as FPGrowthModel does. The rules extracted are of the form X -> Y, where X and Y are sequential patterns. But in practice, these rules are not very good, as they are too precise and thus not noise tolerant. # Different from association rules and frequent itemsets, sequential rules can be extracted from the original dataset more efficiently using algorithms like RuleGrowth, ERMiner. The rules are X -> Y, where X is unordered and Y is unordered, but X must appear before Y, which is more general and can work better in practice for prediction. I'd like to hear more from the users to see which kind of sequential rules is more practical. was: Creating this JIRA to track the feature parity for PrefixSpan and sequential pattern mining in spark.ml with the DataFrame API. First, list a few design issues to be discussed; then subtasks like Scala, Python and R will be created. # Wrapping the MLlib PrefixSpan and providing a generic fit() should be straightforward. Yet PrefixSpan only extracts frequent sequential patterns, which are not well suited for direct prediction on new records. Please read http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ for some background knowledge. Thanks to Philippe Fournier-Viger for providing insights. If we want to keep using the Estimator/Transformer pattern, the options are: #* Implement a dummy transform for PrefixSpanModel, which will not add a new column to the input DataSet. #* Add the feature to extract sequential rules from sequential patterns, then use the sequential rules in the transform as FPGrowthModel does. The rules extracted are of the form X -> Y, where X and Y are sequential patterns. But in practice, these rules are not very good, as they are too precise and thus not noise tolerant. # Different from association rules and frequent itemsets, sequential rules can be extracted from the original dataset more efficiently using algorithms like RuleGrowth, ERMiner. The rules are X -> Y, where X is unordered and Y is unordered, but X must appear before Y, which is more general and can work better in practice for prediction. I'd like to hear more from the users to see which kind of sequential rules is more practical. > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > Creating this JIRA to track the feature parity for PrefixSpan and sequential > pattern mining in spark.ml with the DataFrame API. > First, list a few design issues to be discussed; then subtasks like the Scala, > Python and R APIs will be created. > # Wrapping the MLlib PrefixSpan and providing a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which are not well suited for direct prediction on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks to Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, the options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add a > new column to the input DataSet. > #* Add the feature to extract sequential rules from sequential > patterns, then use the sequential rules in the transform as FPGrowthModel does. > The rules extracted are of the form X -> Y, where X and Y are sequential > patterns. But in practice, these rules are not very good, as they are too > precise and thus not
[jira] [Created] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
yuhao yang created SPARK-20114: -- Summary: spark.ml parity for sequential pattern mining - PrefixSpan Key: SPARK-20114 URL: https://issues.apache.org/jira/browse/SPARK-20114 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Creating this JIRA to track the feature parity for PrefixSpan and sequential pattern mining in spark.ml with the DataFrame API. First, list a few design issues to be discussed; then subtasks like Scala, Python and R will be created. # Wrapping the MLlib PrefixSpan and providing a generic fit() should be straightforward. Yet PrefixSpan only extracts frequent sequential patterns, which are not well suited for direct prediction on new records. Please read http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ for some background knowledge. Thanks to Philippe Fournier-Viger for providing insights. If we want to keep using the Estimator/Transformer pattern, the options are: #* Implement a dummy transform for PrefixSpanModel, which will not add a new column to the input DataSet. #* Add the feature to extract sequential rules from sequential patterns, then use the sequential rules in the transform as FPGrowthModel does. The rules extracted are of the form X -> Y, where X and Y are sequential patterns. But in practice, these rules are not very good, as they are too precise and thus not noise tolerant. # Different from association rules and frequent itemsets, sequential rules can be extracted from the original dataset more efficiently using algorithms like RuleGrowth, ERMiner. The rules are X -> Y, where X is unordered and Y is unordered, but X must appear before Y, which is more general and can work better in practice for prediction. I'd like to hear more from the users to see which kind of sequential rules is more practical.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20083) Change matrix toArray to not create a new array when matrix is already column major
[ https://issues.apache.org/jira/browse/SPARK-20083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943857#comment-15943857 ] yuhao yang commented on SPARK-20083: So the returned array will allow users to mutate the matrix values. Is that intentional? > Change matrix toArray to not create a new array when matrix is already column > major > --- > > Key: SPARK-20083 > URL: https://issues.apache.org/jira/browse/SPARK-20083 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Seth Hendrickson >Priority: Minor > > {{toArray}} always creates a new array in column-major format, even when the > resulting array is the same as the backing values. We should change this to > just return a reference to the values array when it is already column major. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
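The aliasing concern raised in the comment can be shown with a minimal stand-in class (hypothetical, not Spark's DenseMatrix): returning the backing array directly avoids an allocation but lets callers mutate the matrix through it, while copying preserves encapsulation.

```python
# Hypothetical DenseMatrix stand-in illustrating the copy-vs-reference trade-off.

class DenseMatrix:
    def __init__(self, values):
        self.values = values  # column-major backing store

    def to_array_copy(self):
        # Current behavior being discussed: always allocate a fresh array.
        return list(self.values)

    def to_array_ref(self):
        # Proposed fast path: return the backing array itself (aliased).
        return self.values

m = DenseMatrix([1.0, 2.0, 3.0, 4.0])

copy = m.to_array_copy()
copy[0] = 99.0   # the matrix is unaffected

ref = m.to_array_ref()
ref[0] = 99.0    # the matrix is mutated through the returned reference
```

This is exactly the question posed above: the zero-copy path trades immutability guarantees for performance, so the change would need to document (or guard against) external mutation.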