[jira] [Commented] (SPARK-19975) Add map_keys and map_values functions to Python
[ https://issues.apache.org/jira/browse/SPARK-19975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929342#comment-15929342 ]

Yong Tang commented on SPARK-19975:
---
Created a PR for that: https://github.com/apache/spark/pull/17328 Please take a look.

> Add map_keys and map_values functions to Python
> ---
> Key: SPARK-19975
> URL: https://issues.apache.org/jira/browse/SPARK-19975
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.1.0
> Reporter: Maciej Bryński
>
> We have `map_keys` and `map_values` functions in SQL.
> There are no equivalent Python functions.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
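The requested SQL functions have simple semantics. As a plain-Python sketch (not PySpark code; a dict stands in for a single MapType value, and the function names here just mirror the SQL ones):

```python
# Plain-Python illustration of the semantics of the requested functions:
# for a map (dict) value, map_keys returns its keys and map_values its values.
def map_keys(m):
    """Return the keys of a map as a list (order of insertion)."""
    return list(m.keys())

def map_values(m):
    """Return the values of a map as a list (order of insertion)."""
    return list(m.values())
```

In PySpark these would be applied per row to a MapType column; the sketch only shows what each call yields for one map value.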
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883417#comment-15883417 ]

Yong Tang commented on SPARK-14409:
---
Thanks [~mlnick] for the reminder. I will take a look and update the PR as needed. (I am on the road until next Wednesday. Will try to get to it by the end of next week.)

> Investigate adding a RankingEvaluator to ML
> ---
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Nick Pentreath
> Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful
> for recommendation evaluation (and can be useful in other settings
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.
[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245084#comment-15245084 ]

Yong Tang edited comment on SPARK-14409 at 4/18/16 3:12 AM:
---
Thanks [~mlnick] for the references. I will take a look at those and see what we can do with them. By the way, I initially thought I could simply call RankingMetrics in mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am having some trouble with the implementation because the
{code}
@Since("2.0.0")
override def evaluate(dataset: Dataset[_]): Double
{code}
in `RankingEvaluator` is not easy to convert into RankingMetrics's `RDD[(Array[T], Array[T])]`. I will do some further investigation. If I cannot find an easy way to convert the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the methods directly in the new ml.evaluation (instead of calling mllib.evaluation).
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240607#comment-15240607 ]

Yong Tang commented on SPARK-14409:
---
Thanks [~mlnick] [~josephkb]. Yes, I think wrapping RankingMetrics could be the first step, and reimplementing all RankingEvaluator methods in ML using DataFrames would be good after that. I will work on the reimplementation in several follow-up PRs.
[jira] [Commented] (SPARK-14565) RandomForest should use parseInt and parseDouble for feature subset size instead of regexes
[ https://issues.apache.org/jira/browse/SPARK-14565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239396#comment-15239396 ]

Yong Tang commented on SPARK-14565:
---
Hi [~mengxr], I created a pull request to change the regex to parseInt and parseDouble: https://github.com/apache/spark/pull/12360 Please let me know if there are any issues.

> RandomForest should use parseInt and parseDouble for feature subset size
> instead of regexes
> ---
> Key: SPARK-14565
> URL: https://issues.apache.org/jira/browse/SPARK-14565
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Yong Tang
>
> Using regex is not robust and hard to maintain.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238462#comment-15238462 ]

Yong Tang commented on SPARK-14409:
---
Thanks [~mlnick] for the review. I was planning to add MRR to RankingMetrics and then wrap that as a first step. But if you think it makes sense, I can reimplement from scratch. Please let me know which way would be better and I will move forward with it. Thanks.
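For reference, MRR (mean reciprocal rank), mentioned above as a candidate addition to RankingMetrics, can be sketched in plain Python over (predicted, relevant) pairs. This is a stand-alone illustrative sketch, not Spark's implementation:

```python
def mean_reciprocal_rank(pairs):
    """Average of 1/rank of the first relevant item in each prediction list.

    `pairs` is a list of (predicted, relevant) tuples, where `predicted`
    is a ranked list of recommended items and `relevant` the ground truth.
    A query with no relevant item in its predictions contributes 0.
    """
    if not pairs:
        return 0.0
    total = 0.0
    for predicted, relevant in pairs:
        rel = set(relevant)
        for rank, item in enumerate(predicted, start=1):
            if item in rel:
                total += 1.0 / rank  # reciprocal rank of first hit
                break
    return total / len(pairs)
```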
[jira] [Commented] (SPARK-14531) Flume streaming should respect maxRate (and backpressure)
[ https://issues.apache.org/jira/browse/SPARK-14531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237864#comment-15237864 ]

Yong Tang commented on SPARK-14531:
---
Thanks [~hermansc], I noticed that my previous understanding may not be correct. Let me do some further investigation and see what I could do to update the pull request.

> Flume streaming should respect maxRate (and backpressure)
> ---
> Key: SPARK-14531
> URL: https://issues.apache.org/jira/browse/SPARK-14531
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.6.1
> Reporter: Herman Schistad
> Priority: Minor
>
> As far as I can understand, FlumeUtils.createPollingStream(...) ignores
> key Spark Streaming configuration options such as:
> spark.streaming.backpressure.enabled
> spark.streaming.receiver.maxRate
> ...
> I might just be mistaken, but since the DEFAULT_POLLING_BATCH_SIZE is set to
> 1000 in the source code itself, I presume it should use the variables
> above instead.
> *Relevant code:*
> https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala
[jira] [Commented] (SPARK-14546) Scale Wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236556#comment-15236556 ]

Yong Tang commented on SPARK-14546:
---
[~aloknsingh] I can work on this one if no one has started yet. Thanks.

> Scale Wrapper in SparkR
> ---
> Key: SPARK-14546
> URL: https://issues.apache.org/jira/browse/SPARK-14546
> Project: Spark
> Issue Type: New Feature
> Components: ML, SparkR
> Reporter: Alok Singh
>
> ML has the StandardScaler, which seems to be very commonly used.
> This jira is to implement the SparkR wrapper for it.
> Here is the R scale command:
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236541#comment-15236541 ]

Yong Tang commented on SPARK-14409:
---
[~mlnick] [~josephkb] I added a short doc on Google Drive with commenting enabled: https://docs.google.com/document/d/1YEvf5eEm2vRcALJs39yICWmUx6xFW5j8DvXFWbRbStE/edit?usp=sharing Please let me know if there is any feedback. Thanks
[jira] [Commented] (SPARK-14531) Flume streaming should respect maxRate (and backpressure)
[ https://issues.apache.org/jira/browse/SPARK-14531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235797#comment-15235797 ]

Yong Tang commented on SPARK-14531:
---
Hi [~hermansc] I created a pull request https://github.com/apache/spark/pull/12305 to allow maxRate to be passed in the conf. Is this something you expect? Thanks.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15228741#comment-15228741 ]

Yong Tang commented on SPARK-14409:
---
[~josephkb] Sure. Let me do some investigation on other libraries, then I will add a design doc.
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227743#comment-15227743 ]

Yong Tang commented on SPARK-14409:
---
[~mlnick] I can work on this issue if no one has started yet. Thanks.
[jira] [Commented] (SPARK-14368) Support python.spark.worker.memory with upper-case unit
[ https://issues.apache.org/jira/browse/SPARK-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15225471#comment-15225471 ]

Yong Tang commented on SPARK-14368:
---
That looks like an easy fix. Will create a pull request shortly.

> Support python.spark.worker.memory with upper-case unit
> ---
> Key: SPARK-14368
> URL: https://issues.apache.org/jira/browse/SPARK-14368
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.1
> Reporter: Masahiro TANAKA
> Priority: Trivial
>
> According to the
> [document|https://spark.apache.org/docs/latest/configuration.html],
> spark.python.worker.memory is in the same format as a JVM memory string. But
> an upper-case unit is not allowed in `spark.python.worker.memory`. It should be
> allowed.
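The fix amounts to normalizing the unit's case before parsing. A minimal sketch of such a parser in plain Python (the helper name and exact unit table are illustrative, not PySpark's actual code):

```python
def parse_memory(s):
    """Parse a JVM-style memory string (e.g. '512m', '2g') into MiB.

    Lower-casing before looking up the unit is what makes '512M' and
    '512m' equivalent, which is the behavior this issue asks for.
    """
    units = {'k': 1.0 / 1024, 'm': 1, 'g': 1024, 't': 1 << 20}
    s = s.strip().lower()
    if not s or s[-1] not in units:
        raise ValueError("invalid memory string: " + repr(s))
    return int(float(s[:-1]) * units[s[-1]])
```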
[jira] [Commented] (SPARK-14335) Describe function command returns wrong output because some of built-in functions are not in function registry.
[ https://issues.apache.org/jira/browse/SPARK-14335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15222761#comment-15222761 ]

Yong Tang commented on SPARK-14335:
---
I can work on this one. Will provide a pull request shortly.

> Describe function command returns wrong output because some of the built-in
> functions are not in the function registry.
> ---
> Key: SPARK-14335
> URL: https://issues.apache.org/jira/browse/SPARK-14335
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Yin Huai
> Priority: Minor
>
> {code}
> %sql describe function `and`
> Function: and
> Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd
> Usage: a and b - Logical and
> {code}
> The output still shows Hive's function because {{and}} is not in our
> FunctionRegistry. Here is a list of such commands:
> {code}
> - ! != * / & % ^ + < <= <=> <> = == > >= | ~
> and between case in like not or rlike when
> {code}
[jira] [Commented] (SPARK-14301) Java examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220240#comment-15220240 ]

Yong Tang commented on SPARK-14301:
---
Hi [~yinxusen] would you mind if I work on this issue? Thanks.

> Java examples code merge and clean up
> ---
> Key: SPARK-14301
> URL: https://issues.apache.org/jira/browse/SPARK-14301
> Project: Spark
> Issue Type: Sub-task
> Components: Examples
> Reporter: Xusen Yin
> Priority: Minor
> Labels: starter
>
> Duplicated code that I found in java/examples/mllib and java/examples/ml:
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * Unsure code duplications of java/ml, double check
> ** JavaDeveloperApiExample.java
> ** JavaSimpleParamsExample.java
> ** JavaSimpleTextClassificationPipeline.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * Unsure code duplications of java/mllib, double check
> ** JavaALS.java
> ** JavaFPGrowthExample.java
> When merging and cleaning up this code, be sure not to disturb the previous
> example on and off blocks.
[jira] [Commented] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib
[ https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219289#comment-15219289 ]

Yong Tang commented on SPARK-14238:
---
Hi [~mlnick], I created a pull request: https://github.com/apache/spark/pull/12079 Let me know if you find any issues or there is anything I need to change. Thanks.

> Add binary toggle Param to PySpark HashingTF in ML & MLlib
> ---
> Key: SPARK-14238
> URL: https://issues.apache.org/jira/browse/SPARK-14238
> Project: Spark
> Issue Type: Sub-task
> Components: ML, MLlib
> Reporter: Nick Pentreath
> Priority: Minor
[jira] [Commented] (SPARK-14183) UnsupportedOperationException: empty.max when fitting CrossValidator model
[ https://issues.apache.org/jira/browse/SPARK-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216206#comment-15216206 ]

Yong Tang commented on SPARK-14183:
---
With the latest master build the message changes to:
{code}
scala> val model = cv.fit(df)
16/03/29 15:47:29 WARN LogisticRegression: All labels are zero and fitIntercept=true, so the coefficients will be zeros and the intercept will be negative infinity; as a result, training is not needed.
java.lang.IllegalArgumentException: requirement failed: Nothing has been added to this summarizer.
  at scala.Predef$.require(Predef.scala:219)
  at org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.normL2(MultivariateOnlineSummarizer.scala:270)
  at org.apache.spark.mllib.evaluation.RegressionMetrics.SSerr$lzycompute(RegressionMetrics.scala:65)
  at org.apache.spark.mllib.evaluation.RegressionMetrics.SSerr(RegressionMetrics.scala:65)
  at org.apache.spark.mllib.evaluation.RegressionMetrics.meanSquaredError(RegressionMetrics.scala:99)
  at org.apache.spark.mllib.evaluation.RegressionMetrics.rootMeanSquaredError(RegressionMetrics.scala:108)
  at org.apache.spark.ml.evaluation.RegressionEvaluator.evaluate(RegressionEvaluator.scala:94)
  at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:110)
  at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:100)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:100)
  ... 55 elided
{code}
That looks much better than {noformat}UnsupportedOperationException: empty.max{noformat}

> UnsupportedOperationException: empty.max when fitting CrossValidator model
> ---
> Key: SPARK-14183
> URL: https://issues.apache.org/jira/browse/SPARK-14183
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.0.0
> Reporter: Jacek Laskowski
> Priority: Minor
>
> The following code produces {{java.lang.UnsupportedOperationException:
> empty.max}}, but it should've said what might've caused that or how to fix it.
> The exception:
> {code}
> scala> val model = cv.fit(df)
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:227)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at org.apache.spark.ml.classification.MultiClassSummarizer.numClasses(LogisticRegression.scala:739)
> at org.apache.spark.ml.classification.MultiClassSummarizer.histogram(LogisticRegression.scala:743)
> at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:288)
> at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:261)
> at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:160)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
> at org.apache.spark.ml.Estimator.fit(Estimator.scala:59)
> at org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
> at org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at org.apache.spark.ml.Estimator.fit(Estimator.scala:78)
> at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:110)
> at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:105)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:105)
> ... 55 elided
> {code}
> The code:
> {code}
> import org.apache.spark.ml.tuning._
> val cv = new CrossValidator
> import org.apache.spark.mllib.linalg._
> val features = Vectors.sparse(3, Array(1), Array(1d))
> val df = Seq((0, "hello world", 0d, features)).toDF("id", "text", "label", "features")
> import org.apache.spark.ml.classification._
> val lr = new LogisticRegression()
> import org.apache.spark.ml.evaluation.RegressionEvaluator
>
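The `empty.max` failure quoted above has a direct analog in most languages: taking the max of an empty collection throws. A plain-Python sketch of the failure mode and one defensive alternative (illustrative only; the actual Spark change improves the error message rather than silently defaulting):

```python
scores = []  # e.g. no class labels were observed during training

# Calling max() on an empty sequence raises, much like Scala's empty.max.
try:
    best = max(scores)
except ValueError:
    best = None  # empty input: nothing to maximize over

# A defensive alternative: supply an explicit default instead of raising.
best_or_default = max(scores, default=0)
```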
[jira] [Commented] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib
[ https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216121#comment-15216121 ]

Yong Tang commented on SPARK-14238:
---
Hi [~mlnick], do you mind if I work on this issue?
[jira] [Commented] (SPARK-3724) RandomForest: More options for feature subset size
[ https://issues.apache.org/jira/browse/SPARK-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213667#comment-15213667 ]

Yong Tang commented on SPARK-3724:
---
[~josephkb] I just created a pull request for this issue. Let me know if there are any issues.

> RandomForest: More options for feature subset size
> ---
> Key: SPARK-3724
> URL: https://issues.apache.org/jira/browse/SPARK-3724
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> RandomForest currently supports using a few values for the number of features
> to sample per node: all, sqrt, log2, etc. It should support any given value
> (to allow model search).
> Proposal: If the parameter for specifying the number of features per node is
> not recognized (as “all”, “sqrt”, etc.), then it will be parsed as a
> numerical value. The value should be either (a) a real value in [0,1]
> specifying the fraction of features in each subset or (b) an integer value
> specifying the number of features in each subset.
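The proposal in the quoted description can be sketched as a small parsing helper. This is an illustrative plain-Python sketch under the stated proposal (names are hypothetical; the real code is Scala inside RandomForest):

```python
import math

def feature_subset_size(strategy, num_features):
    """Resolve a featureSubsetStrategy string to a feature count.

    Named strategies are looked up first; anything else is parsed as a
    number: a fraction in (0, 1) means a fraction of the features, and
    an integer value >= 1 means an absolute count.
    """
    named = {
        "all": num_features,
        "sqrt": max(1, int(math.sqrt(num_features))),
        "log2": max(1, int(math.log2(num_features))),
        "onethird": max(1, int(num_features / 3.0)),
    }
    s = strategy.lower()
    if s in named:
        return named[s]
    value = float(s)  # raises ValueError for unrecognized strings
    if value.is_integer() and value >= 1:
        return int(value)                         # absolute feature count
    if 0.0 < value < 1.0:
        return max(1, int(value * num_features))  # fraction of features
    raise ValueError("unsupported featureSubsetStrategy: " + strategy)
```

This mirrors why parseInt/parseDouble (SPARK-14565) is preferable to a regex: the numeric branch is just a float parse plus two range checks.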
[jira] [Commented] (SPARK-3724) RandomForest: More options for feature subset size
[ https://issues.apache.org/jira/browse/SPARK-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213517#comment-15213517 ]

Yong Tang commented on SPARK-3724:
---
I can work on this one. Will create a PR soon.
[jira] [Commented] (SPARK-8954) Building Docker Images Fails in 1.4 branch
[ https://issues.apache.org/jira/browse/SPARK-8954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622780#comment-14622780 ] Yong Tang commented on SPARK-8954:
--

This failure can be fixed by removing

RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" >> /etc/apt/sources.list

from the Dockerfile. A pull request https://github.com/apache/spark/pull/7346 has been created. This pull request also removes /var/lib/apt/lists/* at the end of the package install (in Docker), which saves about 30 MB of Docker image size.

> Building Docker Images Fails in 1.4 branch
> ------------------------------------------
>
>                 Key: SPARK-8954
>                 URL: https://issues.apache.org/jira/browse/SPARK-8954
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.4.0
>         Environment: Docker
>            Reporter: Pradeep Bashyal
>
> Docker build on branch 1.4 fails when installing the JDK. It expects
> tzdata-java as a dependency, but adding that to the apt-get install list
> doesn't help.
>
> ~/S/s/d/spark-test git:branch-1.4 ❯❯❯ docker build -t spark-test-base base/
> Sending build context to Docker daemon 3.072 kB
> Sending build context to Docker daemon
> Step 0 : FROM ubuntu:precise
>  ---> 78cef618c77e
> Step 1 : RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" >> /etc/apt/sources.list
>  ---> Using cache
>  ---> 2017472bec85
> Step 2 : RUN apt-get update
>  ---> Using cache
>  ---> 86b8911ead16
> Step 3 : RUN apt-get install -y less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server
>  ---> Running in dc8197a0ea31
> Reading package lists...
> Building dependency tree...
> Reading state information...
> Some packages could not be installed. This may mean that you have requested
> an impossible situation or, if you are using the unstable distribution, that
> some required packages have not yet been created or been moved out of
> Incoming.
> The following information may help to resolve the situation:
> The following packages have unmet dependencies:
>  openjdk-7-jre-headless : Depends: tzdata-java but it is not going to be installed
> E: Unable to correct problems, you have held broken packages.
> INFO[0004] The command [/bin/sh -c apt-get install -y less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server] returned a non-zero code: 100
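The comment above amounts to two Dockerfile changes: drop the extra sources.list entry that confuses apt's dependency resolution, and clear the apt lists after installing. A sketch of what the relevant part of the base Dockerfile might look like after the change (the exact contents are in the pull request; this is illustrative only):

```dockerfile
FROM ubuntu:precise

# The fix removes the line below -- the duplicate "precise main universe"
# entry broke apt's resolution of openjdk-7-jre-headless / tzdata-java:
#   RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" >> /etc/apt/sources.list

RUN apt-get update && \
    apt-get install -y less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server && \
    # drop apt package lists to shrink the image (roughly 30 MB)
    rm -rf /var/lib/apt/lists/*
```

Keeping the install and the cleanup in one RUN layer matters: removing the lists in a later layer would not shrink the image, since the earlier layer still contains them.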
[jira] [Created] (SPARK-7155) SparkContext's newAPIHadoopFile does not support comma-separated list of files, but the other API hadoopFile does.
Yong Tang created SPARK-7155:
--

             Summary: SparkContext's newAPIHadoopFile does not support comma-separated list of files, but the other API hadoopFile does.
                 Key: SPARK-7155
                 URL: https://issues.apache.org/jira/browse/SPARK-7155
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.3.1
         Environment: Ubuntu 14.04
            Reporter: Yong Tang

SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:

sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

will throw

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt

However, the other API hadoopFile() is able to process a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also able to process a comma-separated list of files correctly.

The problem is that newAPIHadoopFile() uses addInputPath() to add the file path into NewHadoopRDD. See lines 928-931 on the master branch:

val job = new NewHadoopJob(conf)
NewFileInputFormat.addInputPath(job, new Path(path))
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)

Changing addInputPath(job, new Path(path)) to addInputPaths(job, path) will resolve this issue.
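The one-line fix described above can be sketched against the quoted snippet. This is not runnable on its own (it lives inside SparkContext and needs the Spark and Hadoop dependencies); it only shows the before/after of the proposed change:

```scala
// Sketch of the proposed change in SparkContext.newAPIHadoopFile (abbreviated).
val job = new NewHadoopJob(conf)
// Before: addInputPath treats the whole string as a single path, so
// "/root/file1.txt,/root/file2.txt" is looked up as one (nonexistent) file:
//   NewFileInputFormat.addInputPath(job, new Path(path))
// After: addInputPaths splits the string on commas and adds each path,
// matching the behavior of the older hadoopFile() API:
NewFileInputFormat.addInputPaths(job, path)
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)
```

FileInputFormat has long offered both variants: addInputPath(Job, Path) for a single path and addInputPaths(Job, String) for a comma-separated list, which is why the change is a single call-site swap.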
[jira] [Updated] (SPARK-7155) SparkContext's newAPIHadoopFile does not support comma-separated list of files, but the other API hadoopFile does.
[ https://issues.apache.org/jira/browse/SPARK-7155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yong Tang updated SPARK-7155:
--
    Description:

SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:

sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

will throw

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt

However, the other API hadoopFile() is able to process a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also able to process a comma-separated list of files correctly.

The problem is that newAPIHadoopFile() uses addInputPath() to add the file path into NewHadoopRDD. See lines 928-931 on the master branch:

val job = new NewHadoopJob(conf)
NewFileInputFormat.addInputPath(job, new Path(path))
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)

Changing addInputPath(job, new Path(path)) to addInputPaths(job, path) will resolve this issue.

  was: the previous revision of the same description text.

> SparkContext's newAPIHadoopFile does not support comma-separated list of
> files, but the other API hadoopFile does.
> ------------------------------------------------------------------------
>
>                 Key: SPARK-7155
>                 URL: https://issues.apache.org/jira/browse/SPARK-7155
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.1
>         Environment: Ubuntu 14.04
>            Reporter: Yong Tang
>
> SparkContext's newAPIHadoopFile() does not support a comma-separated list of
> files, while hadoopFile() (and therefore sc.textFile()) does. The fix is to
> change addInputPath(job, new Path(path)) to addInputPaths(job, path) in
> newAPIHadoopFile(), as described above.