[jira] [Commented] (SPARK-19975) Add map_keys and map_values functions to Python
[ https://issues.apache.org/jira/browse/SPARK-19975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929342#comment-15929342 ] Yong Tang commented on SPARK-19975: --- Created a PR for that: https://github.com/apache/spark/pull/17328 Please take a look. > Add map_keys and map_values functions to Python > - > > Key: SPARK-19975 > URL: https://issues.apache.org/jira/browse/SPARK-19975 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Maciej Bryński > > We have `map_keys` and `map_values` functions in SQL. > There are no equivalent Python functions for them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
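For readers unfamiliar with these functions, a minimal sketch of what they do on the SQL side, which the PR mirrors in Python. This is illustrative only and not taken from the PR; it assumes a spark-shell session (a SparkSession named "spark" in scope) and an illustrative map column named "m".

{code}
// The map_keys / map_values SQL functions already work through SQL expressions;
// the new Python functions are thin wrappers over the same expressions.
val df = spark.sql("SELECT map('a', 1, 'b', 2) AS m")

// map_keys(m) yields the array of keys (a, b); map_values(m) yields the values (1, 2).
df.selectExpr("map_keys(m) AS keys", "map_values(m) AS values").show()
{code}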
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883417#comment-15883417 ] Yong Tang commented on SPARK-14409: --- Thanks [~mlnick] for the reminder. I will take a look and update the PR as needed. (I am on the road until next Wednesday. Will try to get it by the end of next week.) > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245084#comment-15245084 ] Yong Tang edited comment on SPARK-14409 at 4/18/16 3:12 AM: Thanks [~mlnick] for the references. I will take a look at those and see what we can do with them. By the way, initially I thought I could easily call RankingMetrics in mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am having some trouble with the implementation because {code} @Since("2.0.0") override def evaluate(dataset: Dataset[_]): Double {code} in `RankingEvaluator` takes a Dataset that is not easy to convert into the `RDD[(Array[T], Array[T])]` that RankingMetrics expects. I will do some further investigation. If I cannot find an easy way to convert the dataset into the generic `RDD[(Array[T], Array[T])]`, I will implement the methods directly in the new ml.evaluation (instead of calling mllib.evaluation). was (Author: yongtang): Thanks [~mlnick] for the references. I will take a look at those and see what we can do with them. By the way, initially I thought I could easily call RankingMetrics in mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am having some trouble with the implementation because ` @Since("2.0.0") override def evaluate(dataset: Dataset[_]): Double ` in `RankingEvaluator` takes a Dataset that is not easy to convert into the `RDD[(Array[T], Array[T])]` that RankingMetrics expects. I will do some further investigation. If I cannot find an easy way to convert the dataset into the generic `RDD[(Array[T], Array[T])]`, I will implement the methods directly in the new ml.evaluation (instead of calling mllib.evaluation). > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
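A rough sketch of the bridging discussed above, i.e. reshaping the evaluator's input Dataset into the `RDD[(Array[T], Array[T])]` that RankingMetrics expects. This is only an illustration, not the code from the pull request; the column names "prediction" and "label", the class name, and the choice of mean average precision as the metric are assumptions.

{code}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.{Dataset, Row}

// Sketch only: wrap mllib's RankingMetrics behind the ml Evaluator API by
// reshaping two array columns ("prediction": ranked ids, "label": relevant ids)
// into the RDD[(Array[T], Array[T])] input that RankingMetrics expects.
class RankingEvaluatorSketch(override val uid: String) extends Evaluator {

  def this() = this(Identifiable.randomUID("rankingEvalSketch"))

  override def evaluate(dataset: Dataset[_]): Double = {
    val predictionAndLabels = dataset
      .select("prediction", "label")
      .rdd
      .map { case Row(pred: Seq[_], lab: Seq[_]) =>
        (pred.map(_.toString).toArray, lab.map(_.toString).toArray)
      }
    // Mean average precision is just one possible metric to surface here.
    new RankingMetrics(predictionAndLabels).meanAveragePrecision
  }

  override def isLargerBetter: Boolean = true

  override def copy(extra: ParamMap): RankingEvaluatorSketch = defaultCopy(extra)
}
{code}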
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245084#comment-15245084 ] Yong Tang commented on SPARK-14409: --- Thanks [~mlnick] for the references. I will take a look at those and see what we can do with them. By the way, initially I thought I could easily call RankingMetrics in mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am having some trouble with the implementation because ` @Since("2.0.0") override def evaluate(dataset: Dataset[_]): Double ` in `RankingEvaluator` takes a Dataset that is not easy to convert into the `RDD[(Array[T], Array[T])]` that RankingMetrics expects. I will do some further investigation. If I cannot find an easy way to convert the dataset into the generic `RDD[(Array[T], Array[T])]`, I will implement the methods directly in the new ml.evaluation (instead of calling mllib.evaluation). > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240607#comment-15240607 ] Yong Tang commented on SPARK-14409: --- Thanks [~mlnick] [~josephkb]. Yes I think wrapping RankingMetrics could be the first step and reimplementing all RankingEvaluator methods in ML using DataFrames would be good after that. I will work on the reimplementation in several followup PRs. > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14565) RandomForest should use parseInt and parseDouble for feature subset size instead of regexes
[ https://issues.apache.org/jira/browse/SPARK-14565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239396#comment-15239396 ] Yong Tang commented on SPARK-14565: --- Hi [~mengxr], I created a pull request to change regex to parseInt and parseDouble: https://github.com/apache/spark/pull/12360 Please let me know if there are any issues. > RandomForest should use parseInt and parseDouble for feature subset size > instead of regexes > --- > > Key: SPARK-14565 > URL: https://issues.apache.org/jira/browse/SPARK-14565 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Xiangrui Meng >Assignee: Yong Tang > > Using regex is not robust and hard to maintain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
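A sketch of the parsing direction referred to above, not the patch itself: named strategies are matched first, and anything else is tried as an integer count via toInt, then as a fraction in (0, 1] via toDouble (which also covers the numeric values proposed in SPARK-3724). The helper name and exact bounds are assumptions.

{code}
import scala.util.Try

// Sketch: resolve a featureSubsetStrategy string to a concrete number of
// features per node without regexes. Named strategies are handled first;
// anything else is tried as an integer count, then as a fraction in (0, 1].
def resolveFeatureSubset(strategy: String, numFeatures: Int): Int = strategy.toLowerCase match {
  case "all"      => numFeatures
  case "sqrt"     => math.sqrt(numFeatures).ceil.toInt
  case "log2"     => math.max(1, (math.log(numFeatures) / math.log(2)).ceil.toInt)
  case "onethird" => (numFeatures / 3.0).ceil.toInt
  case s =>
    Try(s.toInt).toOption match {
      case Some(n) if n > 0 => math.min(n, numFeatures)
      case _ =>
        Try(s.toDouble).toOption match {
          case Some(f) if f > 0.0 && f <= 1.0 => math.max(1, (f * numFeatures).ceil.toInt)
          case _ =>
            throw new IllegalArgumentException(s"Unsupported featureSubsetStrategy: $strategy")
        }
    }
}

// e.g. resolveFeatureSubset("3", 10) == 3, resolveFeatureSubset("0.5", 10) == 5
{code}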
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238462#comment-15238462 ] Yong Tang commented on SPARK-14409: --- Thanks [~mlnick] for the review. I was planning to add MRR to RankingMetrics and then wrap that as a first step. But if you think it makes sense, I can reimplement from scratch. Please let me know which way would be better and I will move forward with it. Thanks. > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14531) Flume streaming should respect maxRate (and backpressure)
[ https://issues.apache.org/jira/browse/SPARK-14531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237864#comment-15237864 ] Yong Tang commented on SPARK-14531: --- Thanks [~hermansc], I noticed that my previous understanding may not be correct. Let me do some further investigation and see what I could do to update the pull request. > Flume streaming should respect maxRate (and backpressure) > - > > Key: SPARK-14531 > URL: https://issues.apache.org/jira/browse/SPARK-14531 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.1 >Reporter: Herman Schistad >Priority: Minor > > As far as I can understand the FlumeUtils.createPollingStream(...) ignores > key spark streaming configuration options such as: > spark.streaming.backpressure.enabled > spark.streaming.receiver.maxRate > ... > I might just be mistaken, but since the DEFAULT_POLLING_BATCH_SIZE is set to > 1000 in the source code itself, then I presume it should use the variables > above instead. > *Relevant code:* > https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14546) Scale Wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236556#comment-15236556 ] Yong Tang commented on SPARK-14546: --- [~aloknsingh] I can work on this one if no one has started yet. Thanks. > Scale Wrapper in SparkR > --- > > Key: SPARK-14546 > URL: https://issues.apache.org/jira/browse/SPARK-14546 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Alok Singh > > ML has the StandardScaler, which seems to be very commonly used. > This jira is to implement the SparkR wrapper for it. > Here is the R scale command > https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236541#comment-15236541 ] Yong Tang commented on SPARK-14409: --- [~mlnick] [~josephkb] I added a short doc in Google Drive with comments enabled: https://docs.google.com/document/d/1YEvf5eEm2vRcALJs39yICWmUx6xFW5j8DvXFWbRbStE/edit?usp=sharing Please let me know if there is any feedback. Thanks. > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14531) Flume streaming should respect maxRate (and backpressure)
[ https://issues.apache.org/jira/browse/SPARK-14531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235797#comment-15235797 ] Yong Tang commented on SPARK-14531: --- Hi [~hermansc], I created a pull request https://github.com/apache/spark/pull/12305 to allow maxRate to be passed in the conf. Is this what you expected? Thanks. > Flume streaming should respect maxRate (and backpressure) > - > > Key: SPARK-14531 > URL: https://issues.apache.org/jira/browse/SPARK-14531 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.1 >Reporter: Herman Schistad >Priority: Minor > > As far as I can understand the FlumeUtils.createPollingStream(...) ignores > key spark streaming configuration options such as: > spark.streaming.backpressure.enabled > spark.streaming.receiver.maxRate > ... > I might just be mistaken, but since the DEFAULT_POLLING_BATCH_SIZE is set to > 1000 in the source code itself, then I presume it should use the variables > above instead. > *Relevant code:* > https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
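A simplified sketch of the idea, not the actual change in the pull request: derive the polling batch size from spark.streaming.receiver.maxRate when it is set, otherwise fall back to the hard-coded default. The helper name is illustrative.

{code}
import org.apache.spark.SparkConf

// Sketch: derive the Flume polling batch size from the receiver rate limit
// instead of always using the hard-coded default of 1000 events per batch.
val DEFAULT_POLLING_BATCH_SIZE = 1000

def pollingBatchSize(conf: SparkConf): Int = {
  // Fall back to the default when the rate limit is unset or non-positive.
  val maxRate = conf.getInt("spark.streaming.receiver.maxRate", 0)
  if (maxRate > 0) math.min(maxRate, DEFAULT_POLLING_BATCH_SIZE)
  else DEFAULT_POLLING_BATCH_SIZE
}

// e.g. with spark.streaming.receiver.maxRate=100 the receiver would poll at
// most 100 events per batch; backpressure would still need separate handling.
{code}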
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228741#comment-15228741 ] Yong Tang commented on SPARK-14409: --- [~josephkb] Sure. Let me do some investigation on other libraries, and then I will add a design doc. > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227743#comment-15227743 ] Yong Tang commented on SPARK-14409: --- [~mlnick] I can work on this issue if no one has started yet. Thanks. > Investigate adding a RankingEvaluator to ML > --- > > Key: SPARK-14409 > URL: https://issues.apache.org/jira/browse/SPARK-14409 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no > {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful > for recommendation evaluation (and can be useful in other settings > potentially). > Should be thought about in conjunction with adding the "recommendAll" methods > in SPARK-13857, so that top-k ranking metrics can be used in cross-validators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14368) Support python.spark.worker.memory with upper-case unit
[ https://issues.apache.org/jira/browse/SPARK-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225471#comment-15225471 ] Yong Tang commented on SPARK-14368: --- That looks like an easy fix. Will create a pull request shortly. > Support python.spark.worker.memory with upper-case unit > --- > > Key: SPARK-14368 > URL: https://issues.apache.org/jira/browse/SPARK-14368 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Masahiro TANAKA >Priority: Trivial > > According to the > [document|https://spark.apache.org/docs/latest/configuration.html], > spark.python.worker.memory is in the same format as a JVM memory string. But > an upper-case unit is not allowed in `spark.python.worker.memory`. It should be > allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
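The gist of such a fix, sketched here in Scala for illustration (the actual change would be on the PySpark side): normalize the unit's case before parsing a JVM-style memory string, so "512m", "512M", "2g" and "2G" are all accepted. Function name and unit table are assumptions.

{code}
// Illustrative only: case-insensitive parsing of a JVM-style memory string,
// returning the size in megabytes.
def memoryStringToMb(str: String): Int = {
  val s = str.trim.toLowerCase   // normalize the unit before matching
  val (digits, unit) = s.span(_.isDigit)
  require(digits.nonEmpty, s"Invalid memory string: $str")
  val n = digits.toLong
  unit match {
    case "" | "b"   => (n / 1024 / 1024).toInt
    case "k" | "kb" => (n / 1024).toInt
    case "m" | "mb" => n.toInt
    case "g" | "gb" => (n * 1024).toInt
    case "t" | "tb" => (n * 1024 * 1024).toInt
    case other      => throw new IllegalArgumentException(s"Unknown memory unit: $other")
  }
}

// memoryStringToMb("512M") == 512, memoryStringToMb("2g") == 2048
{code}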
[jira] [Commented] (SPARK-14335) Describe function command returns wrong output because some of built-in functions are not in function registry.
[ https://issues.apache.org/jira/browse/SPARK-14335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222761#comment-15222761 ] Yong Tang commented on SPARK-14335: --- I can work on this one. Will provide a pull request shortly. > Describe function command returns wrong output because some of built-in > functions are not in function registry. > --- > > Key: SPARK-14335 > URL: https://issues.apache.org/jira/browse/SPARK-14335 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Minor > > {code} > %sql describe function `and` > unction: and > Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd > Usage: a and b - Logical and > {code} > The output still shows Hive's function because {{and}} is not in our > FunctionRegistry. Here is a list of such kind of commands > {code} > - > ! > != > * > / > & > % > ^ > + > < > <= > <=> > <> > = > == > > > >= > | > ~ > and > between > case > in > like > not > or > rlike > when > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14301) Java examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220240#comment-15220240 ] Yong Tang commented on SPARK-14301: --- Hi [~yinxusen], would you mind if I work on this issue? Thanks. > Java examples code merge and clean up > - > > Key: SPARK-14301 > URL: https://issues.apache.org/jira/browse/SPARK-14301 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in java/examples/mllib and java/examples/ml: > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * Unsure code duplications of java/ml, double check > ** JavaDeveloperApiExample.java > ** JavaSimpleParamsExample.java > ** JavaSimpleTextClassificationPipeline.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * Unsure code duplications of java/mllib, double check > ** JavaALS.java > ** JavaFPGrowthExample.java > When merging and cleaning that code, be sure not to disturb the existing > example on/off blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib
[ https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219289#comment-15219289 ] Yong Tang commented on SPARK-14238: --- Hi [~mlnick], I created a pull request: https://github.com/apache/spark/pull/12079 Let me know if you find any issues or there is anything I need to change. Thanks. > Add binary toggle Param to PySpark HashingTF in ML & MLlib > -- > > Key: SPARK-14238 > URL: https://issues.apache.org/jira/browse/SPARK-14238 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Nick Pentreath >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
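For reference, the Scala/Java HashingTF already exposes the binary toggle that this task mirrors in PySpark. A short usage sketch on the Scala side, with the input/output column names assumed:

{code}
import org.apache.spark.ml.feature.HashingTF

// The binary param switches HashingTF from term counts to 0/1 term presence,
// which is what the new PySpark param exposes as well.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setBinary(true)
{code}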
[jira] [Commented] (SPARK-14183) UnsupportedOperationException: empty.max when fitting CrossValidator model
[ https://issues.apache.org/jira/browse/SPARK-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216206#comment-15216206 ] Yong Tang commented on SPARK-14183: --- With the latest master build the message changes to: {code} scala> val model = cv.fit(df) 16/03/29 15:47:29 WARN LogisticRegression: All labels are zero and fitIntercept=true, so the coefficients will be zeros and the intercept will be negative infinity; as a result, training is not needed. java.lang.IllegalArgumentException: requirement failed: Nothing has been added to this summarizer. at scala.Predef$.require(Predef.scala:219) at org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.normL2(MultivariateOnlineSummarizer.scala:270) at org.apache.spark.mllib.evaluation.RegressionMetrics.SSerr$lzycompute(RegressionMetrics.scala:65) at org.apache.spark.mllib.evaluation.RegressionMetrics.SSerr(RegressionMetrics.scala:65) at org.apache.spark.mllib.evaluation.RegressionMetrics.meanSquaredError(RegressionMetrics.scala:99) at org.apache.spark.mllib.evaluation.RegressionMetrics.rootMeanSquaredError(RegressionMetrics.scala:108) at org.apache.spark.ml.evaluation.RegressionEvaluator.evaluate(RegressionEvaluator.scala:94) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:110) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:100) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:100) ... 55 elided {code} That looks much better than {noformat}UnsupportedOperationException: empty.max{noformat} > UnsupportedOperationException: empty.max when fitting CrossValidator model > --- > > Key: SPARK-14183 > URL: https://issues.apache.org/jira/browse/SPARK-14183 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > The following code produces {{java.lang.UnsupportedOperationException: > empty.max}}, but it should've said what might've caused that or how to fix it. 
> The exception: > {code} > scala> val model = cv.fit(df) > java.lang.UnsupportedOperationException: empty.max > at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:227) > at scala.collection.AbstractTraversable.max(Traversable.scala:104) > at > org.apache.spark.ml.classification.MultiClassSummarizer.numClasses(LogisticRegression.scala:739) > at > org.apache.spark.ml.classification.MultiClassSummarizer.histogram(LogisticRegression.scala:743) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:288) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:261) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:160) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at org.apache.spark.ml.Estimator.fit(Estimator.scala:59) > at org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > at org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at org.apache.spark.ml.Estimator.fit(Estimator.scala:78) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:110) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:105) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:105) > ... 55 elided > {code} > The code: > {code} > import org.apache.spark.ml.tuning._ > val cv = new CrossValidator > import org.apache.spark.mllib.linalg._ > val features = Vectors.sparse(3, Array(1), Array(1d)) > val df = Seq((0, "hello world", 0d, features)).toDF("id", "text", "label", > "features") > import org.apache.spark.ml.classification._ > val lr = new LogisticRegression() > import org.apache.spark.ml.evaluation.Regressi
[jira] [Commented] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib
[ https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216121#comment-15216121 ] Yong Tang commented on SPARK-14238: --- Hi [~mlnick], do you mind if I work on this issue? > Add binary toggle Param to PySpark HashingTF in ML & MLlib > -- > > Key: SPARK-14238 > URL: https://issues.apache.org/jira/browse/SPARK-14238 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Nick Pentreath >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3724) RandomForest: More options for feature subset size
[ https://issues.apache.org/jira/browse/SPARK-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213667#comment-15213667 ] Yong Tang commented on SPARK-3724: -- [~josephkb] I just created a pull request for this issue. Let me know if there are any issues. > RandomForest: More options for feature subset size > -- > > Key: SPARK-3724 > URL: https://issues.apache.org/jira/browse/SPARK-3724 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > RandomForest currently supports using a few values for the number of features > to sample per node: all, sqrt, log2, etc. It should support any given value > (to allow model search). > Proposal: If the parameter for specifying the number of features per node is > not recognized (as “all”, “sqrt”, etc.), then it will be parsed as a > numerical value. The value should be either (a) a real value in [0,1] > specifying the fraction of features in each subset or (b) an integer value > specifying the number of features in each subset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3724) RandomForest: More options for feature subset size
[ https://issues.apache.org/jira/browse/SPARK-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213517#comment-15213517 ] Yong Tang commented on SPARK-3724: -- I can work on this one. Will create a PR soon. > RandomForest: More options for feature subset size > -- > > Key: SPARK-3724 > URL: https://issues.apache.org/jira/browse/SPARK-3724 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > RandomForest currently supports using a few values for the number of features > to sample per node: all, sqrt, log2, etc. It should support any given value > (to allow model search). > Proposal: If the parameter for specifying the number of features per node is > not recognized (as “all”, “sqrt”, etc.), then it will be parsed as a > numerical value. The value should be either (a) a real value in [0,1] > specifying the fraction of features in each subset or (b) an integer value > specifying the number of features in each subset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8954) Building Docker Images Fails in 1.4 branch
[ https://issues.apache.org/jira/browse/SPARK-8954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622780#comment-14622780 ] Yong Tang commented on SPARK-8954: -- This failure could be fixed by removing RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list from the Dockerfile. A pull request https://github.com/apache/spark/pull/7346 has been created. This pull request also removes /var/lib/apt/lists/* at the end of the package install (in Docker), which saves about 30MB of Docker image size. > Building Docker Images Fails in 1.4 branch > -- > > Key: SPARK-8954 > URL: https://issues.apache.org/jira/browse/SPARK-8954 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.4.0 > Environment: Docker >Reporter: Pradeep Bashyal > > Docker build on branch 1.4 fails when installing the jdk. It expects > tzdata-java as a dependency but adding that to the apt-get install list > doesn't help. > ~/S/s/d/spark-test git:branch-1.4 ❯❯❯ docker build -t spark-test-base base/ > ◼ > Sending build context to Docker daemon 3.072 kB > Sending build context to Docker daemon > Step 0 : FROM ubuntu:precise > ---> 78cef618c77e > Step 1 : RUN echo "deb http://archive.ubuntu.com/ubuntu precise main > universe" > /etc/apt/sources.list > ---> Using cache > ---> 2017472bec85 > Step 2 : RUN apt-get update > ---> Using cache > ---> 86b8911ead16 > Step 3 : RUN apt-get install -y less openjdk-7-jre-headless net-tools > vim-tiny sudo openssh-server > ---> Running in dc8197a0ea31 > Reading package lists... > Building dependency tree... > Reading state information... > Some packages could not be installed. This may mean that you have > requested an impossible situation or if you are using the unstable > distribution that some required packages have not yet been created > or been moved out of Incoming. > The following information may help to resolve the situation: > The following packages have unmet dependencies: > openjdk-7-jre-headless : Depends: tzdata-java but it is not going to be > installed > E: Unable to correct problems, you have held broken packages. > INFO[0004] The command [/bin/sh -c apt-get install -y less > openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server] returned a > non-zero code: 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7155) SparkContext's newAPIHadoopFile does not support comma-separated list of files, but the other API hadoopFile does.
[ https://issues.apache.org/jira/browse/SPARK-7155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yong Tang updated SPARK-7155: - Description: SparkContext's newAPIHadoopFile() does not support comma-separated list of files. For example, the following: sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text]) will throw org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt However, the other API hadoopFile() is able to process comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also able to process comma-separated list of files correctly. The problem is that newAPIHadoopFile() use addInputPath() to add the file path into NewHadoopRDD. See Ln 928-931, master branch: val job = new NewHadoopJob(conf) NewFileInputFormat.addInputPath(job, new Path(path)) val updatedConf = job.getConfiguration new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path) Change addInputPath(job, new Path(path)) to addInputPaths(job, path) will resolve this issue. was: SparkContext's newAPIHadoopFile() does not support comma-separated list of files. For example, the following: sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text]) will throw org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt However, the other API hadoopFile() is able to process comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also able to process comma-separated list of files correctly. The problem is that newAPIHadoopFile() use addInputPath() to add the file path into NewHadoopRDD. See Ln 928-931, master branch: val job = new NewHadoopJob(conf) NewFileInputFormat.addInputPath(job, new Path(path)) val updatedConf = job.getConfiguration new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path) Change addInputPath(job, new Path(path)) to addInputPaths(job, path) will resolve this issue. > SparkContext's newAPIHadoopFile does not support comma-separated list of > files, but the other API hadoopFile does. > -- > > Key: SPARK-7155 > URL: https://issues.apache.org/jira/browse/SPARK-7155 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1 > Environment: Ubuntu 14.04 >Reporter: Yong Tang > > SparkContext's newAPIHadoopFile() does not support comma-separated list of > files. For example, the following: > sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", > classOf[TextInputFormat], classOf[LongWritable], classOf[Text]) > will throw > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does > not exist: file:/root/file1.txt,/root/file2.txt > However, the other API hadoopFile() is able to process comma-separated list > of files correctly. > In addition, since sc.textFile() uses hadoopFile(), it is also able to > process comma-separated list of files correctly. > The problem is that newAPIHadoopFile() use addInputPath() to add the file > path into NewHadoopRDD. See Ln 928-931, master branch: > val job = new NewHadoopJob(conf) > NewFileInputFormat.addInputPath(job, new Path(path)) > val updatedConf = job.getConfiguration > new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path) > Change addInputPath(job, new Path(path)) to addInputPaths(job, path) will > resolve this issue. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
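To make the proposed change concrete, a small sketch at the Hadoop API level (not the Spark patch itself): addInputPaths splits a comma-separated string into separate input paths, whereas addInputPath treats the whole string as a single path.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat => NewFileInputFormat}

val conf = new Configuration()
val job = Job.getInstance(conf)

// Before: the comma-separated string is interpreted as a single (nonexistent) path.
// NewFileInputFormat.addInputPath(job, new Path("/root/file1.txt,/root/file2.txt"))

// After: each comma-separated entry becomes its own input path.
NewFileInputFormat.addInputPaths(job, "/root/file1.txt,/root/file2.txt")
{code}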
[jira] [Created] (SPARK-7155) SparkContext's newAPIHadoopFile does not support comma-separated list of files, but the other API hadoopFile does.
Yong Tang created SPARK-7155: Summary: SparkContext's newAPIHadoopFile does not support comma-separated list of files, but the other API hadoopFile does. Key: SPARK-7155 URL: https://issues.apache.org/jira/browse/SPARK-7155 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Environment: Ubuntu 14.04 Reporter: Yong Tang SparkContext's newAPIHadoopFile() does not support comma-separated list of files. For example, the following: sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text]) will throw org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt However, the other API hadoopFile() is able to process comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also able to process comma-separated list of files correctly. The problem is that newAPIHadoopFile() use addInputPath() to add the file path into NewHadoopRDD. See Ln 928-931, master branch: val job = new NewHadoopJob(conf) NewFileInputFormat.addInputPath(job, new Path(path)) val updatedConf = job.getConfiguration new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path) Change addInputPath(job, new Path(path)) to addInputPaths(job, path) will resolve this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org