[jira] [Commented] (SPARK-19975) Add map_keys and map_values functions to Python

2017-03-16 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929342#comment-15929342
 ] 

Yong Tang commented on SPARK-19975:
---

Created a PR for that:
https://github.com/apache/spark/pull/17328

Please take a look.

> Add map_keys and map_values functions  to Python 
> -
>
> Key: SPARK-19975
> URL: https://issues.apache.org/jira/browse/SPARK-19975
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Maciej Bryński
>
> We have the `map_keys` and `map_values` functions in SQL.
> There are no equivalent Python functions for them.
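Until the Python wrappers land, the existing SQL functions can at least be exercised
through selectExpr; a minimal Scala sketch (the column name `m` and the literal map are
only illustrative), assuming a spark-shell style `spark` session:

{code}
// map_keys / map_values already exist as SQL functions, so selectExpr can reach them.
val df = spark.range(1).selectExpr("map('a', 1, 'b', 2) AS m")
df.selectExpr("map_keys(m)", "map_values(m)").show()
{code}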






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-24 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883417#comment-15883417
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the reminder. I will take a look and update the PR as 
needed. (I am on the road until next Wednesday; I will try to get to it by the 
end of next week.)

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-17 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245084#comment-15245084
 ] 

Yong Tang edited comment on SPARK-14409 at 4/18/16 3:12 AM:


Thanks [~mlnick] for the references. I will take a look at those and see what 
we can do with them.

By the way, I initially thought I could simply call RankingMetrics in 
mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am 
having some trouble with the implementation because the
{code}
  @Since("2.0.0")
  override def evaluate(dataset: Dataset[_]): Double
{code}
method in `RankingEvaluator` takes a `Dataset[_]` that is not easy to convert 
into the `RDD[(Array[T], Array[T])]` that RankingMetrics expects.

I will do some further investigation. If I cannot find an easy way to convert 
the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the 
methods directly in the new ml.evaluation (instead of calling mllib.evaluation).
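For illustration, a rough sketch of the kind of conversion being discussed here; the
column names ("prediction", "label") and the assumption that both columns hold arrays of
doubles are illustrative only, not anything settled:

{code}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.{Dataset, Row}

// Sketch only: pull two array columns out of the Dataset and hand them to the
// existing mllib RankingMetrics. Not the actual implementation.
def meanAveragePrecision(dataset: Dataset[_]): Double = {
  val predictionAndLabels = dataset
    .select("prediction", "label")
    .rdd
    .map { case Row(pred: Seq[_], lab: Seq[_]) =>
      (pred.map(_.toString.toDouble).toArray, lab.map(_.toString.toDouble).toArray)
    }
  new RankingMetrics(predictionAndLabels).meanAveragePrecision
}
{code}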


was (Author: yongtang):
Thanks [~mlnick] for the references. I will take a look at those and see what 
we could do with it.

By the way, initially I though I could easily calling RankingMetrics in 
mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am 
having some trouble in implementation because the 
`
  @Since("2.0.0")
  override def evaluate(dataset: Dataset[_]): Double
`
in `RankingEvaluator` is not so easy to be converted into RankingMetrics's 
(`RDD[(Array[T], Array[T])]`).

I will do some further investigation. If I can not find a easy way to convert 
the data set into generic `RDD[(Array[T], Array[T])]`, I will go directly 
implementing the methods in new ml.evaluation (instead of calling 
mllib.evaluation).

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-17 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245084#comment-15245084
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the references. I will take a look at those and see what 
we can do with them.

By the way, I initially thought I could simply call RankingMetrics in 
mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am 
having some trouble with the implementation because the
`
  @Since("2.0.0")
  override def evaluate(dataset: Dataset[_]): Double
`
method in `RankingEvaluator` takes a `Dataset[_]` that is not easy to convert 
into the `RDD[(Array[T], Array[T])]` that RankingMetrics expects.

I will do some further investigation. If I cannot find an easy way to convert 
the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the 
methods directly in the new ml.evaluation (instead of calling mllib.evaluation).

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-13 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240607#comment-15240607
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] [~josephkb]. Yes, I think wrapping RankingMetrics could be the 
first step, and reimplementing all RankingEvaluator methods in ML using 
DataFrames would be good after that. I will work on the reimplementation in 
several follow-up PRs.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14565) RandomForest should use parseInt and parseDouble for feature subset size instead of regexes

2016-04-13 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239396#comment-15239396
 ] 

Yong Tang commented on SPARK-14565:
---

Hi [~mengxr], I created a pull request to change the regexes to parseInt and 
parseDouble:
https://github.com/apache/spark/pull/12360
Please let me know if there are any issues.

> RandomForest should use parseInt and parseDouble for feature subset size 
> instead of regexes
> ---
>
> Key: SPARK-14565
> URL: https://issues.apache.org/jira/browse/SPARK-14565
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yong Tang
>
> Using regex is not robust and hard to maintain.
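For illustration, a sketch of the parse-based validation this change points at; the
helper name and the set of named strategies shown here are illustrative, not the actual
code in the PR:

{code}
import scala.util.Try

// Sketch: accept the named strategies, a positive integer count, or a fraction in (0, 1],
// using plain toInt/toDouble parsing instead of a regex.
def isValidFeatureSubsetStrategy(value: String): Boolean = {
  val named = Set("auto", "all", "onethird", "sqrt", "log2")
  named.contains(value.toLowerCase) ||
    Try(value.toInt).toOption.exists(_ > 0) ||
    Try(value.toDouble).toOption.exists(d => d > 0.0 && d <= 1.0)
}
{code}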






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-12 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238462#comment-15238462
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the review. I was planning to add MRR to RankingMetrics 
and then wrap that as a first step. But if you think it makes sense, I can 
reimplement from scratch. Please let me know which way would be better and I 
will move forward with it. Thanks.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14531) Flume streaming should respect maxRate (and backpressure)

2016-04-12 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237864#comment-15237864
 ] 

Yong Tang commented on SPARK-14531:
---

Thanks [~hermansc], I noticed that my previous understanding may not be 
correct. Let me do some further investigation and see what I could do to update 
the pull request.

> Flume streaming should respect maxRate (and backpressure)
> -
>
> Key: SPARK-14531
> URL: https://issues.apache.org/jira/browse/SPARK-14531
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Herman Schistad
>Priority: Minor
>
> As far as I can understand, FlumeUtils.createPollingStream(...) ignores 
> key Spark Streaming configuration options such as:
> spark.streaming.backpressure.enabled
> spark.streaming.receiver.maxRate
> ...
> I might just be mistaken, but since DEFAULT_POLLING_BATCH_SIZE is set to 
> 1000 in the source code itself, I presume it should use the options 
> above instead.
> *Relevant code:*  
> https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala






[jira] [Commented] (SPARK-14546) Scale Wrapper in SparkR

2016-04-11 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236556#comment-15236556
 ] 

Yong Tang commented on SPARK-14546:
---

[~aloknsingh] I can work on this one if no one has started yet. Thanks.

> Scale Wrapper in SparkR
> ---
>
> Key: SPARK-14546
> URL: https://issues.apache.org/jira/browse/SPARK-14546
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Alok Singh
>
> ML has StandardScaler, which seems to be very commonly used.
> This JIRA is to implement the SparkR wrapper for it.
> Here is the R scale command
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-11 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236541#comment-15236541
 ] 

Yong Tang commented on SPARK-14409:
---

[~mlnick] [~josephkb] I added a short doc on Google Drive with comments enabled:
https://docs.google.com/document/d/1YEvf5eEm2vRcALJs39yICWmUx6xFW5j8DvXFWbRbStE/edit?usp=sharing
Please let me know if you have any feedback. Thanks.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14531) Flume streaming should respect maxRate (and backpressure)

2016-04-11 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235797#comment-15235797
 ] 

Yong Tang commented on SPARK-14531:
---

Hi [~hermansc], I created a pull request, 
https://github.com/apache/spark/pull/12305, to allow maxRate to be passed in 
the conf. Is this what you expected? Thanks.
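The gist of the change, as a sketch rather than the actual diff (the `ssc` and `addresses`
values are assumed to exist, and 1000 stands in for the existing DEFAULT_POLLING_BATCH_SIZE
mentioned in the issue):

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// Sketch: cap the Flume polling batch size by spark.streaming.receiver.maxRate when set.
val defaultPollingBatchSize = 1000
val maxRate = ssc.sparkContext.getConf.getInt("spark.streaming.receiver.maxRate", 0) // 0 = unset
val batchSize =
  if (maxRate > 0) math.min(defaultPollingBatchSize, maxRate) else defaultPollingBatchSize
val stream = FlumeUtils.createPollingStream(
  ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2, batchSize, 5)
{code}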

> Flume streaming should respect maxRate (and backpressure)
> -
>
> Key: SPARK-14531
> URL: https://issues.apache.org/jira/browse/SPARK-14531
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Herman Schistad
>Priority: Minor
>
> As far as I can understand, FlumeUtils.createPollingStream(...) ignores 
> key Spark Streaming configuration options such as:
> spark.streaming.backpressure.enabled
> spark.streaming.receiver.maxRate
> ...
> I might just be mistaken, but since DEFAULT_POLLING_BATCH_SIZE is set to 
> 1000 in the source code itself, I presume it should use the options 
> above instead.
> *Relevant code:*  
> https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-06 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15228741#comment-15228741
 ] 

Yong Tang commented on SPARK-14409:
---

[~josephkb] Sure. Let me do some investigation of other libraries, and then I 
will add a design doc.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-05 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227743#comment-15227743
 ] 

Yong Tang commented on SPARK-14409:
---

[~mlnick] I can work on this issue if no one has started yet. Thanks.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14368) Support python.spark.worker.memory with upper-case unit

2016-04-04 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15225471#comment-15225471
 ] 

Yong Tang commented on SPARK-14368:
---

That looks like an easy fix. Will create a pull request shortly.

> Support python.spark.worker.memory with upper-case unit
> ---
>
> Key: SPARK-14368
> URL: https://issues.apache.org/jira/browse/SPARK-14368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Masahiro TANAKA
>Priority: Trivial
>
> According to the 
> [documentation|https://spark.apache.org/docs/latest/configuration.html], 
> spark.python.worker.memory uses the same format as JVM memory strings. But 
> upper-case units are not allowed in `spark.python.worker.memory`. They should 
> be allowed.
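The actual parsing happens on the Python worker side, so the snippet below only
illustrates the normalization idea; the helper name and unit table are made up for the
example:

{code}
// Illustration only: lower-case the unit suffix before parsing, so "512M" and "512m"
// are both accepted. The real PySpark fix lives in the Python worker's memory parsing.
def parseMemoryToMb(s: String): Long = {
  val unitsInMb = Map("m" -> 1L, "g" -> 1024L, "t" -> 1024L * 1024L)
  val lower = s.trim.toLowerCase
  val (digits, suffix) = lower.span(_.isDigit)
  require(digits.nonEmpty && unitsInMb.contains(suffix), s"Invalid memory string: $s")
  digits.toLong * unitsInMb(suffix)
}

// parseMemoryToMb("512M") == 512, parseMemoryToMb("2g") == 2048
{code}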






[jira] [Commented] (SPARK-14335) Describe function command returns wrong output because some of built-in functions are not in function registry.

2016-04-02 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15222761#comment-15222761
 ] 

Yong Tang commented on SPARK-14335:
---

I can work on this one. Will provide a pull request shortly.

> Describe function command returns wrong output because some of built-in 
> functions are not in function registry.
> ---
>
> Key: SPARK-14335
> URL: https://issues.apache.org/jira/browse/SPARK-14335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> {code}
> %sql describe function `and`
> Function: and
> Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd
> Usage: a and b - Logical and
> {code}
> The output still shows Hive's function because {{and}} is not in our 
> FunctionRegistry. Here is a list of such commands:
> {code}
> -
> !
> !=
> *
> /
> &
> %
> ^
> +
> <
> <=
> <=>
> <>
> =
> ==
> >
> >=
> |
> ~
> and
> between
> case
> in
> like
> not
> or
> rlike
> when
> {code}






[jira] [Commented] (SPARK-14301) Java examples code merge and clean up

2016-03-31 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220240#comment-15220240
 ] 

Yong Tang commented on SPARK-14301:
---

Hi [~yinxusen] would you mind if I work on this issue? Thanks.

> Java examples code merge and clean up
> -
>
> Key: SPARK-14301
> URL: https://issues.apache.org/jira/browse/SPARK-14301
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in java/examples/mllib and java/examples/ml:
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * Unsure code duplications of java/ml, double check
> ** JavaDeveloperApiExample.java
> ** JavaSimpleParamsExample.java
> ** JavaSimpleTextClassificationPipeline.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * Unsure code duplications of java/mllib, double check
> ** JavaALS.java
> ** JavaFPGrowthExample.java
> When merging and cleaning up that code, be sure not to disturb the existing 
> example on/off blocks.






[jira] [Commented] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib

2016-03-30 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219289#comment-15219289
 ] 

Yong Tang commented on SPARK-14238:
---

Hi [~mlnick], I created a pull request:
https://github.com/apache/spark/pull/12079
Let me know if you find any issues or if there is anything I need to change. 
Thanks.
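For reference, a minimal sketch of the corresponding Scala-side toggle that the PySpark
param mirrors (the input/output column names are just examples):

{code}
import org.apache.spark.ml.feature.HashingTF

// The binary param makes term counts 0/1 instead of raw frequencies.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setBinary(true)
{code}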

> Add binary toggle Param to PySpark HashingTF in ML & MLlib
> --
>
> Key: SPARK-14238
> URL: https://issues.apache.org/jira/browse/SPARK-14238
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>







[jira] [Commented] (SPARK-14183) UnsupportedOperationException: empty.max when fitting CrossValidator model

2016-03-29 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216206#comment-15216206
 ] 

Yong Tang commented on SPARK-14183:
---

With the latest master build the message changes to:
{code}
scala> val model = cv.fit(df)
16/03/29 15:47:29 WARN LogisticRegression: All labels are zero and 
fitIntercept=true, so the coefficients will be zeros and the intercept will be 
negative infinity; as a result, training is not needed.
java.lang.IllegalArgumentException: requirement failed: Nothing has been added 
to this summarizer.
  at scala.Predef$.require(Predef.scala:219)
  at 
org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.normL2(MultivariateOnlineSummarizer.scala:270)
  at 
org.apache.spark.mllib.evaluation.RegressionMetrics.SSerr$lzycompute(RegressionMetrics.scala:65)
  at 
org.apache.spark.mllib.evaluation.RegressionMetrics.SSerr(RegressionMetrics.scala:65)
  at 
org.apache.spark.mllib.evaluation.RegressionMetrics.meanSquaredError(RegressionMetrics.scala:99)
  at 
org.apache.spark.mllib.evaluation.RegressionMetrics.rootMeanSquaredError(RegressionMetrics.scala:108)
  at 
org.apache.spark.ml.evaluation.RegressionEvaluator.evaluate(RegressionEvaluator.scala:94)
  at 
org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:110)
  at 
org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:100)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:100)
  ... 55 elided

{code}

That looks much better than {noformat}UnsupportedOperationException: 
empty.max{noformat}

> UnsupportedOperationException: empty.max when fitting CrossValidator model 
> ---
>
> Key: SPARK-14183
> URL: https://issues.apache.org/jira/browse/SPARK-14183
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The following code produces {{java.lang.UnsupportedOperationException: 
> empty.max}}, but it should've said what might've caused that or how to fix it.
> The exception:
> {code}
> scala> val model = cv.fit(df)
> java.lang.UnsupportedOperationException: empty.max
>   at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:227)
>   at scala.collection.AbstractTraversable.max(Traversable.scala:104)
>   at 
> org.apache.spark.ml.classification.MultiClassSummarizer.numClasses(LogisticRegression.scala:739)
>   at 
> org.apache.spark.ml.classification.MultiClassSummarizer.histogram(LogisticRegression.scala:743)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:288)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:261)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:160)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at org.apache.spark.ml.Estimator.fit(Estimator.scala:59)
>   at org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
>   at org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.ml.Estimator.fit(Estimator.scala:78)
>   at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:110)
>   at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:105)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:105)
>   ... 55 elided
> {code}
> The code:
> {code}
> import org.apache.spark.ml.tuning._
> val cv = new CrossValidator
> import org.apache.spark.mllib.linalg._
> val features = Vectors.sparse(3, Array(1), Array(1d))
> val df = Seq((0, "hello world", 0d, features)).toDF("id", "text", "label", 
> "features")
> import org.apache.spark.ml.classification._
> val lr = new LogisticRegression()
> import org.apache.spark.ml.evaluation.RegressionEvaluator
> 

[jira] [Commented] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib

2016-03-29 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216121#comment-15216121
 ] 

Yong Tang commented on SPARK-14238:
---

Hi [~mlnick], do you mind if I work on this issue?

> Add binary toggle Param to PySpark HashingTF in ML & MLlib
> --
>
> Key: SPARK-14238
> URL: https://issues.apache.org/jira/browse/SPARK-14238
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>







[jira] [Commented] (SPARK-3724) RandomForest: More options for feature subset size

2016-03-27 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213667#comment-15213667
 ] 

Yong Tang commented on SPARK-3724:
--

[~josephkb] I just created a pull request for this issue. Let me know if there 
are any problems.


> RandomForest: More options for feature subset size
> --
>
> Key: SPARK-3724
> URL: https://issues.apache.org/jira/browse/SPARK-3724
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> RandomForest currently supports using a few values for the number of features 
> to sample per node: all, sqrt, log2, etc.  It should support any given value 
> (to allow model search).
> Proposal: If the parameter for specifying the number of features per node is 
> not recognized (as “all”, “sqrt”, etc.), then it will be parsed as a 
> numerical value.  The value should be either (a) a real value in [0,1] 
> specifying the fraction of features in each subset or (b) an integer value 
> specifying the number of features in each subset.
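As an illustration of the proposal above, a sketch of how a numeric strategy value could
map to a per-node feature count (the helper name and rounding choices are illustrative
only):

{code}
import scala.util.Try

// Sketch: an integer >= 1 is an absolute feature count, a value in (0, 1] is a fraction
// of the total number of features; named strategies keep their current meaning.
def featuresPerNode(strategy: String, numFeatures: Int): Int = strategy match {
  case "all"  => numFeatures
  case "sqrt" => math.sqrt(numFeatures).ceil.toInt
  case "log2" => math.max(1, (math.log(numFeatures) / math.log(2)).ceil.toInt)
  case s if Try(s.toInt).isSuccess && s.toInt >= 1 =>
    math.min(s.toInt, numFeatures)
  case s if Try(s.toDouble).isSuccess && s.toDouble > 0.0 && s.toDouble <= 1.0 =>
    math.max(1, (s.toDouble * numFeatures).ceil.toInt)
  case other =>
    throw new IllegalArgumentException(s"Unsupported featureSubsetStrategy: $other")
}
{code}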






[jira] [Commented] (SPARK-3724) RandomForest: More options for feature subset size

2016-03-27 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213517#comment-15213517
 ] 

Yong Tang commented on SPARK-3724:
--

I can work on this one. Will create a PR soon.

> RandomForest: More options for feature subset size
> --
>
> Key: SPARK-3724
> URL: https://issues.apache.org/jira/browse/SPARK-3724
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> RandomForest currently supports using a few values for the number of features 
> to sample per node: all, sqrt, log2, etc.  It should support any given value 
> (to allow model search).
> Proposal: If the parameter for specifying the number of features per node is 
> not recognized (as “all”, “sqrt”, etc.), then it will be parsed as a 
> numerical value.  The value should be either (a) a real value in [0,1] 
> specifying the fraction of features in each subset or (b) an integer value 
> specifying the number of features in each subset.






[jira] [Commented] (SPARK-8954) Building Docker Images Fails in 1.4 branch

2015-07-10 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622780#comment-14622780
 ] 

Yong Tang commented on SPARK-8954:
--

This failure can be fixed by removing
RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" >> 
/etc/apt/sources.list
from the Dockerfile.

A pull request,
https://github.com/apache/spark/pull/7346
has been created. This pull request also removes /var/lib/apt/lists/* at the end 
of the package install (in the Dockerfile), which saves roughly 30 MB of Docker 
image size.

 Building Docker Images Fails in 1.4 branch
 --

 Key: SPARK-8954
 URL: https://issues.apache.org/jira/browse/SPARK-8954
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.0
 Environment: Docker
Reporter: Pradeep Bashyal

 Docker build on branch 1.4 fails when installing the jdk. It expects 
 tzdata-java as a dependency but adding that to the apt-get install list 
 doesn't help.
 ~/S/s/d/spark-test git:branch-1.4 ❯❯❯ docker build -t spark-test-base base/   
   ◼
 Sending build context to Docker daemon 3.072 kB
 Sending build context to Docker daemon
 Step 0 : FROM ubuntu:precise
  ---> 78cef618c77e
 Step 1 : RUN echo "deb http://archive.ubuntu.com/ubuntu precise main 
 universe" >> /etc/apt/sources.list
  ---> Using cache
  ---> 2017472bec85
 Step 2 : RUN apt-get update
  ---> Using cache
  ---> 86b8911ead16
 Step 3 : RUN apt-get install -y less openjdk-7-jre-headless net-tools 
 vim-tiny sudo openssh-server
  ---> Running in dc8197a0ea31
 Reading package lists...
 Building dependency tree...
 Reading state information...
 Some packages could not be installed. This may mean that you have
 requested an impossible situation or if you are using the unstable
 distribution that some required packages have not yet been created
 or been moved out of Incoming.
 The following information may help to resolve the situation:
 The following packages have unmet dependencies:
  openjdk-7-jre-headless : Depends: tzdata-java but it is not going to be 
 installed
 E: Unable to correct problems, you have held broken packages.
 INFO[0004] The command [/bin/sh -c apt-get install -y less 
 openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server] returned a 
 non-zero code: 100






[jira] [Created] (SPARK-7155) SparkContext's newAPIHadoopFile does not support comma-separated list of files, but the other API hadoopFile does.

2015-04-26 Thread Yong Tang (JIRA)
Yong Tang created SPARK-7155:


 Summary: SparkContext's newAPIHadoopFile does not support 
comma-separated list of files, but the other API hadoopFile does.
 Key: SPARK-7155
 URL: https://issues.apache.org/jira/browse/SPARK-7155
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
 Environment: Ubuntu 14.04
Reporter: Yong Tang


SparkContext's newAPIHadoopFile() does not support a comma-separated list of 
files. For example, the following:

sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", 
classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

will throw

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: file:/root/file1.txt,/root/file2.txt

However, the other API, hadoopFile(), is able to process a comma-separated list 
of files correctly. In addition, since sc.textFile() uses hadoopFile(), it is 
also able to process a comma-separated list of files correctly.

The problem is that newAPIHadoopFile() uses addInputPath() to add the file path 
into NewHadoopRDD. See Ln 928-931, master branch:
val job = new NewHadoopJob(conf)
NewFileInputFormat.addInputPath(job, new Path(path))
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)

Changing addInputPath(job, new Path(path)) to addInputPaths(job, path) will 
resolve this issue.
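For clarity, the proposed one-line change shown in context (the same lines quoted above;
the commented-out line is the current behavior):

{code}
val job = new NewHadoopJob(conf)
// Before: a single Path is registered, so "a.txt,b.txt" is treated as one file name.
// NewFileInputFormat.addInputPath(job, new Path(path))
// After: addInputPaths splits on commas, matching the old hadoopFile() behavior.
NewFileInputFormat.addInputPaths(job, path)
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)
{code}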






[jira] [Updated] (SPARK-7155) SparkContext's newAPIHadoopFile does not support comma-separated list of files, but the other API hadoopFile does.

2015-04-26 Thread Yong Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yong Tang updated SPARK-7155:
-
Description: 
SparkContext's newAPIHadoopFile() does not support a comma-separated list of 
files. For example, the following:

sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", 
classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

will throw

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: file:/root/file1.txt,/root/file2.txt

However, the other API, hadoopFile(), is able to process a comma-separated list 
of files correctly.

In addition, since sc.textFile() uses hadoopFile(), it is also able to process 
a comma-separated list of files correctly.

The problem is that newAPIHadoopFile() uses addInputPath() to add the file path 
into NewHadoopRDD. See Ln 928-931, master branch:
val job = new NewHadoopJob(conf)
NewFileInputFormat.addInputPath(job, new Path(path))
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)

Changing addInputPath(job, new Path(path)) to addInputPaths(job, path) will 
resolve this issue.

  was:
SparkContext's newAPIHadoopFile() does not support comma-separated list of 
files. For example, the following:

sc.newAPIHadoopFile(/root/file1.txt,/root/file2.txt, 
classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

will throw

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: file:/root/file1.txt,/root/file2.txt

However, the other API hadoopFile() is able to process comma-separated list of 
files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also 
able to process comma-separated list of files correctly.

The problem is that newAPIHadoopFile() use addInputPath() to add the file path 
into NewHadoopRDD. See Ln 928-931, master branch:
val job = new NewHadoopJob(conf)
NewFileInputFormat.addInputPath(job, new Path(path))
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)

Change addInputPath(job, new Path(path)) to addInputPaths(job, path) will 
resolve this issue.


 SparkContext's newAPIHadoopFile does not support comma-separated list of 
 files, but the other API hadoopFile does.
 --

 Key: SPARK-7155
 URL: https://issues.apache.org/jira/browse/SPARK-7155
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
 Environment: Ubuntu 14.04
Reporter: Yong Tang

 SparkContext's newAPIHadoopFile() does not support comma-separated list of 
 files. For example, the following:
 sc.newAPIHadoopFile(/root/file1.txt,/root/file2.txt, 
 classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
 will throw
 org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
 not exist: file:/root/file1.txt,/root/file2.txt
 However, the other API hadoopFile() is able to process comma-separated list 
 of files correctly.
 In addition, since sc.textFile() uses hadoopFile(), it is also able to 
 process comma-separated list of files correctly.
 The problem is that newAPIHadoopFile() use addInputPath() to add the file 
 path into NewHadoopRDD. See Ln 928-931, master branch:
 val job = new NewHadoopJob(conf)
 NewFileInputFormat.addInputPath(job, new Path(path))
 val updatedConf = job.getConfiguration
 new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)
 Change addInputPath(job, new Path(path)) to addInputPaths(job, path) will 
 resolve this issue.


