[jira] [Commented] (SPARK-19975) Add map_keys and map_values functions to Python

2017-03-16 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929342#comment-15929342
 ] 

Yong Tang commented on SPARK-19975:
---

Created a PR for that:
https://github.com/apache/spark/pull/17328

Please take a look.
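
For context, the SQL functions themselves already exist; the PR only adds the Python wrappers. A minimal Scala sketch of the existing behaviour (the column name "m" is made up for illustration, and a spark-shell style SparkSession named `spark` is assumed):
{code}
// the SQL map_keys/map_values functions can already be exercised via selectExpr
import spark.implicits._

val df = Seq(Map("a" -> 1, "b" -> 2)).toDF("m")
df.selectExpr("map_keys(m)", "map_values(m)").show()
// one row: the array of keys [a, b] and the array of values [1, 2]
{code}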

> Add map_keys and map_values functions  to Python 
> -
>
> Key: SPARK-19975
> URL: https://issues.apache.org/jira/browse/SPARK-19975
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Maciej Bryński
>
> We have `map_keys` and `map_values` functions in SQL.
> There are no Python equivalents for them.






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-24 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883417#comment-15883417
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the reminder. I will take a look and update the PR as 
needed. (I am on the road until next Wednesday; I will try to get to it by the 
end of next week.)

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-17 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245084#comment-15245084
 ] 

Yong Tang edited comment on SPARK-14409 at 4/18/16 3:12 AM:


Thanks [~mlnick] for the references. I will take a look at those and see what 
we can do with them.

By the way, I initially thought I could simply call RankingMetrics in 
mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am 
having some trouble with the implementation because the input to
{code}
  @Since("2.0.0")
  override def evaluate(dataset: Dataset[_]): Double
{code}
in `RankingEvaluator` is not easy to convert into the 
`RDD[(Array[T], Array[T])]` that RankingMetrics expects.

I will do some further investigation. If I cannot find an easy way to convert 
the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the 
methods directly in the new ml.evaluation (instead of calling mllib.evaluation).


was (Author: yongtang):
Thanks [~mlnick] for the references. I will take a look at those and see what 
we can do with them.

By the way, I initially thought I could simply call RankingMetrics in 
mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am 
having some trouble with the implementation because the input to
`
  @Since("2.0.0")
  override def evaluate(dataset: Dataset[_]): Double
`
in `RankingEvaluator` is not easy to convert into the 
`RDD[(Array[T], Array[T])]` that RankingMetrics expects.

I will do some further investigation. If I cannot find an easy way to convert 
the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the 
methods directly in the new ml.evaluation (instead of calling mllib.evaluation).

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-17 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245084#comment-15245084
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the references. I will take a look at those and see what 
we can do with them.

By the way, I initially thought I could simply call RankingMetrics in 
mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am 
having some trouble with the implementation because the input to
`
  @Since("2.0.0")
  override def evaluate(dataset: Dataset[_]): Double
`
in `RankingEvaluator` is not easy to convert into the 
`RDD[(Array[T], Array[T])]` that RankingMetrics expects.

I will do some further investigation. If I cannot find an easy way to convert 
the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the 
methods directly in the new ml.evaluation (instead of calling mllib.evaluation).
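
To make the conversion issue concrete, here is a minimal sketch (not the actual PR code) of wrapping mllib's RankingMetrics from a DataFrame-based evaluate(); the column names, element type, and method name are assumptions:
{code}
// a minimal sketch: bridging a Dataset-based evaluate() to mllib's RankingMetrics
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.Dataset

def evaluateRanking(dataset: Dataset[_], predictionCol: String, labelCol: String): Double = {
  // each row is expected to hold two arrays of item ids (predicted ranking, relevant items)
  val predictionAndLabels = dataset
    .select(predictionCol, labelCol)
    .rdd
    .map(row => (row.getSeq[Double](0).toArray, row.getSeq[Double](1).toArray))
  new RankingMetrics(predictionAndLabels).meanAveragePrecision
}
{code}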

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-13 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240607#comment-15240607
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] [~josephkb]. Yes, I think wrapping RankingMetrics could be the 
first step, and reimplementing all of the RankingEvaluator methods in ML using 
DataFrames would be a good follow-up. I will work on the reimplementation in 
several follow-up PRs.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14565) RandomForest should use parseInt and parseDouble for feature subset size instead of regexes

2016-04-13 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239396#comment-15239396
 ] 

Yong Tang commented on SPARK-14565:
---

Hi [~mengxr], I created a pull request to change the regexes to parseInt and 
parseDouble:
https://github.com/apache/spark/pull/12360
Please let me know if there are any issues.
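
A hedged sketch of the idea (the helper name and set of named strategies are made up, not Spark's actual code): validate the feature-subset-size parameter with Try-based numeric parsing instead of a regex:
{code}
import scala.util.Try

def isValidFeatureSubsetStrategy(value: String): Boolean = {
  val named = Set("auto", "all", "onethird", "sqrt", "log2")
  named.contains(value.toLowerCase) ||
    Try(value.toInt).toOption.exists(_ > 0) ||                     // integer count of features
    Try(value.toDouble).toOption.exists(d => d > 0.0 && d <= 1.0)  // fraction of features
}
{code}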

> RandomForest should use parseInt and parseDouble for feature subset size 
> instead of regexes
> ---
>
> Key: SPARK-14565
> URL: https://issues.apache.org/jira/browse/SPARK-14565
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yong Tang
>
> Using regex is not robust and hard to maintain.






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-12 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238462#comment-15238462
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the review. I was planning to add MRR (mean reciprocal 
rank) to RankingMetrics and then wrap that as a first step. But if you think it 
makes sense, I can reimplement it from scratch. Please let me know which way 
would be better and I will move forward with it. Thanks.
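
For reference, a small, self-contained sketch of the MRR metric itself (plain Scala collections; this is illustrative, not the RankingMetrics API):
{code}
// mean reciprocal rank over (predicted ranking, relevant-item set) pairs
def meanReciprocalRank(data: Seq[(Array[String], Set[String])]): Double = {
  val reciprocalRanks = data.map { case (predicted, relevant) =>
    val firstHit = predicted.indexWhere(relevant.contains)
    if (firstHit >= 0) 1.0 / (firstHit + 1) else 0.0
  }
  reciprocalRanks.sum / reciprocalRanks.size
}
{code}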

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14531) Flume streaming should respect maxRate (and backpressure)

2016-04-12 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237864#comment-15237864
 ] 

Yong Tang commented on SPARK-14531:
---

Thanks [~hermansc], I noticed that my previous understanding may not have been 
correct. Let me do some further investigation and see what I can do to update 
the pull request.

> Flume streaming should respect maxRate (and backpressure)
> -
>
> Key: SPARK-14531
> URL: https://issues.apache.org/jira/browse/SPARK-14531
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Herman Schistad
>Priority: Minor
>
> As far as I can understand, FlumeUtils.createPollingStream(...) ignores key 
> Spark Streaming configuration options such as:
> spark.streaming.backpressure.enabled
> spark.streaming.receiver.maxRate
> ...
> I might be mistaken, but since DEFAULT_POLLING_BATCH_SIZE is hard-coded to 
> 1000 in the source, I presume it should use the settings above instead.
> *Relevant code:*  
> https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala






[jira] [Commented] (SPARK-14546) Scale Wrapper in SparkR

2016-04-11 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236556#comment-15236556
 ] 

Yong Tang commented on SPARK-14546:
---

[~aloknsingh] I can work on this one if no one has started yet. Thanks.

> Scale Wrapper in SparkR
> ---
>
> Key: SPARK-14546
> URL: https://issues.apache.org/jira/browse/SPARK-14546
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Alok Singh
>
> ML has StandardScaler, which seems to be very commonly used.
> This JIRA is to implement the SparkR wrapper for it.
> Here is the R scale command
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-11 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236541#comment-15236541
 ] 

Yong Tang commented on SPARK-14409:
---

[~mlnick] [~josephkb] I added a short doc on Google Drive with commenting enabled:
https://docs.google.com/document/d/1YEvf5eEm2vRcALJs39yICWmUx6xFW5j8DvXFWbRbStE/edit?usp=sharing
Please let me know if you have any feedback. Thanks.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14531) Flume streaming should respect maxRate (and backpressure)

2016-04-11 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235797#comment-15235797
 ] 

Yong Tang commented on SPARK-14531:
---

Hi [~hermansc], I created a pull request 
https://github.com/apache/spark/pull/12305 to allow maxRate to be passed in the 
conf. Is this what you had in mind? Thanks.
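
The intent, roughly (names are illustrative, not the exact code in the PR): cap the Flume polling batch size by spark.streaming.receiver.maxRate when that setting is present:
{code}
import org.apache.spark.SparkConf

def pollingBatchSize(conf: SparkConf, defaultBatchSize: Int = 1000): Int = {
  val maxRate = conf.getInt("spark.streaming.receiver.maxRate", 0)
  if (maxRate > 0) math.min(defaultBatchSize, maxRate) else defaultBatchSize
}
{code}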

> Flume streaming should respect maxRate (and backpressure)
> -
>
> Key: SPARK-14531
> URL: https://issues.apache.org/jira/browse/SPARK-14531
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Herman Schistad
>Priority: Minor
>
> As far as I can understand, FlumeUtils.createPollingStream(...) ignores key 
> Spark Streaming configuration options such as:
> spark.streaming.backpressure.enabled
> spark.streaming.receiver.maxRate
> ...
> I might be mistaken, but since DEFAULT_POLLING_BATCH_SIZE is hard-coded to 
> 1000 in the source, I presume it should use the settings above instead.
> *Relevant code:*  
> https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-06 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228741#comment-15228741
 ] 

Yong Tang commented on SPARK-14409:
---

[~josephkb] Sure. Let me do some investigation into how other libraries handle 
this, and then I will add a design doc.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-05 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227743#comment-15227743
 ] 

Yong Tang commented on SPARK-14409:
---

[~mlnick] I can work on this issue if no one has started yet. Thanks.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14368) Support python.spark.worker.memory with upper-case unit

2016-04-04 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225471#comment-15225471
 ] 

Yong Tang commented on SPARK-14368:
---

That looks like an easy fix. Will create a pull request shortly.
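
The gist of the fix, illustrated here in Scala (the actual change is in PySpark's memory-string parsing; the function below is only a sketch): treat the unit suffix case-insensitively:
{code}
// illustrative only: parse a JVM-style memory string ("512m", "2G", ...) into megabytes,
// accepting both lower- and upper-case unit suffixes
def parseMemoryToMb(s: String): Long = {
  val units = Map('k' -> 1.0 / 1024, 'm' -> 1.0, 'g' -> 1024.0, 't' -> 1024.0 * 1024)
  val unit = s.last.toLower
  require(units.contains(unit), s"invalid memory string: $s")
  (s.dropRight(1).toDouble * units(unit)).toLong
}
{code}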

> Support python.spark.worker.memory with upper-case unit
> ---
>
> Key: SPARK-14368
> URL: https://issues.apache.org/jira/browse/SPARK-14368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Masahiro TANAKA
>Priority: Trivial
>
> According to the 
> [documentation|https://spark.apache.org/docs/latest/configuration.html], 
> spark.python.worker.memory uses the same format as a JVM memory string, but 
> upper-case units are not accepted in `spark.python.worker.memory`. They should 
> be.






[jira] [Commented] (SPARK-14335) Describe function command returns wrong output because some of built-in functions are not in function registry.

2016-04-02 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222761#comment-15222761
 ] 

Yong Tang commented on SPARK-14335:
---

I can work on this one. Will provide a pull request shortly.

> Describe function command returns wrong output because some of built-in 
> functions are not in function registry.
> ---
>
> Key: SPARK-14335
> URL: https://issues.apache.org/jira/browse/SPARK-14335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> {code}
> %sql describe function `and`
> Function: and
> Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd
> Usage: a and b - Logical and
> {code}
> The output still shows Hive's function because {{and}} is not in our 
> FunctionRegistry. Here is a list of such commands:
> {code}
> -
> !
> !=
> *
> /
> &
> %
> ^
> +
> <
> <=
> <=>
> <>
> =
> ==
> >
> >=
> |
> ~
> and
> between
> case
> in
> like
> not
> or
> rlike
> when
> {code}






[jira] [Commented] (SPARK-14301) Java examples code merge and clean up

2016-03-31 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220240#comment-15220240
 ] 

Yong Tang commented on SPARK-14301:
---

Hi [~yinxusen], would you mind if I work on this issue? Thanks.

> Java examples code merge and clean up
> -
>
> Key: SPARK-14301
> URL: https://issues.apache.org/jira/browse/SPARK-14301
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in java/examples/mllib and java/examples/ml:
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * Unsure code duplications of java/ml, double check
> ** JavaDeveloperApiExample.java
> ** JavaSimpleParamsExample.java
> ** JavaSimpleTextClassificationPipeline.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * Unsure code duplications of java/mllib, double check
> ** JavaALS.java
> ** JavaFPGrowthExample.java
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks.






[jira] [Commented] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib

2016-03-30 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219289#comment-15219289
 ] 

Yong Tang commented on SPARK-14238:
---

Hi [~mlnick], I created a pull request:
https://github.com/apache/spark/pull/12079
Let me know if you find any issues or there is anything I need to change. 
Thanks.
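
For reference, the Scala side of ML already exposes this toggle, and the PR mirrors it in PySpark. A small sketch (the input/output column names are made up for illustration):
{code}
import org.apache.spark.ml.feature.HashingTF

// with binary = true, term counts are clipped to 0/1 (presence/absence)
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)
  .setBinary(true)
{code}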

> Add binary toggle Param to PySpark HashingTF in ML & MLlib
> --
>
> Key: SPARK-14238
> URL: https://issues.apache.org/jira/browse/SPARK-14238
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>







[jira] [Commented] (SPARK-14183) UnsupportedOperationException: empty.max when fitting CrossValidator model

2016-03-29 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216206#comment-15216206
 ] 

Yong Tang commented on SPARK-14183:
---

With the latest master build the message changes to:
{code}
scala> val model = cv.fit(df)
16/03/29 15:47:29 WARN LogisticRegression: All labels are zero and 
fitIntercept=true, so the coefficients will be zeros and the intercept will be 
negative infinity; as a result, training is not needed.
java.lang.IllegalArgumentException: requirement failed: Nothing has been added 
to this summarizer.
  at scala.Predef$.require(Predef.scala:219)
  at 
org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.normL2(MultivariateOnlineSummarizer.scala:270)
  at 
org.apache.spark.mllib.evaluation.RegressionMetrics.SSerr$lzycompute(RegressionMetrics.scala:65)
  at 
org.apache.spark.mllib.evaluation.RegressionMetrics.SSerr(RegressionMetrics.scala:65)
  at 
org.apache.spark.mllib.evaluation.RegressionMetrics.meanSquaredError(RegressionMetrics.scala:99)
  at 
org.apache.spark.mllib.evaluation.RegressionMetrics.rootMeanSquaredError(RegressionMetrics.scala:108)
  at 
org.apache.spark.ml.evaluation.RegressionEvaluator.evaluate(RegressionEvaluator.scala:94)
  at 
org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:110)
  at 
org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:100)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:100)
  ... 55 elided

{code}

That looks much better than {noformat}UnsupportedOperationException: 
empty.max{noformat}.

> UnsupportedOperationException: empty.max when fitting CrossValidator model 
> ---
>
> Key: SPARK-14183
> URL: https://issues.apache.org/jira/browse/SPARK-14183
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The following code produces {{java.lang.UnsupportedOperationException: 
> empty.max}}, but it should've said what might've caused that or how to fix it.
> The exception:
> {code}
> scala> val model = cv.fit(df)
> java.lang.UnsupportedOperationException: empty.max
>   at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:227)
>   at scala.collection.AbstractTraversable.max(Traversable.scala:104)
>   at 
> org.apache.spark.ml.classification.MultiClassSummarizer.numClasses(LogisticRegression.scala:739)
>   at 
> org.apache.spark.ml.classification.MultiClassSummarizer.histogram(LogisticRegression.scala:743)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:288)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:261)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:160)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at org.apache.spark.ml.Estimator.fit(Estimator.scala:59)
>   at org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
>   at org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.ml.Estimator.fit(Estimator.scala:78)
>   at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:110)
>   at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:105)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:105)
>   ... 55 elided
> {code}
> The code:
> {code}
> import org.apache.spark.ml.tuning._
> val cv = new CrossValidator
> import org.apache.spark.mllib.linalg._
> val features = Vectors.sparse(3, Array(1), Array(1d))
> val df = Seq((0, "hello world", 0d, features)).toDF("id", "text", "label", 
> "features")
> import org.apache.spark.ml.classification._
> val lr = new LogisticRegression()
> import org.apache.spark.ml.evaluation.Regressi

[jira] [Commented] (SPARK-14238) Add binary toggle Param to PySpark HashingTF in ML & MLlib

2016-03-29 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216121#comment-15216121
 ] 

Yong Tang commented on SPARK-14238:
---

Hi [~mlnick], do you mind if I work on this issue?

> Add binary toggle Param to PySpark HashingTF in ML & MLlib
> --
>
> Key: SPARK-14238
> URL: https://issues.apache.org/jira/browse/SPARK-14238
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>







[jira] [Commented] (SPARK-3724) RandomForest: More options for feature subset size

2016-03-27 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213667#comment-15213667
 ] 

Yong Tang commented on SPARK-3724:
--

[~josephkb] I just created a pull request for this issue. Let me know if there 
are any problems.
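
A rough sketch of the proposal (the helper name is made up): interpret an unrecognized featureSubsetStrategy value as either a fraction in (0, 1) or an absolute feature count:
{code}
def numFeaturesPerNode(strategy: String, totalFeatures: Int): Int =
  strategy.toLowerCase match {
    case "all"      => totalFeatures
    case "sqrt"     => math.sqrt(totalFeatures).ceil.toInt
    case "log2"     => math.max(1, (math.log(totalFeatures) / math.log(2)).ceil.toInt)
    case "onethird" => math.max(1, (totalFeatures / 3.0).ceil.toInt)
    case numeric    =>
      val value = numeric.toDouble
      if (value >= 1.0) math.min(value.toInt, totalFeatures)   // integer count
      else math.max(1, (value * totalFeatures).ceil.toInt)     // fraction of features
  }
{code}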


> RandomForest: More options for feature subset size
> --
>
> Key: SPARK-3724
> URL: https://issues.apache.org/jira/browse/SPARK-3724
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> RandomForest currently supports using a few values for the number of features 
> to sample per node: all, sqrt, log2, etc.  It should support any given value 
> (to allow model search).
> Proposal: If the parameter for specifying the number of features per node is 
> not recognized (as “all”, “sqrt”, etc.), then it will be parsed as a 
> numerical value.  The value should be either (a) a real value in [0,1] 
> specifying the fraction of features in each subset or (b) an integer value 
> specifying the number of features in each subset.






[jira] [Commented] (SPARK-3724) RandomForest: More options for feature subset size

2016-03-27 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213517#comment-15213517
 ] 

Yong Tang commented on SPARK-3724:
--

I can work on this one. Will create a PR soon.

> RandomForest: More options for feature subset size
> --
>
> Key: SPARK-3724
> URL: https://issues.apache.org/jira/browse/SPARK-3724
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> RandomForest currently supports using a few values for the number of features 
> to sample per node: all, sqrt, log2, etc.  It should support any given value 
> (to allow model search).
> Proposal: If the parameter for specifying the number of features per node is 
> not recognized (as “all”, “sqrt”, etc.), then it will be parsed as a 
> numerical value.  The value should be either (a) a real value in [0,1] 
> specifying the fraction of features in each subset or (b) an integer value 
> specifying the number of features in each subset.






[jira] [Commented] (SPARK-8954) Building Docker Images Fails in 1.4 branch

2015-07-10 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622780#comment-14622780
 ] 

Yong Tang commented on SPARK-8954:
--

This failure can be fixed by removing
RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > 
/etc/apt/sources.list
from the Dockerfile.

A pull request 
https://github.com/apache/spark/pull/7346
has been created. It also removes /var/lib/apt/lists/* at the end of the 
package install (in Docker), which saves roughly 30 MB of Docker image size.

> Building Docker Images Fails in 1.4 branch
> --
>
> Key: SPARK-8954
> URL: https://issues.apache.org/jira/browse/SPARK-8954
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
> Environment: Docker
>Reporter: Pradeep Bashyal
>
> Docker build on branch 1.4 fails when installing the jdk. It expects 
> tzdata-java as a dependency but adding that to the apt-get install list 
> doesn't help.
> ~/S/s/d/spark-test git:branch-1.4 ❯❯❯ docker build -t spark-test-base base/   
>   ◼
> Sending build context to Docker daemon 3.072 kB
> Sending build context to Docker daemon
> Step 0 : FROM ubuntu:precise
>  ---> 78cef618c77e
> Step 1 : RUN echo "deb http://archive.ubuntu.com/ubuntu precise main 
> universe" > /etc/apt/sources.list
>  ---> Using cache
>  ---> 2017472bec85
> Step 2 : RUN apt-get update
>  ---> Using cache
>  ---> 86b8911ead16
> Step 3 : RUN apt-get install -y less openjdk-7-jre-headless net-tools 
> vim-tiny sudo openssh-server
>  ---> Running in dc8197a0ea31
> Reading package lists...
> Building dependency tree...
> Reading state information...
> Some packages could not be installed. This may mean that you have
> requested an impossible situation or if you are using the unstable
> distribution that some required packages have not yet been created
> or been moved out of Incoming.
> The following information may help to resolve the situation:
> The following packages have unmet dependencies:
>  openjdk-7-jre-headless : Depends: tzdata-java but it is not going to be 
> installed
> E: Unable to correct problems, you have held broken packages.
> INFO[0004] The command [/bin/sh -c apt-get install -y less 
> openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server] returned a 
> non-zero code: 100






[jira] [Updated] (SPARK-7155) SparkContext's newAPIHadoopFile does not support comma-separated list of files, but the other API hadoopFile does.

2015-04-26 Thread Yong Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yong Tang updated SPARK-7155:
-
Description: 
SparkContext's newAPIHadoopFile() does not support comma-separated list of 
files. For example, the following:

sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", 
classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

will throw

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: file:/root/file1.txt,/root/file2.txt

However, the other API hadoopFile() is able to process comma-separated list of 
files correctly.

In addition, since sc.textFile() uses hadoopFile(), it is also able to process 
comma-separated list of files correctly.

The problem is that newAPIHadoopFile() uses addInputPath() to add the file path 
into NewHadoopRDD. See lines 928-931 on the master branch:
val job = new NewHadoopJob(conf)
NewFileInputFormat.addInputPath(job, new Path(path))
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)

Changing addInputPath(job, new Path(path)) to addInputPaths(job, path) will 
resolve this issue.

  was:
SparkContext's newAPIHadoopFile() does not support comma-separated list of 
files. For example, the following:

sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", 
classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

will throw

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: file:/root/file1.txt,/root/file2.txt

However, the other API hadoopFile() is able to process comma-separated list of 
files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also 
able to process comma-separated list of files correctly.

The problem is that newAPIHadoopFile() uses addInputPath() to add the file path 
into NewHadoopRDD. See lines 928-931 on the master branch:
val job = new NewHadoopJob(conf)
NewFileInputFormat.addInputPath(job, new Path(path))
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)

Changing addInputPath(job, new Path(path)) to addInputPaths(job, path) will 
resolve this issue.


> SparkContext's newAPIHadoopFile does not support comma-separated list of 
> files, but the other API hadoopFile does.
> --
>
> Key: SPARK-7155
> URL: https://issues.apache.org/jira/browse/SPARK-7155
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
> Environment: Ubuntu 14.04
>Reporter: Yong Tang
>
> SparkContext's newAPIHadoopFile() does not support comma-separated list of 
> files. For example, the following:
> sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", 
> classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
> will throw
> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
> not exist: file:/root/file1.txt,/root/file2.txt
> However, the other API hadoopFile() is able to process comma-separated list 
> of files correctly.
> In addition, since sc.textFile() uses hadoopFile(), it is also able to 
> process comma-separated list of files correctly.
> The problem is that newAPIHadoopFile() uses addInputPath() to add the file 
> path into NewHadoopRDD. See lines 928-931 on the master branch:
> val job = new NewHadoopJob(conf)
> NewFileInputFormat.addInputPath(job, new Path(path))
> val updatedConf = job.getConfiguration
> new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)
> Changing addInputPath(job, new Path(path)) to addInputPaths(job, path) will 
> resolve this issue.






[jira] [Created] (SPARK-7155) SparkContext's newAPIHadoopFile does not support comma-separated list of files, but the other API hadoopFile does.

2015-04-26 Thread Yong Tang (JIRA)
Yong Tang created SPARK-7155:


 Summary: SparkContext's newAPIHadoopFile does not support 
comma-separated list of files, but the other API hadoopFile does.
 Key: SPARK-7155
 URL: https://issues.apache.org/jira/browse/SPARK-7155
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
 Environment: Ubuntu 14.04
Reporter: Yong Tang


SparkContext's newAPIHadoopFile() does not support comma-separated list of 
files. For example, the following:

sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", 
classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

will throw

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: file:/root/file1.txt,/root/file2.txt

However, the other API hadoopFile() is able to process comma-separated list of 
files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also 
able to process comma-separated list of files correctly.

The problem is that newAPIHadoopFile() uses addInputPath() to add the file path 
into NewHadoopRDD. See lines 928-931 on the master branch:
val job = new NewHadoopJob(conf)
NewFileInputFormat.addInputPath(job, new Path(path))
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)

Changing addInputPath(job, new Path(path)) to addInputPaths(job, path) will 
resolve this issue.


