[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2018-04-25 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18998
  
Closing this since there has been no activity for quite a long time. Thanks all the same.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-04-25 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/18998





[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-04-06 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18998#discussion_r179903481
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
-    // TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion.
-    val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+    val hashUDF = udf { (terms: Seq[_]) =>
+      val ids = terms.map { term =>
--- End diff --

@sethah Hi, thanks to all for your reviews and comments. However, since there has been no activity for quite a long time, would it be a good idea to close the PR?





[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

2018-03-03 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18736
  
Closed, as #18998 has been waiting too long.





[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2018-03-03 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/18736





[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2018-03-03 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17503
  
Closed since its duplicate PR #20632 has been merged.





[GitHub] spark pull request #17503: [SPARK-3159][MLlib] Check for reducible DecisionT...

2018-03-03 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/17503





[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-03-03 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18998#discussion_r172015693
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
-    // TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion.
-    val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+    val hashUDF = udf { (terms: Seq[_]) =>
+      val ids = terms.map { term =>
--- End diff --

By the way, since the PR doesn't change the function's behavior, it should be correct if all tests pass, right?





[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-03-03 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18998#discussion_r172015644
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
-    // TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion.
-    val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+    val hashUDF = udf { (terms: Seq[_]) =>
+      val ids = terms.map { term =>
--- End diff --

Sorry for the delay. Yes, the original implementation uses a mutable map in the mllib package, which is different from this PR. However, I think the PR is clearer, and the logic is the same. Which one was faster in your test?





[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-02-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18998#discussion_r171412547
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
-    // TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion.
-    val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+    val hashUDF = udf { (terms: Seq[_]) =>
+      val ids = terms.map { term =>
--- End diff --

Sorry, I can't remember all the details exactly, since the PR is quite old. If my memory is correct, the ML implementation keeps consistent with the old MLlib one, as the "TODO" above says:
> Make the hashingTF.transform natively in ml framework to avoid extra conversion.





[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-09 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/19666
  
Thank you, @WeichenXu123. You could also use the condition "includes the first bin" to filter the left splits; perhaps that is better.





[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-09 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/19666
  
In fact, I'm not sure whether the idea is right, so don't hesitate to correct me. I assume the algorithm requires O(N^2) complexity.





[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-09 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/19666
  
Hi, I wrote a demo in Python. I'd be happy if it's useful.

For N bins, say `[x_1, x_2, ..., x_N]`, every split either contains `x_1` or not, so we can choose the half of the splits that don't contain `x_1` as the left splits.
If I understand it correctly, the left splits are then exactly all combinations of the remaining bins, `[x_2, x_3, ..., x_N]`. The problem can be solved with a [backtracking algorithm](https://en.wikipedia.org/wiki/Backtracking).

Please correct me if I'm wrong. Thanks very much.

```python
#!/usr/bin/env python

def gen_splits(bins):
    if len(bins) == 1:
        return bins
    results = []
    partial_res = []
    gen_splits_iter(1, bins, partial_res, results)
    return results


def gen_splits_iter(dep, bins, partial_res, results):
    if partial_res:
        left_splits = partial_res[:]
        right_splits = [x for x in bins if x not in left_splits]
        results.append("left: {:20}, right: {}".format(str(left_splits),
                                                       right_splits))

    for m in range(dep, len(bins)):
        partial_res.append(bins[m])
        gen_splits_iter(m + 1, bins, partial_res, results)
        partial_res.pop()


if __name__ == "__main__":
    print("first example:")
    bins = ["a", "b", "c"]
    print("bins: {}\n-".format(bins))
    splits = gen_splits(bins)
    for s in splits:
        print(s)

    print("\n\n=")
    print("second example:")
    bins = ["a", "b", "c", "d", "e"]
    print("bins: {}\n-".format(bins))
    splits = gen_splits(bins)
    for s in splits:
        print(s)
```

logs:
```bash
~/Downloads ❯❯❯ python test.py
first example:
bins: ['a', 'b', 'c']
-
left: ['b']   , right: ['a', 'c']
left: ['b', 'c']  , right: ['a']
left: ['c']   , right: ['a', 'b']


=
second example:
bins: ['a', 'b', 'c', 'd', 'e']
-
left: ['b']   , right: ['a', 'c', 'd', 'e']
left: ['b', 'c']  , right: ['a', 'd', 'e']
left: ['b', 'c', 'd'] , right: ['a', 'e']
left: ['b', 'c', 'd', 'e'], right: ['a']
left: ['b', 'c', 'e'] , right: ['a', 'd']
left: ['b', 'd']  , right: ['a', 'c', 'e']
left: ['b', 'd', 'e'] , right: ['a', 'c']
left: ['b', 'e']  , right: ['a', 'c', 'd']
left: ['c']   , right: ['a', 'b', 'd', 'e']
left: ['c', 'd']  , right: ['a', 'b', 'e']
left: ['c', 'd', 'e'] , right: ['a', 'b']
left: ['c', 'e']  , right: ['a', 'b', 'd']
left: ['d']   , right: ['a', 'b', 'c', 'e']
left: ['d', 'e']  , right: ['a', 'b', 'c']
left: ['e']   , right: ['a', 'b', 'c', 'd']
```
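As a cross-check of the demo above, the same left splits can be generated directly with `itertools.combinations` (a sketch; `gen_left_splits` is a hypothetical name): N bins yield 2^(N-1) - 1 candidate splits, namely all non-empty subsets of the bins after the first.

```python
from itertools import combinations

def gen_left_splits(bins):
    # Left splits are exactly the non-empty subsets of bins[1:], since
    # every split either contains the first bin or does not.
    rest = bins[1:]
    return [list(c)
            for r in range(1, len(rest) + 1)
            for c in combinations(rest, r)]

# 5 bins -> 2**4 - 1 == 15 left splits, matching the count in the log above.
assert len(gen_left_splits(["a", "b", "c", "d", "e"])) == 15
```

The enumeration order differs from the DFS demo, but the set of splits is identical.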





[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-07 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/19666
  
I believe that unordered features will benefit a lot from this idea; however, I have two questions:
1. I'm a little confused by line 964 in `traverseUnorderedSplits`. Is it a backtracking algorithm?
```scala
dfs(binIndex + 1, combNumber, stats) 
```
2. `traverseUnorderedSplits` succeeds in decoupling the abstraction from its implementation. Cool! However, I wonder whether we could write a simpler function, say, by removing `seqOp`, `finalizer`, and `T` and collecting all the logic in one place?

Anyway, thanks for your good work, @WeichenXu123.





[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...

2017-11-07 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19666#discussion_r149313427
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -631,6 +614,42 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     val expected = Map(0 -> 1.0 / 3.0, 2 -> 2.0 / 3.0)
     assert(mapToVec(map.toMap) ~== mapToVec(expected) relTol 0.01)
   }
+
+  test("traverseUnorderedSplits") {
+
--- End diff --

Since `traverseUnorderedSplits` is a private method, I wonder whether we can check the unordered splits on DecisionTree directly? For example, create a tiny dataset and generate a shallow tree (depth = 1?). I know such a test case is difficult (maybe impossible) to design; however, it would focus on behavior instead of implementation.





[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2017-11-07 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18998
  
ping @yanboliang





[GitHub] spark pull request #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in ...

2017-09-26 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/17383





[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-26 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17383
  
Hi, since the work has been stalled for a long time, I have reviewed it myself.

After careful review: because SparseVector is a compressed-sparse-row-style format, the only benefit of the PR would be data storage, at the cost of performance. And for tree methods, it is uncommon to handle features of very high dimensionality. Hence, the trade-off does not satisfy me.

I prefer [SPARK-3717: DecisionTree, RandomForest: Partition by feature](https://issues.apache.org/jira/browse/SPARK-3717) as an alternative, which should benefit both performance and storage if I understand correctly. So the PR is closed. Thanks to everyone for the reviews and comments.





[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-09-26 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17503
  
Hi, @WeichenXu123.

As @srowen said, the benefit of this would be speed at predict time or model storage. Hence I'm not sure whether a benchmark is really needed for the PR.





[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-08 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17383
  
Sure, @WeichenXu123, perhaps in one or two weeks; is that OK?

By the way, I think using a sparse representation can only reduce memory usage, and at the cost of compute performance. Hence, it cannot speed up calculation for low-dimensional data. In my opinion, the only advantage of a sparse representation is for very high-dimensional data.
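A rough back-of-the-envelope sketch of that trade-off (assuming 8-byte values and 4-byte integer indices, as in Spark's `SparseVector`; the function names are mine):

```python
def dense_bytes(n, value_bytes=8):
    # A dense vector stores one value per dimension.
    return n * value_bytes

def sparse_bytes(nnz, value_bytes=8, index_bytes=4):
    # A sparse vector stores an index and a value per nonzero entry.
    return nnz * (value_bytes + index_bytes)

# Sparse storage only wins when density is below
# value_bytes / (value_bytes + index_bytes), about 2/3 here, and every
# element access pays an extra indirection on top of that.
n = 1_000_000
assert sparse_bytes(nnz=n // 2) < dense_bytes(n)       # 50% dense: sparse smaller
assert sparse_bytes(nnz=9 * n // 10) > dense_bytes(n)  # 90% dense: sparse larger
```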





[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-06 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17383
  
Thank you for the comment.

Very good questions; at least for me, the answer to both is no. In most cases, we feed dense raw data into tree models. However, if high dimensionality is required, I believe a sparse representation is necessary.





[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2017-08-29 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18998
  
Hi, @yanboliang and @srowen. Thanks for your comments. For HashingTF, I agree that it is necessary to migrate its implementation so that new methods can be added easily.

Any review will be appreciated; I'm glad to get feedback on the work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark pull request #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for...

2017-08-26 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/18120





[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-08-26 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17503
  
Hi, @yanboliang . Do you have time to take a look at first? Thanks very 
much.





[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2017-08-26 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18998
  
#18736 depends on this PR. To keep the `setXXX` methods consistent between Scala and Python, as @yanboliang suggested, it is better to migrate the HashingTF implementation from mllib to ml first, and then add the `indexOf` function. See the discussion in #18736 for more.





[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2017-08-26 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18998
  
Hi, @srowen . Could you take a look at the PR? Thanks.





[GitHub] spark issue #18998: [SPARK-21748][ML] Migrate the implementation of HashingT...

2017-08-18 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18998
  
cc @yanboliang @WeichenXu123 who I believe are interested in this PR. Could 
you take a look please?





[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2017-08-18 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18998

[SPARK-21748][ML] Migrate the implementation of HashingTF from MLlib to ML

## What changes were proposed in this pull request?

Migrate the implementation of HashingTF from MLlib to ML.

## How was this patch tested?

+ [ ] Pass all unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/migrate_hash_tf_to_ml

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18998.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18998


commit 6a25da2cc96a744aaf047280ac414e5ff4515434
Author: Yan Facai (颜发才) 
Date:   2017-08-19T02:24:14Z

ENH: implement HashingTF in ml







[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

2017-08-15 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18736
  
Sure, @yanboliang. Thanks for your suggestion. I'll work on it later, perhaps next week. Is that OK?





[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

2017-08-13 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18736
  
@yanboliang Hi, Yanbo. Could you help review the PR? Thanks.





[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-10 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18736#discussion_r132618802
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -80,20 +82,31 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
 
   /** @group setParam */
   @Since("1.2.0")
-  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
+  def setNumFeatures(value: Int): this.type = {
+    hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
--- End diff --

@WeichenXu123 How about adding a `setNumFeatures` method to mllib?





[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-09 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18736#discussion_r132131171
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
@@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("2.0.0")
   def setBinary(value: Boolean): this.type = set(binary, value)
 
+  /**
+   * Returns the index of the input term.
+   */
+  @Since("2.3.0")
+  def indexOf(term: Any): Int = {
+    if (hashingTF != null) {
+      hashingTF.indexOf(term)
--- End diff --

Thanks for your suggestion; I'd like to modify the code later.
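For reference, the mllib `indexOf` that the diff above delegates to boils down to a hash followed by a non-negative modulo. A minimal Python sketch (Python's built-in `hash` stands in for the JVM hash code; 2^18 is HashingTF's default `numFeatures`):

```python
def non_negative_mod(code, mod):
    # Mimics the JVM, where `%` truncates toward zero and can return a
    # negative remainder for a negative hash code; shift into [0, mod).
    raw = (abs(code) % mod) * (1 if code >= 0 else -1)
    return raw if raw >= 0 else raw + mod

def index_of(term, num_features=1 << 18):
    # Hypothetical sketch of the delegation: hash the term, then map the
    # (possibly negative) code into [0, num_features).
    return non_negative_mod(hash(term), num_features)
```

The same term always maps to the same index within a process, which is what the `indexOf` API exposes to users.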





[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-08-08 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/18763





[GitHub] spark issue #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest should suppo...

2017-08-08 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18763
  
Thanks, all.





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-08 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
Sure, thanks, @yanboliang !





[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-08-08 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/18764





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-07 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
Thanks, @yanboliang @gatorsmile 





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-06 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
@SparkQA Take a test, please.





[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-08-05 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18763#discussion_r131529768
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1423,7 +1425,18 @@ def _fit(self, dataset):
 
         numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
 
-        multiclassLabeled = dataset.select(labelCol, featuresCol)
+        weightCol = None
+        if (self.isDefined(self.weightCol) and self.getWeightCol()):
+            if isinstance(classifier, HasWeightCol):
+                weightCol = self.getWeightCol()
+            else:
+                warnings.warn("weightCol is ignored, "
+                              "as it is not supported by {} now.".format(classifier))
--- End diff --

Modified as suggested, thanks very much!





[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-08-05 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18764#discussion_r131529693
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1344,7 +1346,19 @@ def _fit(self, dataset):
 
         numClasses = int(dataset.agg({labelCol: "max"}).head()["max("+labelCol+")"]) + 1
 
-        multiclassLabeled = dataset.select(labelCol, featuresCol)
+        weightCol = None
+        if (self.isDefined(self.weightCol) and self.getWeightCol()):
+            if isinstance(classifier, HasWeightCol):
+                weightCol = self.getWeightCol()
+            else:
+                warnings.warn("weightCol is ignored, "
+                              "as it is not supported by {} now.".format(
--- End diff --

Thank you very much for help, @yanboliang !





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-04 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
@yanboliang Thanks, yanbo. I am not familiar with Python 2.6, which is too outdated.





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-04 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
Test failures in pyspark.ml.tests with python2.6, but I don't have the 
environment.





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-03 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
Test failures in pyspark.ml.tests with python2.6, but I don't have the 
environment. 





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-08-01 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
Jenkins, test this please.





[GitHub] spark issue #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest should suppo...

2017-07-31 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18764
  
Thanks, @yanboliang . Could you give a hand, @srowen ?





[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18763#discussion_r130213337
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -158,7 +158,7 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
   }
 
   test("SPARK-21306: OneVsRest should support setWeightCol") {
-val dataset2 = dataset.withColumn("weight", lit(1))
+val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

OK. Could you test and merge the PR into branch 2.1? 





[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch 2.1, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18763#discussion_r130202540
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -158,7 +158,7 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
   }
 
   test("SPARK-21306: OneVsRest should support setWeightCol") {
-val dataset2 = dataset.withColumn("weight", lit(1))
+val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

@yanboliang @srowen  My mistake. It would be better to use the double value 1.0 in master / branch-2.2 / branch-2.3 in the same way.





[GitHub] spark pull request #18763: [SPARK-21306][ML] For branch-2.1, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18763#discussion_r130200461
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -157,6 +157,16 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
 assert(output.schema.fieldNames.toSet === Set("label", "features", 
"prediction"))
   }
 
+  test("SPARK-21306: OneVsRest should support setWeightCol") {
+val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

use double value, `lit(1.0)`





[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18764#discussion_r130200379
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -143,6 +144,16 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
 assert(output.schema.fieldNames.toSet === Set("label", "features", 
"prediction"))
   }
 
+  test("SPARK-21306: OneVsRest should support setWeightCol") {
+val dataset2 = dataset.withColumn("weight", lit(1.0))
--- End diff --

use double value, `lit(1.0)`





[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-07-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18764#discussion_r130200288
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -33,6 +33,7 @@ import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.mllib.util.TestingUtils._
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions._
--- End diff --

add missing import for branch 2.0





[GitHub] spark pull request #18764: [SPARK-21306][ML] For branch 2.0, OneVsRest shoul...

2017-07-28 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18764

[SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol

The PR is related to #18554, and is modified for branch 2.0.

## What changes were proposed in this pull request?

Add a `setWeightCol` method for OneVsRest.

`weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait.

## How was this patch tested?

+ [x] add a unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark 
BUG/branch-2.0_OneVsRest_support_setWeightCol

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18764.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18764


commit 0c60c28ca6a3da84401f4947a21ace6980be08fa
Author: Yan Facai (颜发才) 
Date:   2017-07-28T02:10:35Z

[SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol







[GitHub] spark pull request #18763: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-28 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18763

 [SPARK-21306][ML] OneVsRest should support setWeightCol for branch-2.1

The PR is related to #18554, and is modified for branch 2.1.

## What changes were proposed in this pull request?

Add a `setWeightCol` method for OneVsRest.

`weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait.

## How was this patch tested?

+ [x] add a unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark 
BUG/branch-2.1_OneVsRest_support_setWeightCol

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18763.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18763


commit ddb83da765f38546a96a63b26478f594601c7e2b
Author: Yan Facai (颜发才) 
Date:   2017-07-28T02:10:35Z

[SPARK-21306][ML] OneVsRest should support setWeightCol







[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-26 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18554#discussion_r129562237
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1255,6 +1255,24 @@ def test_output_columns(self):
 output = model.transform(df)
        self.assertEqual(output.columns, ["label", "features", "prediction"])
 
+def test_support_for_weightCol(self):
+df = self.spark.createDataFrame([(0.0, Vectors.dense(1.0, 0.8), 1.0),
+ (1.0, Vectors.sparse(2, [], []), 1.0),
+ (2.0, Vectors.dense(0.5, 0.5), 1.0)],
+["label", "features", "weight"])
+# classifier inherits hasWeightCol
+lr = LogisticRegression(maxIter=5, regParam=0.01)
+ovr = OneVsRest(classifier=lr, weightCol="weight")
+self.assertIsNotNone(ovr.fit(df))
+ovr2 = OneVsRest(classifier=lr).setWeightCol("weight")
--- End diff --

cleaned.





[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-26 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18554#discussion_r129562189
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1517,20 +1517,22 @@ class OneVsRest(Estimator, OneVsRestParams, MLReadable, MLWritable):
 
 @keyword_only
 def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
- classifier=None):
+ weightCol=None, classifier=None):
--- End diff --

moved.





[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should support setWeightCol

2017-07-25 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18554
  
ping @holdenk @yanboliang 





[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-07-25 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18736

[SPARK-21481][ML] Add indexOf method for ml.feature.HashingTF

## What changes were proposed in this pull request?

Add an indexOf method for ml.feature.HashingTF.

The PR is a hotfix that stores the hashingTF instance so indexOf can reuse it.
I believe it is better to migrate the native implementation from mllib to ml. I can work on it, but I don't know whether or not to open a new JIRA for the migration.

## How was this patch tested?

+ [x] add a test case.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/add_indexOf_for_ml

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18736.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18736


commit f08861d64390f0eb4d7e34de13d78db67e046457
Author: Yan Facai (颜发才) 
Date:   2017-07-26T05:45:37Z

hotfix: store hashingTF to reuse indexOf

commit 174a2d57ebeccba10b8f1435970cde69f5829c79
Author: Yan Facai (颜发才) 
Date:   2017-07-26T06:14:37Z

TST: add test case
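For context, the `indexOf` being added is the hashing trick: a term's feature index is its hash reduced modulo `numFeatures`. A minimal sketch in plain Python — note that MLlib's HashingTF uses its own hash functions (the legacy Scala `hashCode` or MurmurHash3), not Python's salted `hash()`, so the indices below are illustrative only:

```python
def non_negative_mod(x, mod):
    # Mirrors Spark's Utils.nonNegativeMod: the result always lies in [0, mod).
    # (In Python, `%` with a positive modulus is already non-negative; the
    # helper exists because Scala's `%` can return negative values.)
    raw = x % mod
    return raw + mod if raw < 0 else raw

def index_of(term, num_features=1 << 18):
    # 2^18 is HashingTF's default numFeatures.
    return non_negative_mod(hash(term), num_features)

idx = index_of("spark")
assert 0 <= idx < (1 << 18)
```

Storing the fitted `hashingTF` (as the hotfix does) lets `transform` and `indexOf` share the same hash function and `numFeatures`, so looked-up indices match the produced vectors.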







[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should support setWei...

2017-07-18 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18554#discussion_r128158473
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1255,6 +1255,17 @@ def test_output_columns(self):
 output = model.transform(df)
 self.assertEqual(output.columns, ["label", "features", 
"prediction"])
 
+def test_support_for_weightCol(self):
--- End diff --

Sure. Use DecisionTreeClassifier to test. 





[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-18 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18305#discussion_r127972263
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") (
 val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
 val bcFeaturesStd = instances.context.broadcast(featuresStd)
-val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept),
+  multinomial = isMultinomial)(_)
+val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) {
+  featuresStd(j / numCoefficientSets)
+} else {
+  0.0
+}
+
+val regularization = if (regParamL2 != 0.0) {
+  val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets
--- End diff --

Thanks for clarifying.





[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-17 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18305#discussion_r127874833
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala ---
@@ -32,40 +34,45 @@ private[ml] trait DifferentiableRegularization[T] extends DiffFunction[T] {
 }
 
 /**
- * A Breeze diff function for computing the L2 regularized loss and gradient of an array of
+ * A Breeze diff function for computing the L2 regularized loss and gradient of a vector of
  * coefficients.
  *
  * @param regParam The magnitude of the regularization.
  * @param shouldApply A function (Int => Boolean) indicating whether a given index should have
  *regularization applied to it.
- * @param featuresStd Option indicating whether the regularization should be scaled by the standard
- *deviation of the features.
+ * @param applyFeaturesStd Option for a function which maps coefficient index (column major) to the
+ * feature standard deviation. If `None`, no standardization is applied.
  */
 private[ml] class L2Regularization(
-val regParam: Double,
+override val regParam: Double,
 shouldApply: Int => Boolean,
-featuresStd: Option[Array[Double]]) extends DifferentiableRegularization[Array[Double]] {
+applyFeaturesStd: Option[Int => Double]) extends DifferentiableRegularization[Vector] {
 
-  override def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
-var sum = 0.0
-val gradient = new Array[Double](coefficients.length)
-coefficients.indices.filter(shouldApply).foreach { j =>
-  val coef = coefficients(j)
-  featuresStd match {
-case Some(stds) =>
-  val std = stds(j)
-  if (std != 0.0) {
-val temp = coef / (std * std)
-sum += coef * temp
-gradient(j) = regParam * temp
-  } else {
-0.0
+  override def calculate(coefficients: Vector): (Double, Vector) = {
+coefficients match {
+  case dv: DenseVector =>
+var sum = 0.0
+val gradient = new Array[Double](dv.size)
+dv.values.indices.filter(shouldApply).foreach { j =>
+  val coef = coefficients(j)
+  applyFeaturesStd match {
+case Some(getStd) =>
+  val std = getStd(j)
+  if (std != 0.0) {
+val temp = coef / (std * std)
+sum += coef * temp
+gradient(j) = regParam * temp
+  } else {
+0.0
+  }
+case None =>
+  sum += coef * coef
+  gradient(j) = coef * regParam
--- End diff --

Trivial: to match `regParam * temp` above, how about using `regParam * coef`?
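The two branches under discussion compute, per coefficient, either `coef / (std * std)` (standardized) or plain `coef`. A simplified plain-Python rendering, assuming — as in the surrounding Scala — that the returned loss is `0.5 * regParam * sum`:

```python
def l2_loss_and_gradient(coefficients, reg_param, should_apply, get_std=None):
    """L2 regularization with optional feature standardization.

    get_std: optional index -> feature standard deviation; None means
    no standardization (the `case None` branch of the diff).
    """
    total = 0.0
    gradient = [0.0] * len(coefficients)
    for j, coef in enumerate(coefficients):
        if not should_apply(j):
            continue
        if get_std is not None:
            std = get_std(j)
            if std != 0.0:  # constant features contribute nothing
                temp = coef / (std * std)
                total += coef * temp
                gradient[j] = reg_param * temp
        else:
            total += coef * coef
            gradient[j] = reg_param * coef  # i.e. `regParam * coef`, as suggested
    return 0.5 * reg_param * total, gradient

loss, grad = l2_loss_and_gradient([1.0, 2.0], 0.1, lambda j: True)
# loss = 0.5 * 0.1 * (1 + 4) = 0.25, grad = [0.1, 0.2]
```

Writing the unstandardized gradient as `reg_param * coef` (rather than `coef * reg_param`) keeps the two branches visually parallel, which is all the review comment is asking for.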





[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...

2017-07-17 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18305#discussion_r127873828
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") (
 val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
 val bcFeaturesStd = instances.context.broadcast(featuresStd)
-val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept),
+  multinomial = isMultinomial)(_)
+val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) {
+  featuresStd(j / numCoefficientSets)
+} else {
+  0.0
+}
+
+val regularization = if (regParamL2 != 0.0) {
+  val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets
--- End diff --

It seems that the `regularization` includes the `intercept`, right?

However, the comment in [LogisticRegression.scala: 
1903L](https://github.com/apache/spark/pull/18305/files#diff-3734f1689cb8a80b07974eb93de0795dL1903)
 is:
> // We do not apply regularization to the intercepts
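A quick way to check this: with the column-major layout, the first `numFeatures * numCoefficientSets` entries are feature coefficients and any trailing entries are the intercepts, so a predicate of the form `idx < numFeatures * numCoefficientSets` leaves the intercepts out of the regularization — consistent with the quoted comment. A small sketch with hypothetical toy sizes:

```python
num_features = 3
num_coefficient_sets = 2  # e.g. a multinomial model with 2 classes

# Column-major flattened layout: feature coefficients first, then one
# intercept per coefficient set when fitIntercept is enabled.
n_feature_coefs = num_features * num_coefficient_sets
total_len = n_feature_coefs + num_coefficient_sets

def should_apply(idx):
    return 0 <= idx < n_feature_coefs

mask = [should_apply(i) for i in range(total_len)]
print(mask)  # the trailing intercept slots are False
assert mask == [True] * 6 + [False] * 2
```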






[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should cache weightCo...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18554#discussion_r126863072
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -317,7 +318,12 @@ final class OneVsRest @Since("1.4.0") (
 val numClasses = MetadataUtils.getNumClasses(labelSchema).fold(computeNumClasses())(identity)
 instr.logNumClasses(numClasses)
 
-val multiclassLabeled = dataset.select($(labelCol), $(featuresCol))
+val multiclassLabeled = getClassifier match {
+  // SPARK-21306: cache weightCol if necessary
+  case c: HasWeightCol if c.isDefined(c.weightCol) && c.getWeightCol.nonEmpty =>
+dataset.select($(labelCol), $(featuresCol), c.getWeightCol)
+  case _ => dataset.select($(labelCol), $(featuresCol))
+}
--- End diff --

Hi, @yanboliang . As @MLnick said, not all classifiers inherit HasWeightCol, so it might cause confusion.

In my opinion, `setWeightCol` is an attribute owned by one specific classifier, like `setProbabilityCol` and `setRawPredictionCol` for Logistic Regression. So I'd suggest that the user configure the classifier itself, rather than OneVsRest. Is that OK?





[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18582#discussion_r126646511
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -36,7 +36,8 @@ import org.apache.spark.util.collection.OpenHashMap
 /**
  * Base trait for [[StringIndexer]] and [[StringIndexerModel]].
  */
-private[feature] trait StringIndexerBase extends Params with HasInputCol with HasOutputCol {
+private[feature] trait StringIndexerBase extends Params with HasHandleInvalid with HasInputCol
--- End diff --

Perhaps it is better to break the line at `extends`.





[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18582#discussion_r126645714
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -3058,26 +3035,37 @@ class RFormula(JavaEstimator, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaM
"RFormula drops the same category as R when encoding strings.",
typeConverter=TypeConverters.toString)
 
+handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid entries. " +
--- End diff --

If I understand correctly, `handleInvalid` has already been declared in the `HasHandleInvalid` interface, right?





[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18582#discussion_r126643882
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -460,16 +460,16 @@ object LinearRegression extends DefaultParamsReadable[LinearRegression] {
   val MAX_FEATURES_FOR_NORMAL_SOLVER: Int = WeightedLeastSquares.MAX_NUM_FEATURES
 
   /** String name for "auto". */
-  private[regression] val AUTO = "auto"
+  private[regression] val Auto = "auto"
--- End diff --

Is `solver` related to `handleInvalid`? It seems a little confusing.





[GitHub] spark pull request #18582: [SPARK-18619][ML] Make QuantileDiscretizer/Bucket...

2017-07-11 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18582#discussion_r126642928
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -36,7 +36,8 @@ import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
  */
 @Since("1.4.0")
 final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)
-  extends Model[Bucketizer] with HasInputCol with HasOutputCol with DefaultParamsWritable {
+  extends Model[Bucketizer] with HasHandleInvalid with HasInputCol with HasOutputCol
+with DefaultParamsWritable {
--- End diff --

How about aligning `with` with `extends`?





[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...

2017-07-10 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18554
  
@srowen @yanboliang Could you help review the PR? Thanks.





[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...

2017-07-06 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18554
  
I'm not familiar with R, so I used grep to search for "OneVsRest" and found 
nothing. Hence it seems that nothing needs to be done for the R part.





[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-06 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18523
  
@SparkQA test again, please.





[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18556#discussion_r126050849
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
@@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends 
TextBasedFileFormat with DataSour
   files: Seq[FileStatus]): Option[StructType] = {
 val libSVMOptions = new LibSVMOptions(options)
 val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
-  // Infers number of features if the user doesn't specify (a valid) 
one.
-  val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
-  val path = if (dataFiles.length == 1) {
-dataFiles.head.getPath.toUri.toString
-  } else if (dataFiles.isEmpty) {
+  if (files.isEmpty) {
 throw new IOException("No input path specified for libsvm data")
--- End diff --

In my opinion, it is safe and necessary to check whether the parameter is 
valid in advance. Perhaps `IllegalArgumentException` is more suitable.





[GitHub] spark issue #18554: [SPARK-21306][ML] OneVsRest should cache weightCol if ne...

2017-07-06 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18554
  
@lins05 thanks, reasonable suggestion, I will fix it later.





[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18556#discussion_r126026388
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
@@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends 
TextBasedFileFormat with DataSour
   files: Seq[FileStatus]): Option[StructType] = {
 val libSVMOptions = new LibSVMOptions(options)
 val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
-  // Infers number of features if the user doesn't specify (a valid) 
one.
-  val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
--- End diff --

Do you mean that the _SUCCESS file on HDFS, if it exists, will be safely 
ignored by Spark?





[GitHub] spark pull request #18556: [SPARK-21326][SPARK-21066][ML] Use TextFileFormat...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18556#discussion_r126023986
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
@@ -89,18 +93,17 @@ private[libsvm] class LibSVMFileFormat extends 
TextBasedFileFormat with DataSour
   files: Seq[FileStatus]): Option[StructType] = {
 val libSVMOptions = new LibSVMOptions(options)
 val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
-  // Infers number of features if the user doesn't specify (a valid) 
one.
-  val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
-  val path = if (dataFiles.length == 1) {
-dataFiles.head.getPath.toUri.toString
-  } else if (dataFiles.isEmpty) {
+  if (files.isEmpty) {
 throw new IOException("No input path specified for libsvm data")
-  } else {
-throw new IOException("Multiple input paths are not supported for 
libsvm data.")
+  } else if (files.length > 1) {
+logWarning(
+  "Multiple input paths detected, determining the number of 
features by going " +
--- End diff --

Good. How about warning only when `numFeatures` is missing? I mean, remove the 
condition `files.length > 1`:

```scala
if (files.isEmpty) { }
else { ... logWarning ...}
```





[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-06 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18523#discussion_r125860650
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") 
(@Since("1.4.0") override val uid: String)
   override def transformSchema(schema: StructType): StructType = {
 val inputColNames = $(inputCols)
 val outputColName = $(outputCol)
-val inputDataTypes = inputColNames.map(name => schema(name).dataType)
-inputDataTypes.foreach {
-  case _: NumericType | BooleanType =>
-  case t if t.isInstanceOf[VectorUDT] =>
-  case other =>
-throw new IllegalArgumentException(s"Data type $other is not 
supported.")
+val incorrectColumns = inputColNames.flatMap { name =>
+  schema(name).dataType match {
+case _: NumericType | BooleanType => None
+case t if t.isInstanceOf[VectorUDT] => None
+case other => Some(s"Data type $other of column $name is not 
supported.")
+  }
+}
+if (!incorrectColumns.isEmpty) {
--- End diff --

Yes, modified.





[GitHub] spark pull request #18554: [SPARK-21306][ML] OneVsRest should cache weightCo...

2017-07-06 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18554

[SPARK-21306][ML] OneVsRest should cache weightCol if necessary

## What changes were proposed in this pull request?

Cache weightCol if the classifier inherits the HasWeightCol trait.
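
A minimal sketch of the idea, using stand-in traits rather than Spark's real 
HasWeightCol and classifier hierarchy:

```scala
// Stand-in traits to illustrate the check; Spark's real HasWeightCol
// and classifier types differ from these hypothetical ones.
trait HasWeightCol { def weightCol: String }
class PlainClassifier
class WeightedClassifier extends PlainClassifier with HasWeightCol {
  val weightCol = "weight"
}

object WeightColCheck extends App {
  // Keep the weight column only when the classifier supports weighting,
  // mirroring the PR's idea of caching weightCol conditionally.
  def columnsToKeep(clf: PlainClassifier): Seq[String] = clf match {
    case w: HasWeightCol => Seq("label", "features", w.weightCol)
    case _               => Seq("label", "features")
  }

  assert(columnsToKeep(new WeightedClassifier).contains("weight"))
  assert(!columnsToKeep(new PlainClassifier).contains("weight"))
  println(columnsToKeep(new WeightedClassifier).mkString(", "))
}
```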

## How was this patch tested?

+ [x] add a unit test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark BUG/oneVsRest_missing_weightCol

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18554.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18554


commit f1785959e04e462877fd74e6c767e47e0fb207d6
Author: Yan Facai (颜发才) 
Date:   2017-07-06T09:38:01Z

BUG: cache weightCol if necessary

commit a3e6e967a23f06dbd50aee08c34b07055c5646ee
Author: Yan Facai (颜发才) 
Date:   2017-07-06T09:55:12Z

TST: add unit test







[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-05 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18523
  
@SparkQA Jenkins, run tests again, please.





[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-05 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18523#discussion_r125763918
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -113,12 +113,15 @@ class VectorAssembler @Since("1.4.0") 
(@Since("1.4.0") override val uid: String)
   override def transformSchema(schema: StructType): StructType = {
 val inputColNames = $(inputCols)
 val outputColName = $(outputCol)
-val inputDataTypes = inputColNames.map(name => schema(name).dataType)
-inputDataTypes.foreach {
-  case _: NumericType | BooleanType =>
-  case t if t.isInstanceOf[VectorUDT] =>
-  case other =>
-throw new IllegalArgumentException(s"Data type $other is not 
supported.")
+val incorrectColumns = inputColNames.flatMap { name =>
--- End diff --

Hi, yanbo. Using `flatMap` is aimed at unpacking values from Option. Is that OK?
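
As a hedged illustration of the point (column names and types below are 
hypothetical, not from the PR), `flatMap` drops each `None` and unwraps each 
`Some`, so only the error messages survive:

```scala
// Sketch of the flatMap-over-Option pattern discussed above.
object FlatMapOptionExample extends App {
  val columns = Seq(
    ("age", "IntegerType"), ("name", "StringType"), ("tags", "ArrayType"))
  val supported = Set("IntegerType", "DoubleType", "BooleanType")

  // flatMap unpacks each Option: Some(msg) survives, None is dropped.
  val incorrectColumns = columns.flatMap { case (name, dataType) =>
    if (supported.contains(dataType)) None
    else Some(s"Data type $dataType of column $name is not supported.")
  }

  assert(incorrectColumns.length == 2)
  incorrectColumns.foreach(println)
}
```

Compared with `map` followed by `filter`, this collects the messages in a 
single pass.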





[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-05 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18523
  
I don't know how to write a unit test for the PR. Is it necessary?





[GitHub] spark issue #18523: [SPARK-21285][ML] VectorAssembler reports the column nam...

2017-07-05 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18523
  
Good idea!





[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-05 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18523#discussion_r125584572
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") 
(@Since("1.4.0") override val uid: String)
   override def transformSchema(schema: StructType): StructType = {
 val inputColNames = $(inputCols)
 val outputColName = $(outputCol)
-val inputDataTypes = inputColNames.map(name => schema(name).dataType)
+val inputDataTypes = inputColNames.map(name => (name, 
schema(name).dataType))
 inputDataTypes.foreach {
-  case _: NumericType | BooleanType =>
-  case t if t.isInstanceOf[VectorUDT] =>
-  case other =>
-throw new IllegalArgumentException(s"Data type $other is not 
supported.")
+  case (_, _: NumericType | BooleanType) =>
+  case (_, t) if t.isInstanceOf[VectorUDT] =>
--- End diff --

Hey, it throws a not-found error if `t: VectorUDT` is used. Do you 
have any idea why?





[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

2017-07-05 Thread facaiy
GitHub user facaiy reopened a pull request:

https://github.com/apache/spark/pull/17383

[SPARK-3165][MLlib][WIP] DecisionTree does not use sparsity in data

## What changes were proposed in this pull request?

DecisionTree should take advantage of sparse feature vectors. Aggregation 
over training data could handle the empty/zero-valued data elements more 
efficiently.


## How was this patch tested?

Modifying the inner implementation won't change the behavior of the DecisionTree
module, so all existing unit tests should pass.

Some performance benchmarks are perhaps needed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/use_sparsity_in_decision_tree

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17383


commit d2eea0645110b3bcc6c0b905bc55e43e0af9debb
Author: 颜发才(Yan Facai) 
Date:   2017-03-22T05:45:58Z

CLN: use Vector to implement binnedFeatures in TreePoint

commit 9ce6b813beffb9d58e7b2907425a1262610256be
Author: 颜发才(Yan Facai) 
Date:   2017-03-22T09:15:30Z

BUG: fix for incompatible argument of predictImpl method

commit 37f05f9b0386acc8bea048e72aff2b9c37ca4ca6
Author: 颜发才(Yan Facai) 
Date:   2017-03-22T09:18:04Z

CLN: create sparse vector when converting to TreePoint

commit c9664ce6c94b98cbc76253817e637d9a968e4bd6
Author: 颜发才(Yan Facai) 
Date:   2017-03-22T09:21:59Z

CLN: change Array to Vector in TreePoint when created

commit d6ef9e512ea4a58db2dccf3e7cca95f9e8b0df8f
Author: 颜发才(Yan Facai) 
Date:   2017-03-23T02:12:22Z

PREP: use Vector[Int] to store binnedFeature

commit 59eb779a9d4f711e7b28d31d579cc49e3d3cc370
Author: 颜发才(Yan Facai) 
Date:   2017-03-23T03:50:14Z

CLN: change binnedFeatures from def to val

commit 9cbe577b408e987f3026d01316f5a7f2d4c5cfb2
Author: 颜发才(Yan Facai) 
Date:   2017-03-28T00:57:42Z

CLN: use filter to select non-zero bits

commit b5b0dc8683b6e2d7d274aa8d39932dec61e6193d
Author: 颜发才(Yan Facai) 
Date:   2017-03-28T01:03:55Z

BUG: fix, compile fails

commit cf7e3d8e03f73df725336d0d5a9dd6cc16e7bf95
Author: Yan Facai (颜发才) 
Date:   2017-07-05T05:42:09Z

Merge branch 'master' into ENH/use_sparsity_in_decision_tree

commit 032d50d8c8a851671ba2754cec817d0f6e9ae70f
Author: Yan Facai (颜发才) 
Date:   2017-07-05T06:20:38Z

CLN: use BSV in predictImpl

commit 257ddf773eb47499962d6cc57fd1323324dd4ab8
Author: Yan Facai (颜发才) 
Date:   2017-07-05T06:42:24Z

ENH: create subclass TreeSparsePoint

commit 8a919735f9474283d263df78feb2e176f66917f3
Author: Yan Facai (颜发才) 
Date:   2017-07-05T06:58:54Z

ENH: use TreeDensePoint when numFeatures < 1







[GitHub] spark pull request #17383: [SPARK-3165][MLlib][WIP] DecisionTree does not us...

2017-07-04 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/17383





[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-04 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18523#discussion_r125539040
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") 
(@Since("1.4.0") override val uid: String)
   override def transformSchema(schema: StructType): StructType = {
 val inputColNames = $(inputCols)
 val outputColName = $(outputCol)
-val inputDataTypes = inputColNames.map(name => schema(name).dataType)
+val inputDataTypes = inputColNames.map(name => (name, 
schema(name).dataType))
 inputDataTypes.foreach {
-  case _: NumericType | BooleanType =>
-  case t if t.isInstanceOf[VectorUDT] =>
-  case other =>
-throw new IllegalArgumentException(s"Data type $other is not 
supported.")
+  case (_, _: NumericType | BooleanType) =>
+  case (_, _: VectorUDT) =>
+  case (name, other) =>
+throw new IllegalArgumentException(s"Data type $other of column 
$name is not supported.")
--- End diff --

Fine, modified.





[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-07-04 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17503
  
@jkbradley Could you find time to review the PR? I believe it will be a 
small improvement for predict. Thanks.





[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-03 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18523#discussion_r125398010
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -113,12 +113,12 @@ class VectorAssembler @Since("1.4.0") 
(@Since("1.4.0") override val uid: String)
   override def transformSchema(schema: StructType): StructType = {
 val inputColNames = $(inputCols)
 val outputColName = $(outputCol)
-val inputDataTypes = inputColNames.map(name => schema(name).dataType)
+val inputDataTypes = inputColNames.map(name => (name, 
schema(name).dataType))
 inputDataTypes.foreach {
-  case _: NumericType | BooleanType =>
-  case t if t.isInstanceOf[VectorUDT] =>
-  case other =>
-throw new IllegalArgumentException(s"Data type $other is not 
supported.")
+  case (_, _: NumericType | BooleanType) =>
+  case (_, t) if t.isInstanceOf[VectorUDT] =>
--- End diff --

Good suggestion. Fixed.





[GitHub] spark pull request #18523: [SPARK-21285][ML] VectorAssembler reports the col...

2017-07-03 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18523

[SPARK-21285][ML] VectorAssembler reports the column name of unsupported 
data type

## What changes were proposed in this pull request?
Add the column name to the exception raised for an unsupported data type.

## How was this patch tested?
[ ] pass all tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/vectorassembler_add_col

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18523.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18523


commit 95dbf6c7b287d0010af9de377ff6b93dec760808
Author: Yan Facai (颜发才) 
Date:   2017-07-04T05:42:07Z

ENH: report the name of missing column







[GitHub] spark issue #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-22 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18288
  
Yes.

Example code:
```scala
val df = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")
```





[GitHub] spark issue #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-22 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18288
  
You might be mistaken. The aim of the code here is to encourage the user to 
specify `numFeatures` in any case, rather than to encourage the user to use 
only one file.





[GitHub] spark pull request #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-22 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18288#discussion_r123474003
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
@@ -91,12 +91,10 @@ private[libsvm] class LibSVMFileFormat extends 
TextBasedFileFormat with DataSour
 val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
   // Infers number of features if the user doesn't specify (a valid) 
one.
   val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
-  val path = if (dataFiles.length == 1) {
-dataFiles.head.getPath.toUri.toString
-  } else if (dataFiles.isEmpty) {
+  val path = if (dataFiles.isEmpty) {
 throw new IOException("No input path specified for libsvm data")
   } else {
-throw new IOException("Multiple input paths are not supported for 
libsvm data.")
+dataFiles.map(_.getPath.toUri.toString).mkString(",")
--- End diff --

@srowen I see your point: it's OK to infer `numFeatures` for either one file or 
more. However, it is costly (and perhaps not worthwhile).

Here is a compromise:
+ For ONLY one file, `numFeatures` is optional.
+ For multiple files, `numFeatures` must be given. It doesn't sample one of 
many files.

To be honest, it is inconvenient but somewhat reasonable. Hence, I prefer 
to revise the exception message.





[GitHub] spark issue #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-22 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18288
  
In my opinion, `numFeatures` is vital for sparse data.

Say our feature space is actually 1000-dim, while in a small training set the 
maximum observed index is 990. It is dangerous (or wrong) to train a 990-dim 
model, as it might fail on the coming test data.

Hence, in most cases, `numFeatures` should be given by the user.
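
A minimal sketch of the risk (hypothetical data, not Spark's actual inference 
code): inferring the dimension from a sample's maximum index underestimates the 
true feature space:

```scala
// Three libsvm-style lines "label index:value ...". Suppose the true
// feature space is 1000-dim, but the largest index in this small
// sample is only 990.
object NumFeaturesExample extends App {
  val lines = Seq("1 3:0.5 990:1.2", "0 17:0.1", "1 42:2.0 100:0.3")

  // Infer the dimension as the maximum 1-based feature index seen.
  val inferred = lines.flatMap { line =>
    line.split(" ").drop(1).map(_.split(":")(0).toInt)
  }.max

  assert(inferred == 990)  // not the true 1000
  println(s"Inferred numFeatures = $inferred; " +
    "a test vector with index 1000 would not fit this model.")
}
```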





[GitHub] spark pull request #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-20 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18288#discussion_r122909919
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
@@ -91,12 +91,10 @@ private[libsvm] class LibSVMFileFormat extends 
TextBasedFileFormat with DataSour
 val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
   // Infers number of features if the user doesn't specify (a valid) 
one.
   val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
-  val path = if (dataFiles.length == 1) {
-dataFiles.head.getPath.toUri.toString
-  } else if (dataFiles.isEmpty) {
+  val path = if (dataFiles.isEmpty) {
 throw new IOException("No input path specified for libsvm data")
   } else {
-throw new IOException("Multiple input paths are not supported for 
libsvm data.")
+dataFiles.map(_.getPath.toUri.toString).mkString(",")
--- End diff --

The exception is misleading. It should suggest that the user specify 
`numFeatures`, rather than warn about an unsupported operation.





[GitHub] spark pull request #18288: [SPARK-21066][ML] LibSVM load just one input file

2017-06-20 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18288#discussion_r122908140
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
@@ -91,12 +91,10 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
 val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
   // Infers number of features if the user doesn't specify (a valid) one.
   val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
-  val path = if (dataFiles.length == 1) {
-dataFiles.head.getPath.toUri.toString
-  } else if (dataFiles.isEmpty) {
+  val path = if (dataFiles.isEmpty) {
 throw new IOException("No input path specified for libsvm data")
   } else {
-throw new IOException("Multiple input paths are not supported for libsvm data.")
+dataFiles.map(_.getPath.toUri.toString).mkString(",")
--- End diff --

For efficiency, inferring `numFeatures` is limited to scanning only one file.
If multiple files are needed, please set the `numFeatures` parameter manually.
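To illustrate what inferring `numFeatures` involves, here is a hedged Python sketch (the `infer_num_features` helper is hypothetical, not Spark's MLUtils code): it scans libsvm-formatted lines and takes the largest feature index seen, which is why scanning every file can be expensive.

```python
def infer_num_features(lines):
    """Each libsvm line is "label idx:value idx:value ...";
    the feature count is the largest index observed."""
    max_idx = 0
    for line in lines:
        for token in line.split()[1:]:
            idx = int(token.split(":", 1)[0])
            max_idx = max(max_idx, idx)
    return max_idx

print(infer_num_features(["1 1:0.5 3:1.0", "0 2:0.3"]))  # prints 3
```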





[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

2017-05-31 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18139#discussion_r119346663
  
--- Diff: python/pyspark/sql/types.py ---
@@ -187,8 +187,11 @@ def needConversion(self):
 
 def toInternal(self, dt):
 if dt is not None:
-seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
-   else time.mktime(dt.timetuple()))
+seconds = calendar.timegm(dt.utctimetuple())
+# Avoiding the invalid range of years (100-1899) for mktime in Python < 3
+if dt.year > 1899 or dt.year <= 100:
+seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
+   else time.mktime(dt.timetuple()))
--- End diff --

How about adding an `else` branch to eliminate the repeated computation of `seconds`?
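The suggested restructuring can be sketched as follows (a hedged illustration using the standard `calendar` and `time` modules; not the final PySpark code):

```python
import calendar
import time
from datetime import datetime, timezone

def to_internal_seconds(dt):
    """Convert a datetime to epoch seconds, with the suggested
    `else` branch so `seconds` is computed exactly once."""
    if dt.year > 1899 or dt.year <= 100:
        # mktime is usable for this year range: tz-aware datetimes
        # go through timegm (UTC), naive ones through mktime (local).
        seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                   else time.mktime(dt.timetuple()))
    else:
        # Years 101-1899: avoid mktime, which can fail here on Python < 3.
        seconds = calendar.timegm(dt.utctimetuple())
    return seconds
```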





[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

2017-05-30 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18139#discussion_r119263608
  
--- Diff: python/pyspark/sql/types.py ---
@@ -187,8 +187,11 @@ def needConversion(self):
 
 def toInternal(self, dt):
 if dt is not None:
-seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
-   else time.mktime(dt.timetuple()))
+seconds = calendar.timegm(dt.utctimetuple())
--- End diff --

Perhaps it needs to be confirmed by @davies?





[GitHub] spark issue #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemb...

2017-05-30 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18120
  
Thanks, @BryanCutler.
It seems that #17849 copies `Params` from `Estimator` to `Model` automatically, which is pretty useful. However, the getter methods are still missing and need to be added manually, as in this PR.

If the model could inherit both the params and their getter methods, that would be better in my opinion.

Anyway, it is fine whether or not the PR is merged, as it is trivial.






[GitHub] spark issue #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemb...

2017-05-27 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18120
  
Hi, @keypointt. It's a feature of Python: a doctest serves as both documentation and a unit test.
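As a generic illustration of that point (`increment` is a made-up function, not from the PR), a doctest embeds an example in the docstring and lets the `doctest` module run it as a test:

```python
def increment(x):
    """Return x plus one.

    The example below is both documentation and a test:

    >>> increment(41)
    42
    """
    return x + 1

if __name__ == "__main__":
    # Runs every ">>>" example in this module; silent when all pass.
    import doctest
    doctest.testmod()
```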





[GitHub] spark issue #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemb...

2017-05-26 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18120
  
@keypointt Hi, could you help check whether this PR is consistent with your #17207? Thanks.





[GitHub] spark pull request #18120: [SPARK-20498][PYSPARK][ML] Expose getMaxDepth for...

2017-05-26 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18120

[SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemble tree model in PySpark

## What changes were proposed in this pull request?

add `getMaxDepth` method for ensemble tree models:
+ `RandomForestClassifierModel`
+ `RandomForestRegressionModel`
+ `GBTClassificationModel`
+ `GBTRegressionModel`


## How was this patch tested?

+ [ ] pass all unit tests.
+ [ ] add new doctest.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/pyspark_rf_add_get_max_depth

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18120.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18120


commit 96e797f79000964ea226efced7318f24a1722535
Author: Yan Facai (颜发才) 
Date:   2017-05-26T07:48:02Z

ENH: add maxDepth for ensemble tree model

commit 50815928f070f89ae71e235d51e833cef596a361
Author: Yan Facai (颜发才) 
Date:   2017-05-26T08:45:11Z

TST: add doctest







[GitHub] spark issue #18058: [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert)...

2017-05-25 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18058
  
Resolved.

By the way, which one is preferable: rebase or merge?






[GitHub] spark issue #18058: [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert)...

2017-05-24 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/18058
  
Hi, I'm not familiar with PySpark. I just wonder whether a unit test is needed for verification. If so, how should it be checked? Thanks.





[GitHub] spark pull request #18058: [SPARK-20768][PYSPARK][ML] Expose numPartitions (...

2017-05-24 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18058#discussion_r118416434
  
--- Diff: python/pyspark/ml/fpm.py ---
@@ -49,6 +49,32 @@ def getMinSupport(self):
 return self.getOrDefault(self.minSupport)
 
 
+class HasNumPartitions(Params):
+"""
+Mixin for param support.
+"""
+
+numPartitions = Param(
+Params._dummy(),
+"numPartitions",
+"""Number of partitions (at least 1) used by parallel FP-growth.
+By default the param is not set,
+and partition number of the input dataset is used.""",
+typeConverter=TypeConverters.toInt)
+
+def setNumPartitions(self, value):
+"""
+Sets the value of :py:attr:`numPartitions`.
+"""
+return self._set(numPartitions=value)
+
+def getNumPartitions(self):
+"""
+Gets the value of numPartitions or its default value.
--- End diff --

added.
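The mixin pattern in the diff above can be illustrated without pyspark (a hedged, self-contained sketch: `Params`, `HasNumPartitions`, and `FPGrowth` here are simplified stand-ins, not the real pyspark.ml classes):

```python
class Params:
    """Minimal stand-in for pyspark.ml.param.Params."""
    def __init__(self):
        self._param_map = {}

    def _set(self, **kwargs):
        # Record param values and return self, so setters chain.
        self._param_map.update(kwargs)
        return self

    def getOrDefault(self, name, default=None):
        return self._param_map.get(name, default)


class HasNumPartitions(Params):
    """Mixin bundling the numPartitions param with its accessors,
    as in the diff above."""
    def setNumPartitions(self, value):
        return self._set(numPartitions=int(value))

    def getNumPartitions(self):
        return self.getOrDefault("numPartitions")


class FPGrowth(HasNumPartitions):
    """Any estimator mixing in HasNumPartitions gets both methods."""
    pass


fp = FPGrowth().setNumPartitions(4)
print(fp.getNumPartitions())  # prints 4
```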




