[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-02-08 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
I read @jkbradley's thoughts here, so I will modify this as follows:

First we need 4 traits, with the following hierarchy:
LogisticRegressionSummary
LogisticRegressionTrainingSummary: LogisticRegressionSummary
BinaryLogisticRegressionSummary: LogisticRegressionSummary
BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary

Public methods such as `def summary` will only return the trait types listed above.

Then implement 4 concrete classes (sketched below):
`LogisticRegressionSummaryImpl` (multiclass case)
`LogisticRegressionTrainingSummaryImpl` (multiclass case)
`BinaryLogisticRegressionSummaryImpl` (binary case)
`BinaryLogisticRegressionTrainingSummaryImpl` (binary case)
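
A minimal Scala sketch of the idea (member names here are illustrative assumptions, not the final API):

```
trait LogisticRegressionSummary {
  def accuracy: Double // illustrative member
}
trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary {
  def objectiveHistory: Array[Double] // illustrative member
}
trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary {
  def areaUnderROC: Double // illustrative member
}
trait BinaryLogisticRegressionTrainingSummary
  extends LogisticRegressionTrainingSummary with BinaryLogisticRegressionSummary

// Concrete classes implement the traits; public methods such as `def summary`
// expose only the trait types above.
class BinaryLogisticRegressionTrainingSummaryImpl(
    override val accuracy: Double,
    override val areaUnderROC: Double,
    override val objectiveHistory: Array[Double])
  extends BinaryLogisticRegressionTrainingSummary
```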

Is that right? @jkbradley @sethah





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-02-17 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
cc @sethah @jkbradley Thanks!





[GitHub] spark issue #15730: [SPARK-18218][ML][MLLib] Optimize BlockMatrix multiplica...

2017-01-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15730
  
@brkyvz I updated the code and attached a screenshot of a run; waiting for your review, thanks!





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-01-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah Oh, thanks very much! I will update the code immediately. Since I submitted this PR I will work on it as my first priority. Happy new year!





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-01-10 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah Thanks for your careful review! Code updated. About the downcast issue, is there a better solution? I think it is difficult to solve ideally. cc @jkbradley @yanboliang thanks!





[GitHub] spark issue #15730: [SPARK-18218][ML][MLLib] Reduce shuffled data size of Bl...

2017-01-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15730
  
@brkyvz Done. Thanks!





[GitHub] spark issue #15730: [SPARK-18218][ML][MLLib] Reduce shuffled data size of Bl...

2017-01-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15730
  
Jenkins, test this please.





[GitHub] spark issue #15730: [SPARK-18218][ML][MLLib] Reduce shuffled data size of Bl...

2017-01-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15730
  
cc @yanboliang Thanks!





[GitHub] spark pull request #16574: [SPARK-19189] Optimize CartesianRDD to avoid part...

2017-01-13 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/16574

[SPARK-19189] Optimize CartesianRDD to avoid partition re-computation and 
re-serialization

## What changes were proposed in this pull request?

In the current CartesianRDD implementation, suppose RDDA cartesian RDDB generates RDDC: each RDDA partition is read by multiple RDDC partitions, and RDDB has the same problem. As a result, while an RDDC partition is computed, each partition's data in RDDA or RDDB is repeatedly serialized (and then transferred over the network), and if RDDA or RDDB hasn't been persisted, the RDDs are repeatedly recomputed.

In this PR, I change the dependency in `CartesianRDD` from `NarrowDependency` to `ShuffleDependency`, but keep the way the parent RDDs are partitioned; the computation of CartesianRDD keeps the current implementation.
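
For context, here is a hedged sketch of why the narrow dependency fans out (paraphrasing `CartesianRDD`'s dependency wiring; treat the exact details as an assumption):

```
// Sketch, not the exact source: child partition ids are laid out as
// i * numPartitionsInRdd2 + j, so each parent partition backs many children.
override def getDependencies: Seq[Dependency[_]] = List(
  new NarrowDependency(rdd1) {
    def getParents(id: Int): Seq[Int] = List(id / numPartitionsInRdd2)
  },
  new NarrowDependency(rdd2) {
    def getParents(id: Int): Seq[Int] = List(id % numPartitionsInRdd2)
  }
)
// Every rdd1 partition is a parent of numPartitionsInRdd2 child partitions
// (and symmetrically for rdd2), so an unpersisted parent is recomputed or
// reserialized once per child that reads it.
```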

## How was this patch tested?

Add a Cartesian test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark optimize_cartesian

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16574.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16574


commit ff7ff0dfb349bea7e41237be1410007a66f1eb28
Author: WeichenXu 
Date:   2017-01-07T02:14:52Z

init commit

commit 14ba3b24373d7a1d627bbc8b4b3d60ab6a92da07
Author: WeichenXu 
Date:   2017-01-11T04:56:15Z

init pr







[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

2017-01-13 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/16576

[SPARK-19215] Add necessary check for `RDD.checkpoint` to avoid potential 
mistakes

## What changes were proposed in this pull request?

Currently `RDD.checkpoint` must be called before any job is executed on the RDD; otherwise `doCheckpoint` will never be called. This is a pitfall: we should check for it and throw an exception (or at least log a warning) in such a case.
Also, if the RDD hasn't been persisted, checkpointing causes the RDD to be recomputed, because the current implementation runs a separate job for checkpointing. In that case we should also print a warning message reminding users to check whether they forgot to persist the RDD.
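
A minimal sketch of the proposed guard, assuming the `RDD` internals quoted in the review diffs below (the exact severity of each check is discussed there):

```
def checkpoint(): Unit = RDDCheckpointData.synchronized {
  if (doCheckpointCalled) {
    // A job already ran on this RDD, so checkpointing can never take effect.
    throw new SparkException(s"Checkpoint on RDD $id won't work because " +
      "a job has already been executed on it")
  }
  if (storageLevel == StorageLevel.NONE) {
    // The separate checkpoint job will recompute an unpersisted RDD.
    logWarning(s"RDD $id is not persisted; checkpointing will recompute it " +
      "when saving checkpoint files.")
  }
  // ... existing checkpoint logic ...
}
```

The test cases below exercise each of these paths.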

## How was this patch tested?

Manual.

Test case 1:
```
val rdd = sc.makeRDD(Array(1,2,3),3)
rdd.count()
rdd.checkpoint() // because `rdd.count` already executed, checkpoint would never take effect; this patch throws an exception here
```

Test case 2:
```
val rdd = sc.makeRDD(Array(1,2,3),3).map(_ + 10)
rdd.checkpoint() // `rdd` is not persisted, so checkpointing will recompute this RDD; this patch prints a warning here
rdd.count() // trigger `doCheckpoint`
```

Test case 3:
```
val rdd = sc.makeRDD(Array(1,2,3),3).map(_ + 10)
rdd.persist()
rdd.checkpoint() // this is the correct usage; this patch prints no warning
rdd.count() // trigger `doCheckpoint`
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark add_check_and_warning_for_rdd_checkpoint

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16576.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16576


commit 70fbfb07adbaca17831fd736661135f2d7b2b0e0
Author: WeichenXu 
Date:   2017-01-13T14:17:28Z

add check and warning msg for rdd checkpoint







[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

2017-01-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/16576
  
cc @rxin Thanks!






[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/16574
  
Jenkins, test this please.





[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

2017-01-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16576#discussion_r96112227
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1539,6 +1539,13 @@ abstract class RDD[T: ClassTag](
 // NOTE: we use a global lock here due to complexities downstream with 
ensuring
 // children RDD partitions point to the correct parent partitions. In 
the future
 // we should revisit this consideration.
+if (doCheckpointCalled) {
+  throw new SparkException("Because job has been executed on this RDD, 
checkpoint won't work")
+}
--- End diff --

All right, how about changing it to the following code:
```
if (doCheckpointCalled) {
  logWarning(s"Checkpoint on RDD ${id} won't work because a job has already been executed on it")
}
```





[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

2017-01-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16576#discussion_r96112271
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1539,6 +1539,13 @@ abstract class RDD[T: ClassTag](
 // NOTE: we use a global lock here due to complexities downstream with 
ensuring
 // children RDD partitions point to the correct parent partitions. In 
the future
 // we should revisit this consideration.
+if (doCheckpointCalled) {
+  throw new SparkException("Because job has been executed on this RDD, 
checkpoint won't work")
+}
+if (storageLevel == StorageLevel.NONE) {
+  logWarning("Call checkpoint on unpersisted RDD, it will cause RDD 
recomputation" +
+" when saving checkpoint files.")
+}
--- End diff --

Hmm... your advice is reasonable!
To solve (a), I can move this warning into `doCheckpoint`, avoiding the situation where `RDD.persist` is called later (see the sketch below).
About problem (b), how about changing this warning into `logInfo`?
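
A hedged sketch of that change, with the check moved into `doCheckpoint` and downgraded (field and method names assume the RDD internals quoted above):

```
private[spark] def doCheckpoint(): Unit = {
  if (checkpointData.isDefined && storageLevel == StorageLevel.NONE) {
    // Evaluated when the job runs, so a persist() issued after checkpoint()
    // but before the action is still taken into account.
    logInfo(s"RDD $id is being checkpointed without being persisted; " +
      "the checkpoint job will recompute it.")
  }
  // ... existing doCheckpoint logic ...
}
```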





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/16574
  
@mridulm Hmm... so keeping `NarrowDependency` seems better. But I think the recomputation is a serious problem when the parent RDDs are not persisted; in that case we should print a warning message reminding developers to check their Spark application code. What do you think?





[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

2017-01-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/16576
  
In the end, I think the current checkpoint behavior is strange, and it is very easy for users to use it the wrong way. For now the core logic is hard to modify, but at least we should add a useful warning; otherwise many Spark developers may be confused about why checkpoint sometimes fails, or their applications may recompute RDDs without them realizing it.





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/16574
  
@mridulm
Yeah, I know you are worried about the shuffle cost here. Currently, when `spark.shuffle.reduceLocality.enabled` is true (the default), each shuffle reducer is launched on the node with the largest outputs. So this PR's implementation gets good data locality, and its network transfer cost is similar to the current `NarrowDependency` implementation, IMO.

BUT, you mention that Cartesian can be implemented more efficiently using shuffling... I would like to research that and consider a better solution. Thanks!





[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

2017-01-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16576#discussion_r96118042
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1539,6 +1539,13 @@ abstract class RDD[T: ClassTag](
 // NOTE: we use a global lock here due to complexities downstream with 
ensuring
 // children RDD partitions point to the correct parent partitions. In 
the future
 // we should revisit this consideration.
+if (doCheckpointCalled) {
+  throw new SparkException("Because job has been executed on this RDD, 
checkpoint won't work")
+}
--- End diff --

Yeah, that is the pitfall in `RDD.checkpoint`, and it confuses developers, so I am trying to add a check for it.





[GitHub] spark issue #16576: [SPARK-19215] Add necessary check for `RDD.checkpoint` t...

2017-01-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/16576
  
@mridulm  code updated. thanks!





[GitHub] spark pull request #15730: [SPARK-18218][ML][MLLib] Reduce shuffled data siz...

2017-01-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15730#discussion_r96131333
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala
 ---
@@ -459,14 +464,155 @@ class BlockMatrix @Since("1.3.0") (
*/
   @Since("1.3.0")
   def multiply(other: BlockMatrix): BlockMatrix = {
+multiply(other, 1)
+  }
+
+  /**
+   * Left multiplies this [[BlockMatrix]] to `other`, another 
[[BlockMatrix]]. This method add
--- End diff --

All right, thanks!





[GitHub] spark issue #16574: [SPARK-19189] Optimize CartesianRDD to avoid parent RDD'...

2017-01-15 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/16574
  
I need to survey better Cartesian implementations, especially shuffle-based ones. I'll close this PR for now and reopen it once the new solution is ready.





[GitHub] spark pull request #16574: [SPARK-19189] Optimize CartesianRDD to avoid pare...

2017-01-15 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/16574





[GitHub] spark issue #15097: [SPARK-17540][SparkR][Spark Core] fix SparkR array serde...

2016-10-08 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15097
  
@felixcheung 
It is the case where the R side passes an empty array and the Scala-side parameter type is Array[T]: there is a problem then (as the TODO at serialize.R line 166 describes)... but passing an empty array as a parameter is rarely used, so I'll close this PR for now, thanks!





[GitHub] spark pull request #15097: [SPARK-17540][SparkR][Spark Core] fix SparkR arra...

2016-10-08 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/15097





[GitHub] spark pull request #15406: [Spark-17745][ml][PySpark] update NB python api

2016-10-09 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/15406

[Spark-17745][ml][PySpark] update NB python api

## What changes were proposed in this pull request?

Update the Python API for NaiveBayes.

## How was this patch tested?

Existing tests.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark nb_python_update

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15406.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15406


commit 50e0de7d2b03ad8c0ce3dc39313c92f0a7cd619e
Author: WeichenXu 
Date:   2016-10-09T16:10:54Z

update NB python api







[GitHub] spark issue #15406: [Spark-17745][ml][PySpark] update NB python api

2016-10-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15406
  
@sethah OK, and I'm checking whether there is anything else that needs updating...





[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2016-10-11 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/15435

[SPARK-17139][ML] Add model summary for MultinomialLogisticRegression

## What changes were proposed in this pull request?

Add a model summary for MultinomialLogisticRegression, currently using `MulticlassMetrics` to compute the summary.

Because the multiclass case differs from the binary case in some respects, I just expose the `MulticlassMetrics`-style summary interface for now.

I also add a `findSummaryModelAndPredictionCol` method because `MulticlassMetrics` needs the `predictionCol` value. Although `predictionCol` can be calculated from `probabilityCol`, putting code here to do that would duplicate logic, so I did not take that route (see the sketch below).
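
For the relationship between the two columns, a hedged standalone sketch (not the PR's `findSummaryModelAndPredictionCol` itself): the prediction is just the argmax of the probability vector, which is why deriving it inline would duplicate the model's own prediction logic.

```
import org.apache.spark.ml.linalg.Vector

// Sketch only: a prediction derived from a probability vector by argmax.
def probabilityToPrediction(probability: Vector): Double =
  probability.argmax.toDouble
```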

## How was this patch tested?

Existing tests.






You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark mlor_summary

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15435.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15435


commit e93740ead908251cebee22cc00059571a6e22a3f
Author: WeichenXu 
Date:   2016-10-11T00:25:45Z

update







[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2016-10-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah This PR has several details that need discussion; I'd be pleased to hear your opinion, thanks!





[GitHub] spark pull request #15406: [Spark-17745][ml][PySpark] update NB python api -...

2016-10-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15406#discussion_r83033809
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -1032,6 +1032,16 @@ class NaiveBayes(JavaEstimator, HasFeaturesCol, 
HasLabelCol, HasPredictionCol, H
 >>> result = model3.transform(test0).head()
 >>> result.prediction
 0.0
+>>> dfWithWeight = spark.createDataFrame([
--- End diff --

seems reasonable...





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2016-10-15 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah Good suggestion. code updated, thanks!





[GitHub] spark pull request #15516: [SPARK-17961][SparkR][SQL] Add storageLevel to Da...

2016-10-17 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/15516

[SPARK-17961][SparkR][SQL] Add storageLevel to Dataset for SparkR

## What changes were proposed in this pull request?

Add storageLevel to Dataset for SparkR.
This is similar to this PR: https://github.com/apache/spark/pull/13780

## How was this patch tested?

test added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark storageLevel_df_r

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15516.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15516


commit d4d5c52ba5dab9ec7ad3d41cffdcde70860e57de
Author: WeichenXu 
Date:   2016-10-16T02:26:01Z

update.

commit 4be3e5fe64a11e0ca0a68c45bc9985abe08dbf0d
Author: WeichenXu 
Date:   2016-10-17T17:03:35Z

update.







[GitHub] spark pull request #15550: [SPARK-18003][Spark Core] Fix bug of RDD zipWithI...

2016-10-18 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/15550

[SPARK-18003][Spark Core] Fix bug of RDD zipWithIndex generating wrong 
result when one partition contains more than 2147483647 records

## What changes were proposed in this pull request?

 Fix bug of RDD zipWithIndex generating wrong result when one partition 
contains more than 2147483647 records.
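
A hedged sketch of the failure mode and the fix pattern (illustrative helper, not the exact patch): Scala's `Iterator.zipWithIndex` counts with an `Int`, so within-partition indices wrap negative past `Int.MaxValue`; threading a `Long` counter through the partition avoids that.

```
// Buggy pattern: x._2 below is an Int and overflows past Int.MaxValue.
//   iter.zipWithIndex.map { x => (x._1, startIndex + x._2) }

// Fix pattern: maintain the index as a Long for the whole partition.
def zipWithLongIndex[T](iter: Iterator[T], startIndex: Long): Iterator[(T, Long)] = {
  var index = startIndex - 1L
  iter.map { t => index += 1L; (t, index) }
}
```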

## How was this patch tested?

test added.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark fix_rdd_zipWithIndex_overflow

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15550.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15550


commit 18c3f4914ca3fb5800dd68ac9d689b6f11cb2e92
Author: WeichenXu 
Date:   2016-10-18T01:59:24Z

update.







[GitHub] spark pull request #15550: [SPARK-18003][Spark Core] Fix bug of RDD zipWithI...

2016-10-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15550#discussion_r83999445
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -833,6 +833,30 @@ class RDDSuite extends SparkFunSuite with 
SharedSparkContext {
 }
   }
 
+  test("zipWithIndex with partition size exceeding MaxInt") {
+val result = sc.parallelize(Seq(1), 1).mapPartitions(
--- End diff --

All right, I'll update the test case.





[GitHub] spark pull request #15552: [SPARK-18007][SparkR][ML] update SparkR MLP - add...

2016-10-19 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/15552

[SPARK-18007][SparkR][ML] update SparkR MLP - add initialWeights parameter

## What changes were proposed in this pull request?

Update SparkR MLP: add an initialWeights parameter.

## How was this patch tested?

test added.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark mlp_r_add_initialWeight_param

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15552.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15552


commit cce77454bb9bc092a2adb6147c3dd9ba38b222a4
Author: WeichenXu 
Date:   2016-10-19T09:14:41Z

update.







[GitHub] spark issue #14898: [SPARK-16499][ML][MLLib] optimize ann algorithm where us...

2016-10-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14898
  
@srowen I researched BLAS implementations for this, and it shows that a level-1 BLAS operation written as a plain loop is fast enough; there is no need to use specific CPU instructions. Breeze does the loop with `cforRange` (a macro-based for loop) instead of Scala's `while`, but current performance tests between the two show no big difference, so I would like to keep the code as it is for now. Thanks! (A sketch of the two styles follows below.)
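
For reference, a sketch of the two loop styles being compared, using a level-1 BLAS axpy (`y += a * x`); the `cforRange` form assumes Breeze's macro import and is shown as a comment:

```
// Plain Scala while loop.
def axpyWhile(a: Double, x: Array[Double], y: Array[Double]): Unit = {
  var i = 0
  while (i < x.length) {
    y(i) += a * x(i)
    i += 1
  }
}

// Breeze's macro-based loop expands to an equivalent primitive loop:
//   cforRange(0 until x.length) { i => y(i) += a * x(i) }
```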





[GitHub] spark pull request #14898: [SPARK-16499][ML][MLLib] optimize ann algorithm w...

2016-10-19 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/14898





[GitHub] spark issue #15550: [SPARK-18003][Spark Core] Fix bug of RDD zipWithIndex ge...

2016-10-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15550
  
@tejasapatil I checked the code that references `RDD.zipWithIndex`; there are currently 7 usages. The existing test data doesn't generate big enough partitions, so they don't trigger the bug. The referencing code is all correct; there is no need to update it.





[GitHub] spark issue #15516: [SPARK-17961][SparkR][SQL] Add storageLevel to DataFrame...

2016-10-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15516
  
@felixcheung code updated. thanks!





[GitHub] spark pull request #15552: [SPARK-18007][SparkR][ML] update SparkR MLP - add...

2016-10-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15552#discussion_r84198787
  
--- Diff: R/pkg/R/mllib.R ---
@@ -706,10 +707,13 @@ setMethod("spark.mlp", signature(data = 
"SparkDataFrame"),
 if (!is.null(seed)) {
   seed <- as.character(as.integer(seed))
 }
+if (!is.null(initialWeights)) {
+  initialWeights <- 
as.array(as.numeric(na.omit(initialWeights)))
--- End diff --

The Scala-side wrapper needs an Array[Double] param, so I use as.array here, like other parameters such as `layers`.





[GitHub] spark pull request #15552: [SPARK-18007][SparkR][ML] update SparkR MLP - add...

2016-10-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15552#discussion_r84199689
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/r/MultilayerPerceptronClassifierWrapper.scala
 ---
@@ -73,6 +75,8 @@ private[r] object MultilayerPerceptronClassifierWrapper
   .setStepSize(stepSize)
   .setPredictionCol(PREDICTED_LABEL_COL)
 if (seed != null && seed.length > 0) mlp.setSeed(seed.toInt)
+if (initialWeights != null) 
mlp.setInitialWeights(Vectors.dense(initialWeights))
--- End diff --

The length must be a specific value: for example, if layers = [a, b, c], then the weight length = a * (b + 1) + b * (c + 1). This is checked inside the Scala MLP class, so does the R side need to re-check it?





[GitHub] spark issue #15550: [SPARK-18003][Spark Core] Fix bug of RDD zipWithIndex & ...

2016-10-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15550
  
@tejasapatil Checked other references to Scala's `zipWithIndex` and fixed the similar case in `RDD.zipWithUniqueId`, thanks!





[GitHub] spark pull request #15516: [SPARK-17961][SparkR][SQL] Add storageLevel to Da...

2016-10-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15516#discussion_r84218510
  
--- Diff: R/pkg/R/utils.R ---
@@ -385,6 +385,41 @@ getStorageLevel <- function(newLevel = c("DISK_ONLY",
  "OFF_HEAP" = callJStatic(storageLevelClass, 
"OFF_HEAP"))
 }
 
+storageLevelToString <- function(levelObj) {
+  useDisk <- callJMethod(levelObj, "useDisk")
+  useMemory <- callJMethod(levelObj, "useMemory")
+  useOffHeap <- callJMethod(levelObj, "useOffHeap")
+  deserialized <- callJMethod(levelObj, "deserialized")
+  replication <- callJMethod(levelObj, "replication")
+  if (!useDisk && !useMemory && !useOffHeap && !deserialized && 
replication == 1) {
--- End diff --

Python has its own `StorageLevel` class, and the Python-side code for `storageLevel` also has this duplicated-code problem... Making an R-side `StorageLevel` class might make the code more complex, and it doesn't seem to help with the duplication.
What do you think?





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-03-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah Don't worry, I will update the code ASAP, and @yanboliang will also help review it. Thanks!





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-03-21 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/17373

[SPARK-12664] Expose probability in mlp model

## What changes were proposed in this pull request?

Modify the MLP model to inherit `ProbabilisticClassificationModel` so that it can expose the probability column when transforming data.
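
The key piece is the probability conversion that `ProbabilisticClassificationModel` requires of subclasses; a hedged standalone sketch of normalizing raw class scores into probabilities (the PR's actual math may differ):

```
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Sketch: turn raw class scores into a probability vector by normalization.
// (ProbabilisticClassificationModel declares raw2probabilityInPlace for this.)
def raw2probability(raw: Vector): Vector = {
  val scores = raw.toArray
  val total = scores.sum
  Vectors.dense(scores.map(_ / total))
}
```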

## How was this patch tested?

Test added.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark expose_probability_in_mlp_model

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17373.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17373


commit 1f0da4e8ebff8509ac3bc6f06004cbecff6356e9
Author: WeichenXu 
Date:   2017-03-20T14:31:14Z

expose probability in mlp model







[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r107325971
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -194,15 +207,9 @@ class LogisticRegressionSuite
 // thresholds and threshold must be consistent: values
 withClue("fit with ParamMap should throw error if threshold, 
thresholds do not match.") {
   intercept[IllegalArgumentException] {
-lr2.fit(smallBinaryDataset,
-  lr2.thresholds -> Array(0.3, 0.7), lr2.threshold -> 
(expectedThreshold / 2.0))
-  }
-}
-withClue("fit with ParamMap should throw error if threshold, 
thresholds do not match.") {
-  intercept[IllegalArgumentException] {
-val lr2model = lr2.fit(smallBinaryDataset,
-  lr2.thresholds -> Array(0.3, 0.7), lr2.threshold -> 
(expectedThreshold / 2.0))
-lr2model.getThreshold
+val lr3 = lr2.copy(
--- End diff --

I will revert this. (There was previously a bug that caused this test case to fail, but it is fixed now.)





[GitHub] spark pull request #15435: [SPARK-17139][ML] Add model summary for Multinomi...

2017-03-21 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15435#discussion_r107326057
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -1786,51 +1793,98 @@ class LogisticRegressionSuite
   }
 
   test("evaluate on test set") {
-// TODO: add for multiclass when model summary becomes available
 // Evaluate on test set should be same as that of the transformed 
training data.
 val lr = new LogisticRegression()
   .setMaxIter(10)
   .setRegParam(1.0)
   .setThreshold(0.6)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.summary.asInstanceOf[BinaryLogisticRegressionSummary]
-
-val sameSummary =
-  
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
-assert(summary.areaUnderROC === sameSummary.areaUnderROC)
-assert(summary.roc.collect() === sameSummary.roc.collect())
-assert(summary.pr.collect === sameSummary.pr.collect())
+  .setFamily("binomial")
+val blorModel = lr.fit(smallBinaryDataset)
+val blorSummary = blorModel.binarySummary
+
+val sameBlorSummary =
+  
blorModel.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+assert(blorSummary.areaUnderROC === sameBlorSummary.areaUnderROC)
+assert(blorSummary.roc.collect() === sameBlorSummary.roc.collect())
+assert(blorSummary.pr.collect === sameBlorSummary.pr.collect())
 assert(
-  summary.fMeasureByThreshold.collect() === 
sameSummary.fMeasureByThreshold.collect())
-assert(summary.recallByThreshold.collect() === 
sameSummary.recallByThreshold.collect())
+  blorSummary.fMeasureByThreshold.collect() === 
sameBlorSummary.fMeasureByThreshold.collect())
 assert(
-  summary.precisionByThreshold.collect() === 
sameSummary.precisionByThreshold.collect())
+  blorSummary.recallByThreshold.collect() === 
sameBlorSummary.recallByThreshold.collect())
+assert(
+  blorSummary.precisionByThreshold.collect()
+=== sameBlorSummary.precisionByThreshold.collect())
+
+lr.setFamily("multinomial")
+val mlorModel = lr.fit(smallMultinomialDataset)
+val mlorSummary = mlorModel.summary
+
+val mlorSameSummary = mlorModel.evaluate(smallMultinomialDataset)
+
+assert(mlorSummary.truePositiveRateByLabel === 
mlorSameSummary.truePositiveRateByLabel)
+assert(mlorSummary.falsePositiveRateByLabel === 
mlorSameSummary.falsePositiveRateByLabel)
+assert(mlorSummary.precisionByLabel === 
mlorSameSummary.precisionByLabel)
+assert(mlorSummary.recallByLabel === mlorSameSummary.recallByLabel)
+assert(mlorSummary.fMeasureByLabel === mlorSameSummary.fMeasureByLabel)
+assert(mlorSummary.accuracy === mlorSameSummary.accuracy)
+assert(mlorSummary.weightedTruePositiveRate === 
mlorSameSummary.weightedTruePositiveRate)
+assert(mlorSummary.weightedFalsePositiveRate === 
mlorSameSummary.weightedFalsePositiveRate)
+assert(mlorSummary.weightedPrecision === 
mlorSameSummary.weightedPrecision)
+assert(mlorSummary.weightedRecall === mlorSameSummary.weightedRecall)
+assert(mlorSummary.weightedFMeasure === 
mlorSameSummary.weightedFMeasure)
   }
 
   test("evaluate with labels that are not doubles") {
 // Evaluate a test set with Label that is a numeric type other than 
Double
-val lr = new LogisticRegression()
+val blor = new LogisticRegression()
   .setMaxIter(1)
   .setRegParam(1.0)
-val model = lr.fit(smallBinaryDataset)
-val summary = 
model.evaluate(smallBinaryDataset).asInstanceOf[BinaryLogisticRegressionSummary]
+val blorModel = blor.fit(smallBinaryDataset)
+val blorSummary = blorModel.evaluate(smallBinaryDataset)
+  .asInstanceOf[BinaryLogisticRegressionSummary]
+
+val blorLongLabelData = 
smallBinaryDataset.select(col(blorModel.getLabelCol).cast(LongType),
+  col(blorModel.getFeaturesCol))
+val blorLongSummary = blorModel.evaluate(blorLongLabelData)
+  .asInstanceOf[BinaryLogisticRegressionSummary]
--- End diff --

OK, I agree to do it in a separate PR.





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-03-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah Code updated.
cc @yanboliang Please help move this PR forward, thanks!





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-03-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
Jenkins, test this please





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-03-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah I think `truePositiveRate` is equivalent to `recall`, so I implement one directly in terms of the other in the trait (see the sketch below), but if you don't like that approach I can change it.
Is there anything else that needs updating?
Thx!
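
A hedged sketch of what implementing it directly in the trait means here (names mirror the metrics under discussion, not necessarily the final API):

```
trait MulticlassSummarySketch {
  def recallByLabel: Array[Double]
  // TPR = TP / (TP + FN), which is exactly recall, so the trait can
  // provide truePositiveRateByLabel as a derived member.
  def truePositiveRateByLabel: Array[Double] = recallByLabel
}
```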





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-03-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah OK, no problem! I can move the method implementations into the trait.





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-03-10 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
jenkins, test please.





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-03-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
Done. cc @sethah @jkbradley thanks!





[GitHub] spark issue #15730: [SPARK-18218][ML][MLLib] Optimize BlockMatrix multiplica...

2016-11-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15730
  
@brkyvz

Good question about the shuffled data in the sparse case. Here is a simple (not very rigorous) analysis:
As discussed above, the shuffled data consists of step-1 and step-2. If we keep the parallelism the same, we need to increase `midDimSplitNum` and decrease `ShuffleRowPartitions` and `ShuffleColPartitions`; in that case the step-1 shuffle size always decreases and the step-2 shuffle size always increases.
Now consider the sparse case. The step-2 shuffle grows because when each pair of a left-matrix row set and a right-matrix column set is multiplied, the work is split into `midDimSplitNum` parts; each part does its own multiplication and all parts must then be aggregated, and the more parts there are to aggregate, the more data is shuffled. BUT, in the sparse case, these partial products are empty with high probability, and an empty part shuffles nothing. So we can expect that in the sparse case the step-2 shuffle won't grow much relative to the `midDimSplitNum = 1` case, because most of the split parts shuffle nothing.

I am now considering improving the API to make it easier to use: let the user specify the parallelism, and have the algorithm automatically compute the optimal `midDimSplitNum`. What do you think?

And thanks for the careful review!
 





[GitHub] spark pull request #15587: [SPARK-18051][Spark Core] fix bug of custom Parti...

2016-10-21 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/15587

[SPARK-18051][Spark Core] fix bug of custom PartitionCoalescer causing 
serialization exception

## What changes were proposed in this pull request?

mark `PartitionCoalescer` and `PartitionGroup`  as Serializable.

## How was this patch tested?

Manual. (Test code is in JIRA [SPARK-18051].)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark fix_coalescer_bug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15587.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15587


commit 2d88eaa66795239008366d19f8137682db59182e
Author: WeichenXu 
Date:   2016-10-21T15:27:38Z

update.







[GitHub] spark issue #15587: [SPARK-18051][Spark Core] fix bug of custom PartitionCoa...

2016-10-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15587
  
@rxin How about this: change the parameter type to 
`Option[PartitionCoalescer with Serializable]`, so that if a user forgets to 
mark the implementation class as serializable, the compiler reports an error 
instead of the runtime throwing an unreadable exception. What do you think? 
A sketch of what this would mean for user code is below.
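A minimal sketch, assuming the bound above were adopted (the coalescer body is illustrative only):
```
import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}

// With a parameter typed `Option[PartitionCoalescer with Serializable]`,
// forgetting the `Serializable` mix-in below becomes a compile-time error
// instead of a runtime serialization failure.
class OneToOneCoalescer extends PartitionCoalescer with Serializable {
  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] = {
    // illustrative: put every parent partition in its own group
    parent.partitions.map { p =>
      val group = new PartitionGroup()
      group.partitions += p
      group
    }
  }
}
```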





[GitHub] spark issue #15516: [SPARK-17961][SparkR][SQL] Add storageLevel to DataFrame...

2016-10-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15516
  
@felixcheung
Removed the unrelated jar file.
About the `String` look-up table: there doesn't seem to be a mapping 
relationship between these String constants, so I think the code is fine 
as-is and there is no need to add a look-up table.





[GitHub] spark issue #15587: [SPARK-18051][Spark Core] fix bug of custom PartitionCoa...

2016-10-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15587
  
@rxin Good suggestion. Code and docs updated, thanks!





[GitHub] spark pull request #15612: [SPARK-18078] Add option for customize zipPartiti...

2016-10-24 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/15612

[SPARK-18078] Add option for customize zipPartition task preferred locations

## What changes were proposed in this pull request?

add an option for `RDD.zipPartitions`, so that we can control the 
zipPartitions tasks' preferred locations to be consistent with a specific RDD 
(usually the RDD that is much larger than the other zipped RDDs).

I tested this improvement on spark-tfocs `DMatrix` multiplying `DVector`:

https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala

I generated the `DMatrix` RDD with 100 partitions, each containing about 
100MB of data, and the `DVector` also with 100 partitions, each about 1MB. 
I started the job with 10 executors of 4 cores each and set 
`spark.locality.wait` to 3000ms. The results show about a 30% performance 
improvement with this option; a sketch of the intended call shape follows.
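A minimal sketch (the method name follows this PR's proposal and is not a released API; the tiny RDDs stand in for the DMatrix/DVector above):
```
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("zip-demo").setMaster("local[4]"))
val large = sc.parallelize(1 to 1000000, 100)  // stands in for the 100 x ~100MB DMatrix partitions
val small = sc.parallelize(1 to 100, 100)      // stands in for the 100 x ~1MB DVector partitions

// proposed: zipped tasks take their preferred locations from the first
// parent (`large`), so the big partitions are not pulled across the network
val sums = large.zipPartitionsWithPreferredLocation(small) {
  (bigIter, smallIter) => Iterator(bigIter.sum + smallIter.sum)
}
```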

## How was this patch tested?

Manual.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark 
customize_zipPartition_task_preferredLocation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15612.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15612


commit e6d069008b931de7479819e84458645e9049254d
Author: WeichenXu 
Date:   2016-10-24T15:19:25Z

customize_zipPartition_task_preferredLocation







[GitHub] spark pull request #15612: [SPARK-18078] Add option for customize zipPartiti...

2016-10-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15612#discussion_r84821928
  
--- Diff: 
core/src/main/scala/org/apache/spark/rdd/ZippedPartitionsRDD.scala ---
@@ -58,10 +63,14 @@ private[spark] abstract class 
ZippedPartitionsBaseRDD[V: ClassTag](
 s"Can't zip RDDs with unequal numbers of partitions: 
${rdds.map(_.partitions.length)}")
 }
 Array.tabulate[Partition](numParts) { i =>
-  val prefs = rdds.map(rdd => 
rdd.preferredLocations(rdd.partitions(i)))
-  // Check whether there are any hosts that match all RDDs; otherwise 
return the union
-  val exactMatchLocations = prefs.reduce((x, y) => x.intersect(y))
-  val locs = if (!exactMatchLocations.isEmpty) exactMatchLocations 
else prefs.flatten.distinct
+  val locs = if (preferredLocationRDD != null) {
+
preferredLocationRDD.preferredLocations(preferredLocationRDD.partitions(i))
--- End diff --

Good suggestion, it needs fallback handling, thanks!





[GitHub] spark issue #15612: [SPARK-18078] Add option for customize zipPartition task...

2016-10-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15612
  
@rxin Looking forward to your advice, thanks!





[GitHub] spark issue #15516: [SPARK-17961][SparkR][SQL] Add storageLevel to DataFrame...

2016-10-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15516
  
@felixcheung @yanboliang thanks!





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2016-10-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah jkbradley doesn't seem to have been online recently; we could invite 
@yanboliang to give some advice.





[GitHub] spark issue #15552: [SPARK-18007][SparkR][ML] update SparkR MLP - add inital...

2016-10-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15552
  
@felixcheung
Updated the following places:
1. Clarified that the `initialWeights` parameter should be a vector.
2. Added a require check in the MLP Scala wrapper that if `initialWeights != 
null` its length is > 0.
3. Updated the example with initial weights.





[GitHub] spark pull request #15639: [Spark-Core]add defensive check for zipWithIndex

2016-10-25 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15639#discussion_r85052950
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -1765,6 +1765,7 @@ private[spark] object Utils extends Logging {
*/
   def getIteratorZipWithIndex[T](iterator: Iterator[T], startIndex: Long): 
Iterator[(T, Long)] = {
 new Iterator[(T, Long)] {
+  require(startIndex > 0, "startIndex should be > 0.")
--- End diff --

Shouldn't it be `startIndex >= 0`?





[GitHub] spark pull request #15639: [Spark-Core]add defensive check for zipWithIndex

2016-10-25 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15639#discussion_r85053450
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -1765,6 +1765,7 @@ private[spark] object Utils extends Logging {
*/
   def getIteratorZipWithIndex[T](iterator: Iterator[T], startIndex: Long): 
Iterator[(T, Long)] = {
 new Iterator[(T, Long)] {
+  require(startIndex > 0, "startIndex should be > 0.")
--- End diff --

Yeah, in this case the initial value is -1, but the first generated index is 0, 
because an `index += 1` statement runs first. A simplified sketch is below.
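A simplified sketch of the iterator logic under discussion, modeled on `Utils.getIteratorZipWithIndex`:
```
def zipWithIndexFrom[T](iter: Iterator[T], startIndex: Long): Iterator[(T, Long)] = {
  require(startIndex >= 0, "startIndex should be >= 0.")
  new Iterator[(T, Long)] {
    // starts at startIndex - 1; `index += 1` runs before each element is
    // returned, so the first emitted index is exactly startIndex
    private var index: Long = startIndex - 1L
    override def hasNext: Boolean = iter.hasNext
    override def next(): (T, Long) = {
      index += 1L
      (iter.next(), index)
    }
  }
}
```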





[GitHub] spark issue #15516: [SPARK-17961][SparkR][SQL] Add storageLevel to DataFrame...

2016-10-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15516
  
@felixcheung Updated the rdname; `unpersited-method` was also updated along the way.





[GitHub] spark issue #15639: [Spark-Core]add defensive check for zipWithIndex

2016-10-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15639
  
LGTM. thanks~





[GitHub] spark issue #15612: [SPARK-18078] Add zipPartitionsWithPreferredLocation fun...

2016-10-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15612
  
cc @rxin @mridulm Several updates, thanks!





[GitHub] spark issue #15612: [SPARK-18078] Add zipPartitionsWithPreferredLocation fun...

2016-10-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15612
  
Jenkins, test this please.





[GitHub] spark issue #15612: [SPARK-18078] Add zipPartitionsWithPreferredLocation fun...

2016-10-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15612
  
@mridulm Yeah, I'm thinking about this. A util class avoids polluting the RDD 
class, as @rxin prefers. The reason I changed the parameter from 
`preferredLocationsRDD` to `useFirstParentPreferredLocations` is that with 
`preferredLocationsRDD` we would need to dispose of the reference to 
`preferredLocationsRDD` when the zipPartitions RDD is cleaned, which makes the 
code messy. Would restricting the major RDD to be the first RDD bring any 
side effects? I think it is equivalent to the previous version, except that 
the current version may require the user to swap the parameter order.





[GitHub] spark pull request #15612: [SPARK-18078] Add zipPartitionsWithPreferredLocat...

2016-10-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15612#discussion_r85251725
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/ZipPartitionsRDDUtils.scala ---
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.rdd.{RDD, ZippedPartitionsRDD2, 
ZippedPartitionsRDD3,
+  ZippedPartitionsRDD4}
+
+object ZipPartitionsRDDUtils {
--- End diff --

OK.





[GitHub] spark pull request #15612: [SPARK-18078] Add zipPartitionsWithPreferredLocat...

2016-10-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15612#discussion_r85251773
  
--- Diff: 
core/src/main/scala/org/apache/spark/rdd/ZippedPartitionsRDD.scala ---
@@ -45,7 +45,8 @@ private[spark] class ZippedPartitionsPartition(
 private[spark] abstract class ZippedPartitionsBaseRDD[V: ClassTag](
 sc: SparkContext,
 var rdds: Seq[RDD[_]],
-preservesPartitioning: Boolean = false)
+preservesPartitioning: Boolean = false,
--- End diff --

I found `preservesPartitioning` used throughout the RDD class; do all of 
them need to be updated?





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2016-10-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah @jkbradley 

Now I am considering eliminating the `probabilityToPredictionUDF`.
There are some tricky problems around `predictionCol` and the summary 
class hierarchy:

Currently, the base interface `LogisticRegressionSummary` does not have the 
member `predictionCol`, which is needed in the MLOR summary.
If I add `predictionCol` to the base interface `LogisticRegressionSummary`, 
then `BinaryLogisticRegressionTrainingSummary` will also need to be modified, 
which seems to break API compatibility. Should we avoid that?

Also, should we make `BinaryLogisticRegressionTrainingSummary` a subclass of 
the MLOR summary? I would like to let @jkbradley decide.









[GitHub] spark issue #15612: [SPARK-18078] Add zipPartitionsWithPreferredLocation fun...

2016-10-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15612
  
@rxin 
Moved into the `rdd` package and added docs for `preservesPartitioner`. The 
code base has many other places using `preservesPartitioning`; I won't update 
them for now, so someone else can complete that variable-renaming work.
@mridulm I'd be glad to hear about any further possible improvements you find.





[GitHub] spark issue #15612: [SPARK-18078] Add zipPartitionsWithPreferredLocation fun...

2016-10-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15612
  
Jenkins, test this please.





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2016-10-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
All right. I will create a new class `MulticlassClassificationSummary`; that 
looks better.
And I will change `BinaryLogisticRegressionTrainingSummary` to be a subclass 
of the MLOR summary, which looks more reasonable. If there is any problem with 
such a hierarchy, let me know. @jkbradley 

And thanks @zhengruifeng for the good suggestion~





[GitHub] spark issue #15612: [SPARK-18078] Add zipPartitionsWithPreferredLocation fun...

2016-10-28 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15612
  
@rxin 
Documents updated for all methods,
and the var name `preservesPartitioning` is kept to stay consistent with 
other places in the code base.
Thanks!





[GitHub] spark issue #15612: [SPARK-18078] Add zipPartitionsWithPreferredLocation fun...

2016-10-28 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15612
  
Jenkins, test this please.





[GitHub] spark pull request #15716: [SPARK-18201] add toDense and toSparse into Matri...

2016-11-01 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/15716

[SPARK-18201] add toDense and toSparse into Matrix trait, like Vector design

## What changes were proposed in this pull request?

Add `toDense` and `toSparse` to the `Matrix` trait, following the `Vector` 
design, so that when we have a Matrix object `matrix: Matrix`, which may be 
dense or sparse,
we can call `matrix.toDense` to get a `DenseMatrix`
and `matrix.toSparse` to get a `SparseMatrix`. A short usage sketch is below.
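A short usage sketch, assuming the proposed methods were in place (`Matrices.dense` is the standard ml.linalg factory):
```
import org.apache.spark.ml.linalg.{Matrices, Matrix}

val m: Matrix = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 3.0))
val sp = m.toSparse   // SparseMatrix, whatever the runtime type of m
val de = sp.toDense   // back to a DenseMatrix
```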

## How was this patch tested?

N/A



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark 
add_toDense_toSparse_into_matrix_base

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15716.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15716


commit 102fa83866ef03f75c6198cb9a026822af90e1f2
Author: WeichenXu 
Date:   2016-11-01T02:31:46Z

init pr







[GitHub] spark issue #15716: [SPARK-18201][ML] Add toDense and toSparse into Matrix t...

2016-11-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15716
  
@sethah All right, I will take a look at it later, thanks~





[GitHub] spark pull request #15716: [SPARK-18201][ML] Add toDense and toSparse into M...

2016-11-01 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/15716





[GitHub] spark pull request #15730: [SPARK-18218][ML][MLLib] Optimize BlockMatrix mul...

2016-11-02 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/15730

[SPARK-18218][ML][MLLib] Optimize BlockMatrix multiplication and move it 
into ml package

## What changes were proposed in this pull request?

### The problem in current block matrix multiplication

As described in JIRA https://issues.apache.org/jira/browse/SPARK-18218, 
block matrix multiplication in Spark can cause problems. Suppose we multiply 
an M*N matrix A by an N*P matrix B, where N is much larger than M and P; then 
the following problems may occur:
- when the middle dimension N is too large, it causes reducer OOM.
- even if OOM does not occur, the parallelism stays too low.
- when N is much larger than M and P, and matrices A and B have many 
partitions, there may be too many partitions on the M and P dimensions, which 
causes a much larger shuffled data size. (I will explain this in detail below.)

### Key point of my improvement

In this PR, I introduce a `midDimSplitNum` parameter and improve the 
algorithm to resolve these problems.

To understand the improvement in this PR, first let me give a simple case to 
explain how the current multiplication works and what causes the problems 
above:

suppose we have block matrix A, containing 200 blocks (2 rowBlockNum * 100 
colBlockNum), arranged in 2 rows and 100 cols:
```
A00 A01 A02 ... A0,99
A10 A11 A12 ... A1,99
```
and we have block matrix B, also containing 200 blocks (100 rowBlockNum * 2 
colBlockNum), arranged in 100 rows and 2 cols:
```
B00B01
B10B11
B20B21
...
B99,0  B99,1
```
Suppose for now that all blocks in the two matrices are dense.
Now we call A.multiply(B). Suppose the generated `resultPartitioner` 
contains 2 rowPartitions and 2 colPartitions (there can't be more partitions, 
because the result matrix contains only 2 * 2 blocks). The current algorithm 
involves two shuffle steps:

**step-1**
Step-1 generates 4 reducers; I tag them as reducer-00, reducer-01, 
reducer-10, reducer-11. The data is shuffled as follows:
```
A00 A01 A02 ... A0,99
B00 B10 B20 ... B99,0shuffled into reducer-00

A00 A01 A02 ... A0,99
B01 B11 B21 ... B99,1shuffled into reducer-01

A10 A11 A12 ... A1,99
B00 B10 B20 ... B99,0shuffled into reducer-10

A10 A11 A12 ... A1,99
B01 B11 B21 ... B99,1shuffled into reducer-11
```

and the shuffle above is a `cogroup` transform; note that each reducer 
contains **only one group**.

**step-2**
Step-2 does an `aggregateByKey` transform on the result of step-1, also 
generating 4 reducers, and produces the final result RDD with 4 partitions, 
each containing one block.

The main problems are in step-1. We have only 4 reducers, but matrices A 
and B have 400 blocks in total, so the reducer count is obviously too small. 
Moreover, each reducer contains only one group (in the `cogroup` sense), and 
each group contains 200 blocks. This is bad because `cogroup` loads each 
group into memory when computing, so the algorithm does not scale: if matrix 
A had far more column blocks than the 100 here, each reducer would have to 
load correspondingly more blocks into memory, easily causing reducer OOM.

This PR tries to resolve the problem described above.
When a matrix A with dimension M * N multiplies a matrix B with dimension 
N * P, the middle dimension N is the key point; if N is large, the current 
multiplication implementation behaves badly.
In this PR, I introduce a `midDimSplitNum` parameter, representing how many 
splits to cut the middle dimension N into; a usage sketch follows.
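As a minimal illustration, the call could look like the sketch below; the exact method signature is an assumption for illustration, not a settled API:
```
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// a: M x N block matrix, b: N x P block matrix (construction elided)
def multiplyWithSplit(a: BlockMatrix, b: BlockMatrix): BlockMatrix = {
  // assumed shape: multiply(other, numMidDimSplits); splitting the middle
  // dimension into 10 parts multiplies the step-1 reducer count by 10
  a.multiply(b, 10)
}
```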
Still using the example described above, we now set `midDimSplitNum = 10`, 
so we can generate 40 reducers in **stage-1**:

Each reducer-ij above is now split into 10 reducers: reducer-ij0, 
reducer-ij1, ..., reducer-ij9, and each reducer receives 20 blocks.
The shuffle now works as follows:

**reducer-000 to reducer-009**
```
A0,0 A0,10 A0,20 ... A0,90
B0,0 B10,0 B20,0 ... B90,0shuffled into reducer-000

A0,1 A0,11 A0,21 ... A0,91
B1,0 B11,0 B21,0 ... B91,0shuffled into reducer-001

A0,2 A0,12 A0,22 ... A0,92
B2,0 B12,0 B22,0 ... B92,0shuffled into reducer-002

...

A0,9 A0,19 A0,29 ... A0,99
B9,0 B19,0 B29,0 ... B99,0shuffled into reducer-009
```

**reducer-010 to reducer-019**
```
A0,0 A0,10 A0,20 ... A0,90
B0,1 B10,1 B20,1 ... B90,1shuffled into reducer-010

A0,1 A0,11 A0,21 ... A0,91
B1,1 B11,1 B21,1 ... B91,1shuffled into reducer-011

A0,2 A0,12 A0,22 ... A0,92
B2,1 B12,1 B22,1 ... B92,1shuffled into reducer-012

...

   

[GitHub] spark issue #15730: [SPARK-18218][ML][MLLib] Optimize BlockMatrix multiplica...

2016-11-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15730
  
@yanboliang @sethah I would be very pleased to hear your opinions, thanks!





[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2016-11-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15628
  
@sethah I think this is a good idea. I am going to add the same method to 
`Vector`: for high-dimensional data training on LiR/LoR with L1 
regularization, compressed coefficient storage will let the algorithm 
broadcast the coefficients vector faster. What do you think? A sketch of the 
behavior I have in mind is below.
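For reference, a sketch of the `Vector.compressed` semantics I have in mind (it returns whichever of the dense or sparse encodings uses less storage):
```
import org.apache.spark.ml.linalg.Vectors

// mostly zeros: the compressed form is sparse
val v1 = Vectors.dense(0.0, 0.0, 0.0, 5.0).compressed
// no zeros: the compressed form stays dense
val v2 = Vectors.sparse(3, Array(0, 1, 2), Array(1.0, 2.0, 3.0)).compressed
```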





[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2016-11-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15628
  
@sethah Oh... I missed it.





[GitHub] spark pull request #16576: [SPARK-19215] Add necessary check for `RDD.checkp...

2017-01-17 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16576#discussion_r96573963
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1539,6 +1539,9 @@ abstract class RDD[T: ClassTag](
 // NOTE: we use a global lock here due to complexities downstream with 
ensuring
 // children RDD partitions point to the correct parent partitions. In 
the future
 // we should revisit this consideration.
+if (doCheckpointCalled) {
+  logWarning(s"Because job has been executed on RDD ${id}, checkpoint 
won't work")
--- End diff --

Re-ping @zsxwing, would you mind taking a look? This is a simple PR, but it 
will help Spark developers avoid a common misuse; a sketch of that misuse is below.
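For context, a minimal sketch of the misuse this warning targets (illustrative only):
```
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ckpt-demo").setMaster("local[2]"))
sc.setCheckpointDir("/tmp/ckpt")

val rdd = sc.parallelize(1 to 10).map(_ * 2)
rdd.count()       // a job has now executed on this RDD
rdd.checkpoint()  // too late: per this PR, the checkpoint silently won't happen
rdd.count()
```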





[GitHub] spark issue #15730: [SPARK-18218][ML][MLLib] Reduce shuffled data size of Bl...

2017-01-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15730
  
@brkyvz Also thanks for your careful code review! ^_^ 





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-02-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah 

About this issue:
Why is there a one-to-one overlap between MulticlassClassificationSummary 
and LogisticRegressionSummary, and why does MulticlassLogisticRegressionSummary 
inherit from them both?

If I merge MulticlassLogisticRegressionSummary into 
LogisticRegressionSummary to remove the one-to-one overlap between 
MulticlassClassificationSummary and LogisticRegressionSummary, it will cause 
**more API breakage**, because then BinaryLogisticRegressionTrainingSummary 
cannot extend LogisticRegressionTrainingSummary, and that breaks other public 
APIs such as `def summary`.
You can try the modification and compile the code to see the problem...
Maybe there is a better way, but I haven't thought of one.





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-02-02 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah 

If I merge MulticlassLogisticRegressionSummary into 
LogisticRegressionSummary, then, according to the currently designed 
hierarchy, it becomes:

class LogisticRegressionSummary extends MulticlassSummary with 
LogisticRegressionSummary
class LogisticRegressionTrainingSummary extends LogisticRegressionSummary 
with ...

**Note that LogisticRegressionTrainingSummary must now become a class, not 
a trait; if we merge MulticlassLogisticRegressionSummary into 
LogisticRegressionSummary, it has to be a class...**

Now consider `BinaryLogisticRegressionSummary`:

class BinaryLogisticRegressionSummary extends LogisticRegressionSummary
class BinaryLogisticRegressionTrainingSummary extends 
BinaryLogisticRegressionSummary

**Now a new problem occurs: BinaryLogisticRegressionTrainingSummary cannot 
extend LogisticRegressionTrainingSummary, because 
`LogisticRegressionTrainingSummary` has become a class, not a trait...**

**BinaryLogisticRegressionTrainingSummary not extending 
LogisticRegressionTrainingSummary causes more API breakage, such as for `def 
summary`...**

So these problems are troublesome, since they cause so much API breakage... 
A minimal Scala illustration is below.
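A minimal Scala illustration of the constraint (names shortened; hypothetical):
```
// If merging forces the training summary to become a class...
class LRSummary
class LRTrainingSummary extends LRSummary
class BinaryLRSummary extends LRSummary

// ...then the binary training summary cannot extend both: a Scala class
// may extend at most one class, and only traits can be mixed in.
// The line below does not compile:
// class BinaryLRTrainingSummary extends BinaryLRSummary with LRTrainingSummary
```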





[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-02-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15435
  
@sethah 

About your new design,

```
Summary
PredictionSummary extends Summary
ClassificationSummary extends PredictionSummary
ProbabilisticClassificationSummary extends ClassificationSummary
MulticlassClassificationSummary extends ClassificationSummary
BinaryClassificationSummary extends MulticlassClassificationSummary

Then we could implement model summaries like: 
LogisticRegressionTrainingSummary extends MulticlassClassificationSummary with 
ProbabilisticClassificationSummary with TrainingSummary

```
suppose `def summary` returns the `LogisticRegressionTrainingSummary` type: 
how do we deal with `BinaryClassificationSummary`?
In this hierarchy, `BinaryClassificationSummary` cannot inherit from 
`LogisticRegressionTrainingSummary`,
but the `def summary` method must support the binary case...
so it runs into a problem similar to the one I mentioned above.
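One pattern that could sidestep this is keeping `def summary` at the multiclass type and adding a separate accessor that downcasts for the binary case; a sketch with assumed names, not a settled API:
```
trait LRTrainingSummary
trait BinaryLRTrainingSummary extends LRTrainingSummary

// cast-based accessor: fails fast when the model is not binary
def binarySummary(summary: LRTrainingSummary): BinaryLRTrainingSummary = summary match {
  case b: BinaryLRTrainingSummary => b
  case _ => throw new RuntimeException("No binary summary available for this model.")
}
```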





[GitHub] spark issue #15730: [SPARK-18218][ML][MLLib] Optimize BlockMatrix multiplica...

2016-12-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15730
  
@brkyvz All right, I'll update code ASAP. Thanks!






[GitHub] spark pull request #19994: [SPARK-22810][ML][PySpark] Expose Python API for ...

2017-12-17 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19994#discussion_r157391801
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1725,6 +1725,27 @@ def test_offset(self):
 self.assertTrue(np.isclose(model.intercept, -1.561613, atol=1E-4))
 
 
+class LinearRegressionTest(SparkSessionTestCase):
+
+def test_linear_regression_with_huber_loss(self):
+
+data_path = "data/mllib/sample_linear_regression_data.txt"
+df = self.spark.read.format("libsvm").load(data_path)
+
+lir = LinearRegression(loss="huber")
--- End diff --

The test case should also exercise `setEpsilon`.





[GitHub] spark pull request #19994: [SPARK-22810][ML][PySpark] Expose Python API for ...

2017-12-17 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19994#discussion_r157391859
  
--- Diff: python/pyspark/ml/regression.py ---
@@ -155,6 +183,14 @@ def intercept(self):
 """
 return self._call_java("intercept")
 
+@property
+@since("2.3.0")
+def scale(self):
+"""
+The value by which \|y - X'w\| is scaled down when loss is "huber".
--- End diff --

add doc "When square loss the value is 1.0"





[GitHub] spark pull request #19988: [Spark-22795] [ML] Raise error when line search i...

2017-12-17 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19988#discussion_r157393049
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -244,9 +244,9 @@ class LinearSVC @Since("2.2.0") (
   }
 
   bcFeaturesStd.destroy(blocking = false)
-  if (state == null) {
+  if (state == null || state.searchFailed) {
 val msg = s"${optimizer.getClass.getName} failed."
-logError(msg)
+logWarning(msg)
 throw new SparkException(msg)
--- End diff --

But the msg in the Exception should include the root reason, "line search 
failed." The current msg only includes "LBFGS failed."





[GitHub] spark pull request #19993: [SPARK-22799][ML] Bucketizer should throw excepti...

2017-12-17 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19993#discussion_r157393913
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
* by `inputCol`. A warning will be printed if both are set.
*/
   private[feature] def isBucketizeMultipleColumns(): Boolean = {
-if (isSet(inputCols) && isSet(inputCol)) {
-  logWarning("Both `inputCol` and `inputCols` are set, we ignore 
`inputCols` and this " +
-"`Bucketizer` only map one column specified by `inputCol`")
-  false
+if (isSet(inputCols) && isSet(inputCol) || isSet(inputCols) && 
isSet(outputCol) ||
+  isSet(inputCol) && isSet(outputCols)) {
+  throw new IllegalArgumentException("Both `inputCol` and `inputCols` 
are set, `Bucketizer` " +
+"only supports setting either `inputCol` or `inputCols`.")
--- End diff --

Here it would be better to add one more check (a sketch follows below):
if single-column, only the `splits` param should be set;
if multi-column, only the `splitsArray` param should be set.
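A sketch of the suggested guard, written as a standalone check (param names from `Bucketizer`; the wording is illustrative):
```
// illustrative only: enforce that the splits param matches the column mode
def validateSplitsParams(
    hasInputCols: Boolean, hasSplits: Boolean, hasSplitsArray: Boolean): Unit = {
  if (hasInputCols) {
    require(hasSplitsArray && !hasSplits,
      "Bucketizer in multi-column mode must set `splitsArray`, not `splits`.")
  } else {
    require(hasSplits && !hasSplitsArray,
      "Bucketizer in single-column mode must set `splits`, not `splitsArray`.")
  }
}
```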






[GitHub] spark pull request #19988: [Spark-22795] [ML] Raise error when line search i...

2017-12-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19988#discussion_r157441600
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -244,9 +244,9 @@ class LinearSVC @Since("2.2.0") (
   }
 
   bcFeaturesStd.destroy(blocking = false)
-  if (state == null) {
+  if (state == null || state.searchFailed) {
 val msg = s"${optimizer.getClass.getName} failed."
-logError(msg)
+logWarning(msg)
 throw new SparkException(msg)
--- End diff --

Yes, I know. But sometimes people may want to catch the exception and get 
the detailed information from it, and that cannot be obtained here.





[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-12-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19621
  
@felixcheung Another failing test case: spark.mlp in SparkR also uses 
`RFormula` and will also produce non-deterministic results; see class 
`MultilayerPerceptronClassifierWrapper` line 78:
```
val rFormula = new RFormula()
  .setFormula(formula)
  .setForceIndexLabel(true)
  .setHandleInvalid(handleInvalid)
```
It cannot set the string order, and the default `frequencyDesc` order brings 
non-deterministic results.

If I only modify the test case in `spark.mlp`, then in the future, if the 
`StringIndexer` implementation changes, those tests will probably break 
again. What do you think? A sketch of pinning a deterministic order is below.
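For reference, a sketch of pinning a deterministic order on `StringIndexer` directly (assuming the `stringOrderType` param discussed in this thread; `RFormula` does not expose it here):
```
import org.apache.spark.ml.feature.StringIndexer

// "alphabetDesc"/"alphabetAsc" are deterministic, while the default
// "frequencyDesc" breaks ties between equally frequent labels arbitrarily
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setStringOrderType("alphabetDesc")
```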





[GitHub] spark pull request #19857: [SPARK-22667][ML][WIP] Fix model-specific optimiz...

2017-12-18 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/19857





[GitHub] spark issue #19857: [SPARK-22667][ML][WIP] Fix model-specific optimization s...

2017-12-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19857
  
The design for this issue has changed; @MrBago will take it over.





[GitHub] spark pull request #19350: [SPARK-22126][ML][WIP] Fix model-specific optimiz...

2017-12-18 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/19350




