[GitHub] spark issue #18904: [SPARK-21624]optimzie RF communicaiton cost

2018-01-15 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18904
  
Thanks @MLnick, I will be glad if you can continue it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2018-01-15 Thread mpjlu
Github user mpjlu closed the pull request at:

https://github.com/apache/spark/pull/18624


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

2018-01-15 Thread mpjlu
Github user mpjlu closed the pull request at:

https://github.com/apache/spark/pull/19516


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2018-01-15 Thread mpjlu
Github user mpjlu closed the pull request at:

https://github.com/apache/spark/pull/19337


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

2018-01-15 Thread mpjlu
Github user mpjlu closed the pull request at:

https://github.com/apache/spark/pull/19536


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17739: [SPARK-20443][MLLIB][ML] set ALS blockify size

2018-01-15 Thread mpjlu
Github user mpjlu closed the pull request at:

https://github.com/apache/spark/pull/17739


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17739: [SPARK-20443][MLLIB][ML] set ALS blockify size

2018-01-15 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/17739
  
Because I don't have the environment to continue this work, I will close 
it. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2018-01-15 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18624
  
Because I don't have the environment to continue this work, I will close 
it. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2018-01-15 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19337
  
Because I don't have the environment to continue this work, I will close 
it. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

2018-01-15 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19536
  
Because I don't have the environment to continue this work, I will close 
it. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on prepari...

2018-01-15 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19516
  
Because I don't have the environment to continue this work, I will close 
it. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18904: [SPARK-21624]optimzie RF communicaiton cost

2018-01-15 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18904
  
Because I don't have the environment to continue this work, I will close 
it. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18904: [SPARK-21624]optimzie RF communicaiton cost

2018-01-15 Thread mpjlu
Github user mpjlu closed the pull request at:

https://github.com/apache/spark/pull/18904


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18904: [SPARK-21624]optimzie RF communicaiton cost

2018-01-15 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18904
  
This is another case.
Table 1 shows the improvement of random tree algorithm with sparse 
expression. We can see that when we use sparse expression, I/O can be reduced 
by 61% and total run time can be reduced by 39%. The dataset has 100k samples 
and 10k features in Gaussian distribution and its number of partitions is 300. 
The max depth of RF is 17 and number of bins is 40.

![image](https://user-images.githubusercontent.com/13826327/34948723-f1f0a262-fa48-11e7-860b-b744daf6196d.png)

Only when the network is a bottleneck, this optimization will work better. 



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18904: [SPARK-21624]optimzie RF communicaiton cost

2018-01-15 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18904
  

![image](https://user-images.githubusercontent.com/13826327/34948104-2fa1982a-fa47-11e7-9312-f1935cca758b.png)
This is one of my test results.
Now, I am not working on Spark MLLIB, and don't have hardware to do more 
test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-12-11 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18624
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-12-11 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18624
  
Hi @holdenk, this is the PR we have discussed in Strata conference. 
![matrix 
multiply-topk-strata](https://user-images.githubusercontent.com/13826327/33865337-4e761a4a-df2c-11e7-8eed-1763429a78ea.jpg)

I have thought about the code again, for the performance, we can continue 
to optimize the code. Because we can merge the block matrices before shuffle. 
For only code simplification, I didn't get a good method. 
Any change needs many tests about the performance, and now I am in a DL 
team, and I will not have the test environment soon.
I will be glad if anyone wants to continue this work.
Thanks.  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18868: [SPARK-21638][ML]Fix RF/GBT Warning message error

2017-12-11 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18868
  
For your case, this doesn't need to fix, this warning is right. Your data 
size is too large


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-11-10 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18624
  
Thanks @WeichenXu123 , I will think about the method to simplify the code. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2017-11-10 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18624#discussion_r150269984
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -286,40 +288,119 @@ object MatrixFactorizationModel extends 
Loader[MatrixFactorizationModel] {
   srcFeatures: RDD[(Int, Array[Double])],
   dstFeatures: RDD[(Int, Array[Double])],
   num: Int): RDD[(Int, Array[(Int, Double)])] = {
-val srcBlocks = blockify(srcFeatures)
-val dstBlocks = blockify(dstFeatures)
-val ratings = srcBlocks.cartesian(dstBlocks).flatMap { case (srcIter, 
dstIter) =>
-  val m = srcIter.size
-  val n = math.min(dstIter.size, num)
-  val output = new Array[(Int, (Int, Double))](m * n)
+val srcBlocks = blockify(rank, srcFeatures).zipWithIndex()
+val dstBlocks = blockify(rank, dstFeatures)
+val ratings = srcBlocks.cartesian(dstBlocks).map {
+  case (((srcIds, srcFactors), index), (dstIds, dstFactors)) =>
+val m = srcIds.length
+val n = dstIds.length
+val dstIdMatrix = new Array[Int](m * num)
+val scoreMatrix = Array.fill[Double](m * 
num)(Double.NegativeInfinity)
+val pq = new BoundedPriorityQueue[(Int, 
Double)](num)(Ordering.by(_._2))
+
+val ratings = srcFactors.transpose.multiply(dstFactors)
+var i = 0
+var j = 0
+while (i < m) {
+  var k = 0
+  while (k < n) {
+pq += dstIds(k) -> ratings(i, k)
+k += 1
+  }
+  k = 0
+  pq.toArray.sortBy(-_._2).foreach { case (id, score) =>
+dstIdMatrix(j + k) = id
+scoreMatrix(j + k) = score
+k += 1
+  }
+  // pq.size maybe less than num, corner case
+  j += num
+  i += 1
+  pq.clear()
+}
+(index, (srcIds, dstIdMatrix, new DenseMatrix(m, num, scoreMatrix, 
true)))
+}
+ratings.aggregateByKey(null: Array[Int], null: Array[Int], null: 
DenseMatrix)(
+  (rateSum, rate) => mergeFunc(rateSum, rate, num),
+  (rateSum1, rateSum2) => mergeFunc(rateSum1, rateSum2, num)
+).flatMap { case (index, (srcIds, dstIdMatrix, scoreMatrix)) =>
+  // to avoid corner case that the number of items is less than 
recommendation num
+  var col: Int = 0
+  while (col < num && scoreMatrix(0, col) > Double.NegativeInfinity) {
+col += 1
+  }
+  val row = scoreMatrix.numRows
+  val output = new Array[(Int, Array[(Int, Double)])](row)
   var i = 0
-  val pq = new BoundedPriorityQueue[(Int, 
Double)](n)(Ordering.by(_._2))
-  srcIter.foreach { case (srcId, srcFactor) =>
-dstIter.foreach { case (dstId, dstFactor) =>
-  // We use F2jBLAS which is faster than a call to native BLAS for 
vector dot product
-  val score = BLAS.f2jBLAS.ddot(rank, srcFactor, 1, dstFactor, 1)
-  pq += dstId -> score
+  while (i < row) {
+val factors = new Array[(Int, Double)](col)
+var j = 0
+while (j < col) {
+  factors(j) = (dstIdMatrix(i * num + j), scoreMatrix(i, j))
+  j += 1
 }
-pq.foreach { case (dstId, score) =>
-  output(i) = (srcId, (dstId, score))
-  i += 1
+output(i) = (srcIds(i), factors)
+i += 1
+  }
+ output.toSeq}
+  }
+
+  private def mergeFunc(rateSum: (Array[Int], Array[Int], DenseMatrix),
+rate: (Array[Int], Array[Int], DenseMatrix),
+num: Int): (Array[Int], Array[Int], DenseMatrix) = 
{
+if (rateSum._1 == null) {
+  rate
+} else {
+  val row = rateSum._3.numRows
+  var i = 0
+  val tempIdMatrix = new Array[Int](row * num)
+  val tempScoreMatrix = Array.fill[Double](row * 
num)(Double.NegativeInfinity)
+  while (i < row) {
+var j = 0
+var sum_index = 0
+var rate_index = 0
+val matrixIndex = i * num
+while (j < num) {
+  if (rate._3(i, rate_index) > rateSum._3(i, sum_index)) {
+tempIdMatrix(matrixIndex + j) = rate._2(matrixIndex + 
rate_index)
+tempScoreMatrix(matrixIndex + j) = rate._3(i, rate_index)
+rate_index += 1
+  } else {
+tempIdMatrix(matrixIndex + j) = rateSum._2(matrixIndex + 
sum_index)
+tempScoreMatrix(matrixIndex + j) = rateSum._3(i, sum_index)
+sum_index += 1
+  }
+   

[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

2017-10-24 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19516#discussion_r146741139
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
 val featureAttributes: Array[Attribute] = if 
(origAttrGroup.attributes.nonEmpty) {
   origAttrGroup.attributes.get.zipWithIndex.filter(x => 
selector.contains(x._2)).map(_._1)
 } else {
-  Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
+  null
--- End diff --

Change the error description is ok, but ChiSqSelector maybe not the only 
case causes this error. StringIndexer, QuantileDiscretizer may also causes this 
error. 
Maybe we need a solution for this problem? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

2017-10-24 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19516#discussion_r146537275
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
 val featureAttributes: Array[Attribute] = if 
(origAttrGroup.attributes.nonEmpty) {
   origAttrGroup.attributes.get.zipWithIndex.filter(x => 
selector.contains(x._2)).map(_._1)
 } else {
-  Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
+  null
--- End diff --

The problem is should we allow add Nominal and not set value or numvalues. 
If we allow this,  should we change the code of getCategoricalFeatures?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

2017-10-24 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19516#discussion_r146528222
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
 val featureAttributes: Array[Attribute] = if 
(origAttrGroup.attributes.nonEmpty) {
   origAttrGroup.attributes.get.zipWithIndex.filter(x => 
selector.contains(x._2)).map(_._1)
 } else {
-  Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
+  null
--- End diff --

For this problem, either here or the following code should be changed, 
right? 
 " case nomAttr: NominalAttribute =>
  nomAttr.getNumValues match {
case Some(numValues: Int) => Iterator(idx -> numValues)
case None => throw new IllegalArgumentException(s"Feature 
$idx is marked as" +
  " Nominal (categorical), but it does not have the number 
of values specified.")
  }" 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

2017-10-24 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19516#discussion_r146527616
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
 val featureAttributes: Array[Attribute] = if 
(origAttrGroup.attributes.nonEmpty) {
   origAttrGroup.attributes.get.zipWithIndex.filter(x => 
selector.contains(x._2)).map(_._1)
 } else {
-  Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
+  null
--- End diff --

This function is in ChiSqSelectorModel, in the model you only have 
"selectedFeatures: Array[Int]", seems you cannot fill values and or numValues 
only with the model information. 




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-24 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19337
  
Any comments for this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

2017-10-24 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19536#discussion_r146511769
  
--- Diff: project/MimaExcludes.scala ---
@@ -1043,6 +1043,9 @@ object MimaExcludes {
   // [SPARK-21680][ML][MLLIB]optimzie Vector coompress
   
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.mllib.linalg.Vector.toSparseWithSize"),
   
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.linalg.Vector.toSparseWithSize")
+) ++ Seq(
+  // [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
+  
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.recommendation.ALS.train")
--- End diff --

I am ok to remove this, and use a loose threshold (e.g. 100), which is  
helpful for most cases. How about it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

2017-10-24 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19536
  
cc @srowen  @mengxr @yanboliang 
The performance data is updated, thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

2017-10-24 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19536
  


![image](https://user-images.githubusercontent.com/13826327/31934047-8e886360-b8dd-11e7-996a-d734ac39bc5b.png)


100k*100k, sparsity   0.05 | old | old | New
-- | -- | -- | --
rank | f2j | mkl | mkl
20 | 4 | 9 | 4
50 | 13 | 12 | 13
100 | 40 | 60 | 30
200 | 156 | 90 | 66
500 | 900 | 720 | 270




This is the performance data of this PR.
The date size is 100,000*100,000, the data sparsity is 0.05.
The test environment is:
3 workers: 35 cores/worker, 210G memory/worker, 1 executor/worker.
The native BLAS is Intel MKL



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

2017-10-19 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19536#discussion_r145735791
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala 
---
@@ -589,6 +602,9 @@ class ALS(@Since("1.4.0") override val uid: String) 
extends Estimator[ALSModel]
   @Since("2.2.0")
   def setColdStartStrategy(value: String): this.type = 
set(coldStartStrategy, value)
 
+  @Since("2.3.0")
+  def setThreshold(value: Int): this.type = set(threshold, value)
--- End diff --

Yes, my test results is better than the previous results,  especially for 
native BLAS.  I will update my test results here soon, and I will change this 
set. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

2017-10-19 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19536
  
Hi @yanboliang , I reopen this PR per your suggestion, thanks.
I have tested the code, the performance result is matched with @hqzizania 
's results. 
Thanks @hqzizania .


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

2017-10-19 Thread mpjlu
GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/19536

[SPARK-6685][ML]Use DSYRK to compute AtA in ALS

## What changes were proposed in this pull request?

This is a reopen of PR: https://github.com/apache/spark/pull/13891 with 
mima fix.
Please reference PR13891 for the performance result and discussion.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark 6658

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19536.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19536


commit 8fb4a82be98588795454127f4281363d105146f0
Author: hqzizania <qian.hu...@intel.com>
Date:   2016-06-24T06:26:31Z

add dsyrk to ALS

commit 7e3d238c62269f923832e7cba237f750082450a9
Author: hqzizania <qian.hu...@intel.com>
Date:   2016-06-24T16:28:57Z

implicitprefs ut fails fix

commit 3607bdcd22cf03c068e777a1927ee3d5da49045a
Author: hqzizania <qian.hu...@intel.com>
Date:   2016-06-28T13:51:55Z

use "while" loop instead of "for"
set stack size > 128 and comments added

commit 56194eb8dc3ea8c233dbb362eeb83af5091abcba
Author: hqzizania <qian.hu...@intel.com>
Date:   2016-06-29T10:57:57Z

add unit test for dostack ALS

commit dc4f4badba26635aa95c6ade5a589d4bd50ae886
Author: hqzizania <qian.hu...@intel.com>
Date:   2016-10-21T05:34:51Z

reset threshold values for doStack and remove UT

commit d29fd67a2a24675b7be2f7f51ba170fda11a85d7
Author: hqzizania <hqziza...@gmail.com>
Date:   2016-10-24T01:58:18Z

add threshold param to ALS

commit 294164d839b0ce191fee341b0eb82b81d506d8c8
Author: hqzizania <hqziza...@gmail.com>
Date:   2016-10-24T02:47:50Z

nit fix

commit 513e7915ecb807bc04ed8a17fdaa121e9ac578b5
Author: hqzizania <hqziza...@gmail.com>
Date:   2016-10-24T02:56:39Z

nit

commit f56b586092a24c1010326ed7f94b6b1eb427d378
Author: hqzizania <hqziza...@gmail.com>
Date:   2016-10-24T04:41:14Z

Merge remote-tracking branch 'origin/master' into ALSdsyrk

commit 1081e64c3fbd31c3d35b987b3200eae8c8c688e2
Author: hqzizania <hqziza...@gmail.com>
Date:   2016-10-24T04:44:52Z

mima fix

commit a6b5a16cd78e4efe99fda40f92592c9712b04146
Author: hqzizania <hqziza...@gmail.com>
Date:   2016-10-24T05:04:39Z

oops

commit 20774575ad51d2a10e47c8f8ee7581a1519ace6b
Author: hqzizania <hqziza...@gmail.com>
Date:   2016-10-25T17:28:45Z

solve mima failure

commit 061df755c80bbb050dcc85605f382b11e974f872
Author: Peng Meng <peng.m...@intel.com>
Date:   2017-10-19T11:22:21Z

reopen SPARK-6685




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19516: [SPARK-22277][ML]fix the bug of ChiSqSelector on ...

2017-10-17 Thread mpjlu
GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/19516

[SPARK-22277][ML]fix the bug of ChiSqSelector on preparing the output column

## What changes were proposed in this pull request?

To prepare the output columns when use ChiSqSelector,  the master method 
adds some additional feature attribute, this is not necessary, and sometimes 
cause error. 

`val featureAttributes: Array[Attribute] = if 
(origAttrGroup.attributes.nonEmpty) {
  origAttrGroup.attributes.get.zipWithIndex.filter(x => 
selector.contains(x._2)).map(_._1)
} else {
  Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
}
val newAttributeGroup = new AttributeGroup($(outputCol), 
featureAttributes)` 

## How was this patch tested?
The existing UT.  


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark testDFdirect

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19516.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19516


commit 3128133d76348666df82bf43aa42cd9ebae70faf
Author: Peng Meng <peng.m...@intel.com>
Date:   2017-10-17T13:04:08Z

fix the bug of ChiSqSelector on preparing the output column




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-11 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19337
  
For the comments about change the name of epsilon and add setter in 
localLADModel, we have agreed not to change it now after some offline 
discussion. 
Because epsilon doesn't control model convergence directly, and some other 
LDA implementations like Vowpal Vabbit also uses this name. 
Because there are many parameters in LDA, epsilon is just one of them, now 
there is no setter for any of them. If we need to add setter of them, we maybe 
add them together in another PR. 
Thanks. @hhbyyh @jkbradley 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-11 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r144060858
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,24 @@ private[clustering] trait LDAParams extends Params 
with HasFeaturesCol with HasM
   /**
* For Online optimizer only: [[optimizer]] = "online".
*
+   * A (positive) learning parameter that controls the convergence of 
variational inference.
+   * Smaller value will lead to more accuracy model and longer training 
time.
--- End diff --

Thanks, I will update the doc.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-10 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r143723408
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -573,7 +584,8 @@ private[clustering] object OnlineLDAOptimizer {
   expElogbeta: BDM[Double],
   alpha: breeze.linalg.Vector[Double],
   gammaShape: Double,
-  k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+  k: Int,
+  epsilon: Double = 1e-3): (BDV[Double], BDM[Double], List[Int]) = {
--- End diff --

Maybe not need to set, we can add epsilon as a parameter in the model to 
save it. Because most of training parameters are not in the model, so we don't 
add epsilon to the model here. Thanks. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-10 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r143705102
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -119,6 +121,8 @@ class LDASuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultRead
 assert(lda.getLearningDecay === 0.53)
 lda.setLearningOffset(1027)
 assert(lda.getLearningOffset === 1027)
+lda.setEpsilon(1e-3)
--- End diff --

Thanks @mgaido91 , I will update the code.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-10 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19337
  
Thanks, @hhbyyh.
 I will create a JIRA for python API  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-27 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r141272887
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params 
with HasFeaturesCol with HasM
   /**
* For Online optimizer only: [[optimizer]] = "online".
*
+   * @group expertParam
+   */
+  @Since("2.3.0")
+  final val epsilon = new DoubleParam(this, "epsilon", "(For online 
optimizer)" +
+" A (positive) learning parameter that controls the convergence of 
variational inference.",
+ParamValidators.gt(0))
--- End diff --

Yes, use 0.0 is good. Because other Double in this class is using 0, to 
keep the same code style, I also use 0 here.
It is ok to change all 0 to 0.0.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-26 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19337
  
Thanks @mgaido91 , I will update the ML api, maybe also python and java API.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-26 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19337
  
For LDA, the implementation is in mllib, ml calls mllib.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-26 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19337
  
Ok, thanks. we don't need to change the code here. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-26 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19337
  
Sorry, I got wrong. 
So you think assert is better here? now we use require.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-26 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19337
  
Because epsilon is Double,  negative value should not cause the code run 
into dead loop. All other setting in LDA using require for check or no check. 
Should we use assert only for this change?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-26 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19337
  
OK, I will change it to assert.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-26 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/19337
  
Not check is also ok, user should know epsilon > 0 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-25 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r140727204
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -573,7 +584,8 @@ private[clustering] object OnlineLDAOptimizer {
   expElogbeta: BDM[Double],
   alpha: breeze.linalg.Vector[Double],
   gammaShape: Double,
-  k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+  k: Int,
+  epsilon: Double = 1e-3): (BDV[Double], BDM[Double], List[Int]) = {
--- End diff --

This is for some other code,  like in LDAModel, which calls this function 
and does not need to set epsilon. 



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-25 Thread mpjlu
GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/19337

[SPARK-22114][ML][MLLIB]add epsilon for LDA

## What changes were proposed in this pull request?

The current convergence condition of OnlineLDAOptimizer is:
while(meanGammaChange > 1e-3)
The condition is critical for the performance and accuracy of LDA. 
We should keep this configurable, like it is in Vowpal Vabbit: 

https://github.com/JohnLangford/vowpal_wabbit/blob/430f69453bc4876a39351fba1f18771bdbdb7122/vowpalwabbit/lda_core.cc
 :638

## How was this patch tested?

The existing UT


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark setLDA

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19337.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19337


commit a6c3c79efe4d77ef4beba3f431fd9c2527735875
Author: Peng Meng <peng.m...@intel.com>
Date:   2017-09-25T09:12:47Z

add epsilon for LDA




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-09-04 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18748
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-09-04 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18624
  
Thanks @MLnick , I think the ML ALS suite is ok, just MLLIB ALS suite is 
too simple. One possible enhancement is to add the same test cases as ML ALS 
suite. How do you think about it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-08-20 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18748
  
Thanks @MLnick . I have double checked my test.
Since there is no  recommendForUserSubset , my previous test is MLLIB 
MatrixFactorizationModel::predict(RDD(Int, Int)), which predicts the rating of 
many users for many products. The performance of this function is low comparing 
with recommendForAll. 
This PR calls recommendForAll with a subset of the users, I agree with your 
test results. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimize Vector compress

2017-08-15 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18899
  
I have tested the performance of toSparse and toSparseWithSize separately.  
There is about 35% performance improvement for this change. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18904: [SPARK-21624]optimzie RF communicaiton cost

2017-08-13 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18904
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-13 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18899
  
Thanks @sethah @srowen . The comment is added.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-11 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18899
  
I did not only test this PR. Only work for PR 18904 and find this 
performance difference. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-11 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18899
  
Hi @srowen; how about using our first version? though duplicate some code, 
but change is small.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-11 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18899
  
For PR-18904, before this change, one iteration is about 58s, after this 
change, one iteration is about:40s


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18904: [SPARK-21624]optimzie RF communicaiton cost

2017-08-11 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18904
  
A gentle ping: @sethah @jkbradley 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-11 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18899
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18899
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18899
  
Hi @sethah , the unit test is added.  Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18899#discussion_r132610165
  
--- Diff: project/MimaExcludes.scala ---
@@ -1012,6 +1012,10 @@ object MimaExcludes {
   
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.classification.RandomForestClassificationModel.setFeatureSubsetStrategy"),
   
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees"),
   
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.regression.RandomForestRegressionModel.setFeatureSubsetStrategy")
+) ++ Seq(
+  // [SPARK-21680][ML][MLLIB]optimzie Vector coompress
--- End diff --

The error message is"method toSparse(nnz: Int) in trait is present only in 
current version"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18899#discussion_r132610049
  
--- Diff: 
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala ---
@@ -635,8 +642,9 @@ class SparseVector @Since("2.0.0") (
 nnz
   }
 
-  override def toSparse: SparseVector = {
-val nnz = numNonzeros
+  override def toSparse: SparseVector = toSparse(numNonzeros)
--- End diff --

If define 
`def toSparse: SparseVector = toSparse(numNonzeros)`
in the superclass, when call dv.toSparse (there are this kinds of call in 
the code), there will be error message:
Both toSparse in the DenseVector of type (nnz:Int) 
org.apache.spark.ml.linalg.SparseVector and toSparse in trait Vector of type 
=>org.apache.spark.ml.linalg.SparseVector match  .
So we should change the name of toSparse(nnz: Int), maybe 
toSparseWithSize(nnz: Int).



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18868: [SPARK-21638][ML]Fix RF/GBT Warning message error

2017-08-10 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18868
  
Yes, that is right. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18899
  
Thanks @srowen. 
I will revise the code per your suggestion.
when I wrote the code, I just concerned user call toSparse(size) and give a 
very small size.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18904: [SPARK-21624]optimzie RF communicaiton cost

2017-08-10 Thread mpjlu
GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/18904

[SPARK-21624]optimzie RF communicaiton cost

## What changes were proposed in this pull request?

The implementation of RF is bound by either the cost of statistics 
computation on workers or by communicating the sufficient statistics.
This PR will focus on optimizing communication cost.
The statistics are stored in allStats:
`  private var allStats: Array[Double] = new Array[Double](allStatsSize)`
In real use case, the size of allStats is very large, and it can be very 
sparse, especially on the nodes that near the leave of the tree. 
It is better to change allStats from Array to SparseVector before shufffle.
My tests show the communication is down by about **50% to 90%** with this 
PR. 
Test cases: 
1000 features * 200Bins* 2 label, 
1000 features * 50Bins * 2label
1 features * 200Bins * 2 label
featureSubsetStrategy: auto, 0.2, 0.1
All test cases with the config: spark.shuffle.compress=true.

## How was this patch tested?
The exist UT



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark optRFcomm

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18904.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18904


commit 35d1f244f918bd8ea7fe7fdf10796a64e7a62fc9
Author: Peng Meng <peng.m...@intel.com>
Date:   2017-08-10T07:19:59Z

optimzie RF communicaiton cost




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18899
  
Yes, I just concern if add toSparse(size) we should check the size in the 
code, there will be no performance gain. If we don't need to check the "size" 
(comparing size with numNonZero) in the code, add toSparse(size) is ok. 
Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-09 Thread mpjlu
GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/18899

[SPARK-21680][ML][MLLIB]optimzie Vector coompress

## What changes were proposed in this pull request?

When use Vector.compressed to change a Vector to SparseVector, the 
performance is very low comparing with Vector.toSparse.
This is because you have to scan the value three times using 
Vector.compressed, but you just need two times when use Vector.toSparse.
When the length of the vector is large, there is significant performance 
difference between this two method.

## How was this patch tested?

The existing UT


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark optVectorCompress

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18899.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18899


commit 5dc5c89242a0c2a5ac6a693c3703eef8ee160616
Author: Peng Meng <peng.m...@intel.com>
Date:   2017-08-10T01:59:17Z

optimzie Vector coompress




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18868: [SPARK-21638][ML]Fix RF/GBT Warning message error

2017-08-07 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18868
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18868: [SPARK-21638][ML]Fix RF/GBT Warning message error

2017-08-07 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18868#discussion_r131605981
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1107,9 +1108,11 @@ private[spark] object RandomForest extends Logging {
 mutableTreeToNodeToIndexInfo
   .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
NodeIndexInfo]())(node.id)
   = new NodeIndexInfo(numNodesInGroup, featureSubset)
+numNodesInGroup += 1
+memUsage += nodeMemUsage
+  } else {
+flag = false
--- End diff --

We cannot clear stack here, because the nodes in the stack are used for all 
iteration. 
Native scala doesn't support break, we have to import other package to use 
break, seems not necessary. 
We cannot use nodeStack.pop at the top, because we don't know whether it 
can be grouped here.
How about change "flag" to "groupDone" means Group is done, don't need to 
loop again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-07 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18832
  
Thanks @srowen , I revised the comments per Seth's suggestion: "Parent 
stats need to be explicitly tracked in the DTStatsAggregator because the parent 
[[Node]] object does not have ImpurityStats on the first iteration."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18868: [SPARK-21638][ML]Fix RF/GBT Warning message error

2017-08-07 Thread mpjlu
GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/18868

[SPARK-21638][ML]Fix RF/GBT Warning message error

## What changes were proposed in this pull request?

When train RF model, there are many warning messages like this:

> WARN  RandomForest: Tree learning is using approximately 268492800 bytes 
per iteration, which exceeds requested limit maxMemoryUsage=268435456. This 
allows splitting 2622 nodes in this iteration.

This warning message is unnecessary and the data is not accurate.

Actually, if all the nodes cannot split in one iteration, it will show this 
warning. For most of the case, all the nodes cannot split just in one 
iteration, so for most of the case, it will show this warning for each 
iteration.

## How was this patch tested?
The existing UT


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark fixRFwarning

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18868.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18868


commit f15f35e53115babd48357ec256f81e92e3480a6e
Author: Peng Meng <peng.m...@intel.com>
Date:   2017-08-07T08:05:24Z

fix RF warning




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18832
  
Thanks @sethah .
I strongly think we should update the commend or just delete the comment as 
the current PR.
Another reason is: there are three kinds of feature: categorical, ordered 
categorical, and continuous
Only the first iteration of categorical feature need parentStats, the other 
two don't need. The comment seems all first iteration need parentStats. 
   


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18832
  
I agree with you.  Do you think we should update the comment to help others 
understand the code.
Since parantStats is updated and used in each iteration.
Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18832
  
I know your point.
I am confusing the code doesn't work that way.
The code update parentStats for each iteration. Actually, we only need to 
update parentStats for the first Iteration.   
So we should update the code? 
Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18832
  
parentStats is used in this 
code: binAggregates.getParentImpurityCalculator(), this is used in all 
iteration.
So that comment seems very misleading.
`} else if (binAggregates.metadata.isUnordered(featureIndex)) {
  // Unordered categorical feature
  val leftChildOffset = 
binAggregates.getFeatureOffset(featureIndexIdx)
  val (bestFeatureSplitIndex, bestFeatureGainStats) =
Range(0, numSplits).map { splitIndex =>
  val leftChildStats = 
binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
  val rightChildStats = 
binAggregates.getParentImpurityCalculator()
.subtract(leftChildStats)
  gainAndImpurityStats = 
calculateImpurityStats(gainAndImpurityStats,
leftChildStats, rightChildStats, binAggregates.metadata)
  (splitIndex, gainAndImpurityStats)
}.maxBy(_._2.gain)
  (splits(featureIndex)(bestFeatureSplitIndex), 
bestFeatureGainStats)
}`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18832
  
node.stats is ImpurityStats, and parentStats is Array[Double], there are 
different. Maybe this comment should be used on node.stats, but not on 
parentStats. Is my understanding wrong?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/18832

[SPARK-21623][ML]fix RF doc

## What changes were proposed in this pull request?

comments of parentStats in RF are wrong.
parentStats is not only used for the first iteration, it is used with all 
the iteration for unordered features.

## How was this patch tested?



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark fixRFDoc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18832.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18832


commit 83c75043fee2a20f1eb6298bd2dab1259409c3ef
Author: Peng Meng <peng.m...@intel.com>
Date:   2017-08-03T08:26:51Z

fix RF doc




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-08-01 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18748
  
Thanks.
This is my test setting:
3 workers, each: 40 cores, 196G memory,  1 executor.
Data Size: user 480,000, item 17,000


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-31 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18748
  
Did you test the performance of this, I tested the performance of MLLIB  
recommendForUserSubset some days ago, the performance is not good. Suppose the 
time of recommendForAll is 35s, recommend for 1/3 Users use this may need 90s. 
Maybe it is faster to use recommendForAll then select 1/3 users.  But if 
recommend tens or hundreds of users, this is faster than recommendForAll. So 
should we add come commends in the code about when it is better to use 
recommendForUserSubset. 
Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18624
  
Hi @srowen @MLnick @jkbradley @mengxr @yanboliang 
Is this change acceptable?  if it is acceptable,  I will update ALS ML code 
following this method. Also update Test Suite, which are too simple, can not 
detect ALS errors.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18620
  
Thanks @srowen ,  my test also said pq.poll is a little faster on some 
cases.
One possible benefit here is if we provide pq.poll, user's first choice may 
use pq.poll, not pq.toArray.sorted, which may causes performance reduction. As 
I have encounter for https://github.com/apache/spark/pull/18624


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18620
  
I am ok to close this. Thanks @MLnick 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18620
  
My micro benchmark (write a program only test pq.toArray.sorted and 
pq.Array.sortBy and pq.poll), not find significant performance difference. Only 
in the Spark job, there is big difference. Confused.   


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18620
  
I also very confused about this. You can change 
https://github.com/apache/spark/pull/18624 to sorted and test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18620
  
Hi @MLnick , @srowen .
My test showing: pq.poll is not significantly faster than 
pq.toArray.sortBy, but significantly faster than pq.toArray.sorted.  Seems not 
each pq.toArray.sorted (such as used in topByKey) can be replaced by 
pq.toArray.sortBy, so use pq.poll to replace pq.toArray.sorted will benefit.
You can compare the performance of pq.sorted, pq.sortBy, and pq.poll using: 
 https://github.com/apache/spark/pull/18624
The performance of pq.toArray.sortBy is about the same as pq.poll, and 
about 20% improvement comparing pq.toArray.sorted. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-17 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18620
  
Hi @MLnick , 
pq.toArray.sorted also used in other places, like word2vector and LDA, how 
about waiting for my other benchmark results. Then decide to close it or not. 
Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-17 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18620
  
I have tested much about poll and toArray.sorted. 
If the queue is much ordered (suppose offer 2000 times for queue size 20). 
Use pq.toArray.sorted is faster.
If the queue is much disordered (suppose offer 100 times for queue size 
20), Use pq.poll is much faster.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-17 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18620
  
Keep it or close it, both is ok for me. We have much discussion on:
https://issues.apache.org/jira/browse/SPARK-21401


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2017-07-17 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18624#discussion_r127669102
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -286,40 +288,120 @@ object MatrixFactorizationModel extends 
Loader[MatrixFactorizationModel] {
   srcFeatures: RDD[(Int, Array[Double])],
   dstFeatures: RDD[(Int, Array[Double])],
   num: Int): RDD[(Int, Array[(Int, Double)])] = {
-val srcBlocks = blockify(srcFeatures)
-val dstBlocks = blockify(dstFeatures)
-val ratings = srcBlocks.cartesian(dstBlocks).flatMap { case (srcIter, 
dstIter) =>
-  val m = srcIter.size
-  val n = math.min(dstIter.size, num)
-  val output = new Array[(Int, (Int, Double))](m * n)
+val srcBlocks = blockify(rank, srcFeatures).zipWithIndex()
+val dstBlocks = blockify(rank, dstFeatures)
+val ratings = srcBlocks.cartesian(dstBlocks).map {
+  case (((srcIds, srcFactors), index), (dstIds, dstFactors)) =>
+val m = srcIds.length
+val n = dstIds.length
+val dstIdMatrix = new Array[Int](m * num)
+val scoreMatrix = Array.fill[Double](m * 
num)(Double.NegativeInfinity)
+val pq = new BoundedPriorityQueue[(Int, 
Double)](num)(Ordering.by(_._2))
+
+val ratings = srcFactors.transpose.multiply(dstFactors)
+var i = 0
+var j = 0
+while (i < m) {
+  var k = 0
+  while (k < n) {
+pq += dstIds(k) -> ratings(i, k)
+k += 1
+  }
+  var size = pq.size
+  while (size > 0) {
+size -= 1
+val factor = pq.poll()
--- End diff --

Hi @MLnick , thanks for your review.
My original test for sorted is using: pq.toArray.sorted(Ordering.By[(Int, 
Double), Double](-_._2)), 
because pq.toArray.sorted(-_._2) build error. Maybe there is 
boxing/unboxing, the  performance is very bad.
Now, I use pq.toArray.sortBy(-_._2), the performance is good than poll. 
this 25s vs poll 26s.
Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2017-07-17 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18624#discussion_r127641933
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -286,40 +288,120 @@ object MatrixFactorizationModel extends 
Loader[MatrixFactorizationModel] {
   srcFeatures: RDD[(Int, Array[Double])],
   dstFeatures: RDD[(Int, Array[Double])],
   num: Int): RDD[(Int, Array[(Int, Double)])] = {
-val srcBlocks = blockify(srcFeatures)
-val dstBlocks = blockify(dstFeatures)
-val ratings = srcBlocks.cartesian(dstBlocks).flatMap { case (srcIter, 
dstIter) =>
-  val m = srcIter.size
-  val n = math.min(dstIter.size, num)
-  val output = new Array[(Int, (Int, Double))](m * n)
+val srcBlocks = blockify(rank, srcFeatures).zipWithIndex()
+val dstBlocks = blockify(rank, dstFeatures)
+val ratings = srcBlocks.cartesian(dstBlocks).map {
+  case (((srcIds, srcFactors), index), (dstIds, dstFactors)) =>
+val m = srcIds.length
+val n = dstIds.length
+val dstIdMatrix = new Array[Int](m * num)
+val scoreMatrix = Array.fill[Double](m * 
num)(Double.NegativeInfinity)
+val pq = new BoundedPriorityQueue[(Int, 
Double)](num)(Ordering.by(_._2))
+
+val ratings = srcFactors.transpose.multiply(dstFactors)
+var i = 0
+var j = 0
+while (i < m) {
+  var k = 0
+  while (k < n) {
+pq += dstIds(k) -> ratings(i, k)
+k += 1
+  }
+  var size = pq.size
+  while (size > 0) {
+size -= 1
+val factor = pq.poll()
--- End diff --

When num = 20, if use sorted here, the prediction time is about 31s,  if 
use poll, the prediction time is about 26s. I think this difference is large. I 
have tested many times. The result is about the same.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-07-16 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18624
  
I have checked the results with the master method, the recommendation 
results are right. 
The master TestSuite is too simple, should be updated.  I will update it.
Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-07-14 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18624
  
If no poll, we have to use toArray.sorted, which performance is bad.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-07-14 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18624
  
We need the value is in order here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-07-14 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18624
  
An user block, after Cartesian, will generate many blocks(Number of Item 
blocks), all these blocks should be aggregated.  Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-14 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18620
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-07-13 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/17742
  
I have submitted PR for ALS optimization with GEMM. and it is ready for 
review. 
The performance is about 50% improvement comparing with the master method.
https://github.com/apache/spark/pull/18624


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-13 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18620
  
Hi @srowen , I have added Test Suite for BoundedPriorityQueue. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2017-07-13 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18624#discussion_r127214361
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
 ---
@@ -286,40 +288,124 @@ object MatrixFactorizationModel extends 
Loader[MatrixFactorizationModel] {
   srcFeatures: RDD[(Int, Array[Double])],
   dstFeatures: RDD[(Int, Array[Double])],
   num: Int): RDD[(Int, Array[(Int, Double)])] = {
-val srcBlocks = blockify(srcFeatures)
-val dstBlocks = blockify(dstFeatures)
-val ratings = srcBlocks.cartesian(dstBlocks).flatMap { case (srcIter, 
dstIter) =>
-  val m = srcIter.size
-  val n = math.min(dstIter.size, num)
-  val output = new Array[(Int, (Int, Double))](m * n)
-  var i = 0
-  val pq = new BoundedPriorityQueue[(Int, 
Double)](n)(Ordering.by(_._2))
-  srcIter.foreach { case (srcId, srcFactor) =>
-dstIter.foreach { case (dstId, dstFactor) =>
-  // We use F2jBLAS which is faster than a call to native BLAS for 
vector dot product
-  val score = BLAS.f2jBLAS.ddot(rank, srcFactor, 1, dstFactor, 1)
-  pq += dstId -> score
-}
-pq.foreach { case (dstId, score) =>
-  output(i) = (srcId, (dstId, score))
+val srcBlocks = blockify(rank, srcFeatures).zipWithIndex()
+val dstBlocks = blockify(rank, dstFeatures)
+val ratings = srcBlocks.cartesian(dstBlocks).map {
+  case (((srcIds, srcFactors), index), (dstIds, dstFactors)) =>
+val m = srcIds.length
+val n = dstIds.length
+val dstIdMatrix = new Array[Int](m * num)
+val scoreMatrix = Array.fill[Double](m * num)(Double.MinValue)
+val pq = new BoundedPriorityQueue[(Int, 
Double)](num)(Ordering.by(_._2))
+
+val ratings = srcFactors.transpose.multiply(dstFactors)
+var i = 0
+var j = 0
+while (i < m) {
+  var k = 0
+  while (k < n) {
+pq += dstIds(k) -> ratings(i, k)
+k += 1
+  }
+  var size = pq.size
+  while(size > 0) {
--- End diff --

Do you mean add an isEmpty method for PriorityQueue? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2017-07-13 Thread mpjlu
GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/18624

[SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by gemm with about 
50% performance improvement

## What changes were proposed in this pull request?

In Spark 2.2, we have optimized ALS recommendForAll, which uses a 
handwriting matrix multiplication, and get the topK items for each matrix. The 
method effectively reduce the GC problem. However, Native BLAS GEMM, like Intel 
MKL, and OpenBLAS, the performance of matrix multiplication is about 10X 
comparing with handwriting method.

I have rewritten the code of recommendForAll with GEMM, and got about 50% 
improvement comparing with the master recommendForAll method.

The key point of this optimization:
1), use GEMM to replace hand-written matrix multiplication. 
2), Use matrix to keep temp result, largely reduce GC and computing time.  
The master method create many small objects, which causes using GEMM directly 
cannot get good performance.
3), Use sort and merge to get the topK items, which don't need to call 
priority queue two times.

Test Result:
479818 users, 13727 products, rank = 10, topK = 20.
3 workers, each with 35 cores. Native BLAS is Intel MKL.
Block Size:   1000===2000===4000===8000 
Master Method:40s-39.4s-39.5s39.1s
This Method 26.5s---25.9s26s-27.1s

Performance Improvement: (OldTime - NewTime)/NewTime = about 50%

 

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark OptimizeAlsByGEMM

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18624.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18624


commit 5ca3fd1d8e9d5fa6ae0daf24c83e72ef96045104
Author: Peng Meng <peng.m...@intel.com>
Date:   2017-07-13T07:33:45Z

add poll for PriorityQueue

commit 215efc3114012ebc19af984a3d0172aecb22f255
Author: Peng Meng <peng.m...@intel.com>
Date:   2017-07-13T10:39:44Z

test pass

commit 7c587f4070c0951425d1686429816feb712c0273
Author: Peng Meng <peng.m...@intel.com>
Date:   2017-07-13T11:08:41Z

fix bug

commit e8a40edb25db8a6ecdfe67bd54f38071e7a99781
Author: Peng Meng <peng.m...@intel.com>
Date:   2017-07-13T11:56:50Z

code style change




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18620: [MINOR][ML][MLLIB] add poll function for BoundedPriority...

2017-07-13 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18620
  
Ok, thanks @srowen .
I will create a JIRA, and show the usage and performance comparing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   >