[GitHub] spark pull request #14922: [WIP][SPARK-17175][ML][MLLib] Add a expert formul...

2016-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14922#discussion_r77352634 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -295,6 +295,13 @@ class LogisticRegression @Since

[GitHub] spark pull request #14922: [WIP][SPARK-17175][ML][MLLib] Add a expert formul...

2016-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14922#discussion_r77354917 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala --- @@ -405,5 +405,9 @@ private[ml] trait HasAggregationDepth

[GitHub] spark pull request #14923: [SPARK-17363][ML][MLLib] fix MultivariantOnlineSu...

2016-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14923#discussion_r77355534 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala --- @@ -231,9 +231,9 @@ class

[GitHub] spark pull request #14922: [WIP][SPARK-17175][ML][MLLib] Add a expert formul...

2016-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14922#discussion_r77356253 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -295,6 +295,13 @@ class LogisticRegression @Since

[GitHub] spark pull request #14922: [WIP][SPARK-17175][ML][MLLib] Add a expert formul...

2016-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14922#discussion_r77358835 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -295,6 +295,13 @@ class LogisticRegression @Since

[GitHub] spark pull request #14923: [SPARK-17363][ML][MLLib] fix MultivariantOnlineSu...

2016-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14923#discussion_r77361651 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala --- @@ -231,9 +231,9 @@ class

[GitHub] spark pull request #14922: [WIP][SPARK-17175][ML][MLLib] Add a expert formul...

2016-09-02 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14922#discussion_r77362198 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala --- @@ -405,5 +405,9 @@ private[ml] trait HasAggregationDepth

[GitHub] spark pull request #14950: [SPARK-17390][ML][MLLib] Optimize MultivariantOnl...

2016-09-03 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14950 [SPARK-17390][ML][MLLib] Optimize MultivariantOnlineSummerizer by making the summarized target configurable ## What changes were proposed in this pull request? add a mask parameter

[GitHub] spark issue #14950: [SPARK-17390][ML][MLLib] Optimize MultivariantOnlineSumm...

2016-09-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14950 @srowen not only CPU cost: if the data dimension is big, the serialization cost will also be big (such as https://github.com/apache/spark/pull/14109), and computing all targets seems improper if we may add

[GitHub] spark pull request #15045: [Spark Core][MINOR] fix partitionBy error message

2016-09-10 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/15045 [Spark Core][MINOR] fix partitionBy error message ## What changes were proposed in this pull request? In order to avoid confusing users, it is better to change

[GitHub] spark issue #15045: [Spark Core][MINOR] fix partitionBy error message

2016-09-10 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15045 oh, there are 5 similar messages. I checked the others; they may be set to the default one, so I updated their message to "Specified or default partitioner..." but

[GitHub] spark issue #14898: [SPARK-16499][ML][MLLib] optimize ann algorithm where us...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14898 @srowen but here, for `delta -= target` the breeze lib will call BLAS, and it will usually be ~10x faster than a plain loop because it uses SIMD instructions. Here is some performance information
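The claim above (a vectorized `delta -= target` beating an element-wise loop) can be illustrated outside of breeze/BLAS with a small NumPy sketch; the array sizes and timing harness here are illustrative assumptions, not the Spark benchmark referenced in the comment:

```python
import time
import numpy as np

n = 200_000
rng = np.random.default_rng(0)
delta = rng.random(n)
target = rng.random(n)

# Vectorized in-place subtraction: NumPy dispatches to optimized native
# (SIMD-capable) code, analogous to breeze delegating `delta -= target` to BLAS.
d1 = delta.copy()
t0 = time.perf_counter()
d1 -= target
vec_time = time.perf_counter() - t0

# The same update as an explicit element-wise Python loop, typically far slower.
d2 = delta.copy()
t0 = time.perf_counter()
for i in range(n):
    d2[i] -= target[i]
loop_time = time.perf_counter() - t0

print(f"vectorized: {vec_time:.5f}s, loop: {loop_time:.5f}s")
```

On typical hardware the vectorized form wins by one to two orders of magnitude, which is consistent with the ~10x figure quoted for breeze.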

[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/15051 [SPARK-17499][ML][MLLib] make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier ## What changes were proposed in this pull request? update several

[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15045 jenkins test please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and

[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15051#discussion_r78307552 --- Diff: R/pkg/R/mllib.R --- @@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"), #' } #

[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15051#discussion_r78308116 --- Diff: R/pkg/R/mllib.R --- @@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"), #' } #

[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15051#discussion_r78309230 --- Diff: R/pkg/R/mllib.R --- @@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"), #' } #

[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15045 Jenkins, test this please.

[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15051#discussion_r78315763 --- Diff: R/pkg/R/mllib.R --- @@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"), #' } #

[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15051#discussion_r78315909 --- Diff: R/pkg/R/mllib.R --- @@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"), #' } #

[GitHub] spark pull request #15060: [SPARK-17507][ML][MLLib] check weight vector size...

2016-09-12 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/15060 [SPARK-17507][ML][MLLib] check weight vector size in ANN ## What changes were proposed in this pull request? as the TODO described, check weight vector size and if wrong throw

[GitHub] spark issue #15059: [SPARK-17506][SQL] Improve the check double values equal...

2016-09-12 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15059 but relTol is defined in mllib and sql does not reference it; it seems better to move it to the spark-core project?

[GitHub] spark pull request #14922: [WIP][SPARK-17175][ML][MLLib] Add a expert formul...

2016-09-13 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/14922

[GitHub] spark pull request #14950: [WIP][SPARK-17390][ML][MLLib] Optimize Multivaria...

2016-09-13 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/14950

[GitHub] spark issue #14950: [WIP][SPARK-17390][ML][MLLib] Optimize MultivariantOnlin...

2016-09-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14950 when the benchmark is done I will reopen it.

[GitHub] spark issue #14922: [WIP][SPARK-17175][ML][MLLib] Add a expert formula to ag...

2016-09-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14922 all right. when refining & benchmarking is done I will reopen it.

[GitHub] spark issue #15060: [SPARK-17507][ML][MLLib] check weight vector size in ANN

2016-09-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15060 @srowen the `weight` by default will be randomly generated and will automatically match the size; only when it is specified by the user does it need this check... now the modification here seems

[GitHub] spark pull request #15060: [SPARK-17507][ML][MLLib] check weight vector size...

2016-09-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15060#discussion_r78749771 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala --- @@ -235,6 +235,7 @@ class

[GitHub] spark pull request #15097: [SPARK-17540][SparkR][Spark Core] fix SparkR arra...

2016-09-14 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/15097 [SPARK-17540][SparkR][Spark Core] fix SparkR array serde type problem when length == 0 ## What changes were proposed in this pull request? fix SparkR array serde type problem when

[GitHub] spark issue #15097: [SPARK-17540][SparkR][Spark Core] fix SparkR array serde...

2016-09-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15097 @shivaram Oh... I found this way still has problems: `Array[Nothing]` in Scala, after compilation with type erasure, turns into type `Ljava.lang.Object`, but primitive-type Array

[GitHub] spark issue #14851: [SPARK-17281][ML][MLLib] Add treeAggregateDepth paramete...

2016-09-15 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14851 cc @srowen thanks!

[GitHub] spark issue #15051: [SPARK-17499][SparkR][ML][MLLib] make the default params...

2016-09-17 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15051 @felixcheung Now I added some tests using the default parameters and compared the output predictions with the results generated by scala-side code. thanks!

[GitHub] spark pull request #15051: [SPARK-17499][SparkR][ML][MLLib] make the default...

2016-09-17 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15051#discussion_r79295910 --- Diff: R/pkg/R/mllib.R --- @@ -694,8 +694,14 @@ setMethod("predict", signature(object = "KMeansModel"), #' } #

[GitHub] spark issue #15051: [SPARK-17499][SparkR][ML][MLLib] make the default params...

2016-09-17 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15051 @felixcheung yeah, in fact 0x7FFF is not ideal because it is itself also a valid seed. And there is another problem: in Scala, seed is a `Long` type, but on the R side, it seems there is no

[GitHub] spark pull request #15051: [SPARK-17499][SparkR][ML][MLLib] make the default...

2016-09-17 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15051#discussion_r79297392 --- Diff: R/pkg/R/mllib.R --- @@ -694,8 +694,14 @@ setMethod("predict", signature(object = "KMeansModel"), #' } #

[GitHub] spark pull request #15051: [SPARK-17499][SparkR][ML][MLLib] make the default...

2016-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15051#discussion_r79299954 --- Diff: R/pkg/R/mllib.R --- @@ -694,8 +694,14 @@ setMethod("predict", signature(object = "KMeansModel"), #' } #

[GitHub] spark issue #15051: [SPARK-17499][SparkR][ML][MLLib] make the default params...

2016-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15051 @felixcheung Now I updated the scala-side wrapper arg types as follows: layers: Array[Int], seed: String; for the seed default value I currently use "", not NUL

[GitHub] spark issue #15051: [SPARK-17499][SparkR][ML][MLLib] make the default params...

2016-09-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15051 @felixcheung negative test added, thanks!

[GitHub] spark pull request #15051: [SPARK-17499][SparkR][ML][MLLib] make the default...

2016-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15051#discussion_r79361373 --- Diff: R/pkg/R/mllib.R --- @@ -695,17 +695,15 @@ setMethod("predict", signature(object = "KMeansModel"), #' @n

[GitHub] spark pull request #15051: [SPARK-17499][SparkR][ML][MLLib] make the default...

2016-09-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/15051#discussion_r79362082 --- Diff: R/pkg/R/mllib.R --- @@ -695,17 +695,15 @@ setMethod("predict", signature(object = "KMeansModel"), #' @n

[GitHub] spark issue #14852: [WIP][SPARK-17138][ML][MLib] Add Python API for multinom...

2016-09-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14852 @sethah OK. I will study the unified scala API for LOR and update the python-side api PR ASAP. Thanks!

[GitHub] spark issue #14852: [SPARK-17138][ML][MLib] Add Python API for multinomial l...

2016-09-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14852 cc @sethah @yanboliang thanks!

[GitHub] spark issue #14852: [SPARK-17138][ML][MLib] Add Python API for multinomial l...

2016-09-23 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14852 Done. thanks! @yanboliang

[GitHub] spark issue #14852: [SPARK-17138][ML][MLib] Add Python API for multinomial l...

2016-09-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14852 Done. thanks for careful review :) @sethah

[GitHub] spark issue #15097: [SPARK-17540][SparkR][Spark Core] fix SparkR array serde...

2016-09-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15097 @felixcheung I thought out two ways for this problem; see the PR description. Which is better in your opinion? Or does a better solution exist?

[GitHub] spark pull request #14203: update python dataframe.drop

2016-07-14 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14203 update python dataframe.drop ## What changes were proposed in this pull request? Make the `dataframe.drop` API in Python support multi-column parameters, so that it is the same with
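The multi-column `drop` proposal can be sketched with Python varargs; `MiniFrame` below is a hypothetical stand-in, not the pyspark `DataFrame` implementation:

```python
# Hedged sketch of the varargs pattern proposed for DataFrame.drop: accept
# one or more column names via *cols, mirroring Scala's drop(colNames: String*).
# Stand-alone mock, not the pyspark code.
class MiniFrame:
    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, *cols):
        """Drop any of the given columns that exist; unknown names are ignored."""
        to_drop = set(cols)
        return MiniFrame([c for c in self.columns if c not in to_drop])

df = MiniFrame(["name", "age", "height"])
print(df.drop("age").columns)            # ['name', 'height']
print(df.drop("age", "height").columns)  # ['name']
```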

[GitHub] spark pull request #14203: [SPARK-16546][SQL][PySpark] update python datafra...

2016-07-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14203#discussion_r70913944 --- Diff: python/pyspark/sql/dataframe.py --- @@ -1416,13 +1416,25 @@ def drop(self, col): >>> df.join(df2, df.name ==

[GitHub] spark pull request #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary mi...

2016-07-14 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14216 [SPARK-16561][MLLib] fix multivarOnlineSummary min/max bug ## What changes were proposed in this pull request? add a member vector `cnnz` to count each dimension's non-zero values
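A minimal sketch of the idea behind the fix, assuming the role of `cnnz` is to detect implicit zeros from sparse input (class and method names here are illustrative, not the actual `MultivariateOnlineSummarizer` API):

```python
# Track per-dimension non-zero counts (cnnz) so that implicit zeros from
# sparse vectors are folded into min/max at the end.
class OnlineMinMax:
    def __init__(self, dim):
        self.dim = dim
        self.count = 0
        self.cnnz = [0] * dim                  # non-zero count per dimension
        self.cur_min = [float("inf")] * dim
        self.cur_max = [float("-inf")] * dim

    def add(self, indices, values):
        """Add one sparse vector given as parallel (indices, values) lists."""
        self.count += 1
        for i, v in zip(indices, values):
            self.cnnz[i] += 1
            self.cur_min[i] = min(self.cur_min[i], v)
            self.cur_max[i] = max(self.cur_max[i], v)

    def result(self):
        mins, maxs = list(self.cur_min), list(self.cur_max)
        for i in range(self.dim):
            # If some vector left dimension i implicit, 0 is also a value seen.
            if self.cnnz[i] < self.count:
                mins[i] = min(mins[i], 0.0)
                maxs[i] = max(maxs[i], 0.0)
        return mins, maxs

s = OnlineMinMax(2)
s.add([0, 1], [3.0, -1.0])
s.add([0], [5.0])          # dimension 1 is an implicit zero here
mins, maxs = s.result()
# mins == [3.0, -1.0]; maxs == [5.0, 0.0]
# (without cnnz, dimension 1's max would wrongly stay at -1.0)
```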

[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...

2016-07-15 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14216 @srowen OK. I'll fix the var names first: nnz => weightSum, weightSum => totalWeightSum, cnnz => nnz. Is that right?

[GitHub] spark pull request #14220: [SPARK-16568][SQL][Documentation] update sql prog...

2016-07-15 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14220 [SPARK-16568][SQL][Documentation] update sql programming guide refreshTable API in python code ## What changes were proposed in this pull request? update `refreshTable` API in python

[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...

2016-07-15 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14216 @srowen OK, var names updated. And does the 'fixing' of numNonzero you mentioned mean the number of input vectors whose weight > 0?

[GitHub] spark pull request #14238: [MINOR][TYPO] fix fininsh typo

2016-07-17 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14238 [MINOR][TYPO] fix fininsh typo ## What changes were proposed in this pull request? fininsh => finish ## How was this patch tested? (Please explain how this patch

[GitHub] spark pull request #14122: [SPARK-16470][ML][Optimizer] Check linear regress...

2016-07-17 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14122#discussion_r71083700 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -327,6 +327,11 @@ class LinearRegression @Since("

[GitHub] spark pull request #14246: [SPARK-16600][MLLib] fix some latex formula synta...

2016-07-18 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14246 [SPARK-16600][MLLib] fix some latex formula syntax error ## What changes were proposed in this pull request? `\partial\x` ==> `\partial x` `har{x_i}` ==> `h

[GitHub] spark issue #14220: [SPARK-16568][SQL][Documentation] update sql programming...

2016-07-18 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14220 cc @liancheng Thanks!

[GitHub] spark pull request #14265: [PySpark] add picklable SparseMatrix

2016-07-19 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14265 [PySpark] add picklable SparseMatrix ## What changes were proposed in this pull request? add a `SparseMatrix` class which supports the pickler. ## How was this patch tested

[GitHub] spark issue #14220: [SPARK-16568][SQL][Documentation] update sql programming...

2016-07-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14220 cc @rxin Thanks!

[GitHub] spark pull request #14276: [SPARK-16638][ML][Optimizer] fix L2 reg computati...

2016-07-19 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14276 [SPARK-16638][ML][Optimizer] fix L2 reg computation in linearRegression when standarlization is false ## What changes were proposed in this pull request? when `standardization

[GitHub] spark issue #14276: [SPARK-16638][ML][Optimizer] fix L2 reg computation in l...

2016-07-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14276 cc @srowen Thanks!

[GitHub] spark pull request #14276: [WIP][SPARK-16638][ML][Optimizer] fix L2 reg comp...

2016-07-19 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/14276

[GitHub] spark issue #14276: [WIP][SPARK-16638][ML][Optimizer] fix L2 reg computation...

2016-07-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14276 @srowen I re-thought the code and maybe my previous idea was wrong. The intention of the author may be to use w[i] / featuresStd[i] to reduce the penalty on large-scale dimensions (because these
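The effect described (dividing by `featuresStd` to shrink the penalty on large-scale dimensions) can be sketched numerically; `l2_penalty` is an illustrative helper, not the literal Spark loss code:

```python
import numpy as np

def l2_penalty(w, features_std, reg_param, standardize):
    """Sketch of the idea discussed above: when standardization is off,
    penalize w[i] / std[i] instead of w[i], so that dimensions with a
    large feature scale are penalized less. Not the actual Spark code."""
    w = np.asarray(w, dtype=float)
    std = np.asarray(features_std, dtype=float)
    if standardize:
        return 0.5 * reg_param * np.sum(w ** 2)
    return 0.5 * reg_param * np.sum((w / std) ** 2)

w = [2.0, 2.0]
std = [1.0, 10.0]          # second feature has a much larger scale
p_std = l2_penalty(w, std, reg_param=1.0, standardize=True)
p_raw = l2_penalty(w, std, reg_param=1.0, standardize=False)
print(p_std, p_raw)   # 4.0 vs 2.02: the large-scale dim contributes far less
```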

[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...

2016-07-20 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14216 @srowen Now I added testcases; I test 3 cases, the same as the example cases I wrote in [SPARK-16561], thanks!

[GitHub] spark pull request #14286: [SPARK-16653][ML][Optimizer] update ANN convergen...

2016-07-20 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14286 [SPARK-16653][ML][Optimizer] update ANN convergence tolerance param default to 1e-6 ## What changes were proposed in this pull request? replace ANN convergence tolerance param

[GitHub] spark pull request #14293: [GIT] add pydev & Rstudio project file to gitigno...

2016-07-20 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14293 [GIT] add pydev & Rstudio project file to gitignore list ## What changes were proposed in this pull request? Add PyDev & RStudio project files to the gitignore list; I think the

[GitHub] spark pull request #13275: [SPARK-15499][PySpark][Tests] Add python testsuit...

2016-07-20 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/13275

[GitHub] spark issue #14265: [PySpark] add picklable SparseMatrix in pyspark.ml.commo...

2016-07-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14265 @srowen I checked ml.python.MLSerde and it supports the SparseMatrix pickler, and on the Python side the SparseMatrix constructor also matches the pickler. So I think the `_picklable_classes

[GitHub] spark issue #14293: [GIT] add pydev & Rstudio project file to gitignore list

2016-07-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14293 I use the PyDev IDE to edit Python code and it generates `.pydevproject`, and the RStudio IDE to edit R code generates `*.Rproj`; these are just project settings files used by the IDEs, like `.idea

[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...

2016-07-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14216 @srowen several minor modifications done.

[GitHub] spark issue #14265: [PySpark] add picklable SparseMatrix in pyspark.ml.commo...

2016-07-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14265 @srowen I guess the `_picklable_classes` list in `ml.linalg.common` was copied from `mllib.linalg.common`, so it omitted `SparseMatrix`, which was added later.

[GitHub] spark pull request #14301: [SPARK-16662][PySpark][SQL] update HiveContext wa...

2016-07-21 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14301 [SPARK-16662][PySpark][SQL] update HiveContext warning ## What changes were proposed in this pull request? move the `HiveContext` deprecation warning printing statement into

[GitHub] spark issue #14265: [PySpark] add picklable SparseMatrix in pyspark.ml.commo...

2016-07-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14265 cc @rxin Thanks!

[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...

2016-07-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14216 @srowen yeah I have pushed, "some minor update" https://github.com/apache/spark/pull/14216/commits/362074187d8845eeb40452eceec10f7e8ad805df

[GitHub] spark issue #14265: [PySpark] add picklable SparseMatrix in pyspark.ml.commo...

2016-07-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14265 cc @jkbradley Thanks!

[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...

2016-07-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14216 @srowen Oh, I missed your comment about the loop brace; it's added now, thanks!

[GitHub] spark issue #14326: [SPARK-3181] [ML] Implement RobustRegression with huber ...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14326 @yanboliang I went through the code and there are several problems to solve: the robust regression has a parameter `sigma` which must be > 0, so it is a bound optimization prob

[GitHub] spark issue #14265: [PySpark] add picklable SparseMatrix in pyspark.ml.commo...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14265 @srowen We can check python/ml/tests.py, the `VectorTests.test_serialize` function; it contains a test for `SparseMatrix` serialization/deserialization, so we can confirm that this works

[GitHub] spark pull request #14333: [SPARK-16696][ML][MLLib] unused broadcast variabl...

2016-07-24 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14333 [SPARK-16696][ML][MLLib] unused broadcast variables do destroy call to release memory in time ## What changes were proposed in this pull request? update unused broadcast in KMeans

[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-24 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14335 [SPARK-16697][ML][MLLib] improve LDA submitMiniBatch method to avoid redundant RDD computation ## What changes were proposed in this pull request? In `LDAOptimizer.submitMiniBatch

[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen The `bcNewCenters` in `KMeans` has a problem. Checking the code logic in detail, we can find that in each loop it should destroy the broadcast var `bcNewCenters` generated in

[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72003428 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72003530 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72003627 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen I checked the code around the KMeans `bcNewCenters` again; if we want to make sure RDD recovery will succeed in any unexpected case, we have to keep all the `bcNewCenters

[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72013619 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72014278 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen `KMeans.initKMeansParallel` already implements the pattern "persist current step RDD, and unpersist previous one", but I think a persisted RDD can also break down becau
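For reference, the "persist current, unpersist previous" pattern mentioned here looks roughly like this (illustrative names, not the exact `initKMeansParallel` code):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Each step caches the newly derived RDD, forces it with an action, and only
// then unpersists the previous step's copy. If a cached partition is later
// evicted, Spark recomputes it from lineage, which is the recovery concern
// raised in the comment.
def iterate(costs0: RDD[Double], steps: Int)
           (update: RDD[Double] => RDD[Double]): RDD[Double] = {
  var costs = costs0
  for (_ <- 0 until steps) {
    val preCosts = costs
    costs = update(preCosts).persist(StorageLevel.MEMORY_AND_DISK)
    costs.count()              // materialize before releasing the old copy
    preCosts.unpersist(false)  // non-blocking unpersist
  }
  costs
}
```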

[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 yeah, but the `bcSyn0Global` in Word2Vec is a different case; it looks safe to destroy there, because in each loop iteration the RDD transformation which uses `bcSyn0Global` ends with a

[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen The SparkContext, by default, runs a cleaner in the background to release unreferenced RDDs/broadcasts. But I think we'd better release them ourselves becaus

[GitHub] spark issue #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBatch met...

2016-07-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14335 @srowen `stats.unpersist(false)` ==> `stats.unpersist()` updated. Is there anything else that needs updating? --- If your project is set up for it, you can reply to this email and h

[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen I checked the places where `RDD.persist` is referenced: AFTSurvivalRegression, LinearRegression, and LogisticRegression persist the input training RDD and unpersist it when `train` return

[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen yeah, the code logic here seems confusing, but I think it is right. Now I can explain it in a clearer way: in essence, the logic can be expressed as follows: A0->I1->

[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-30 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen I checked the code again; the problem I mentioned above, `But now I found another problem in BisectingKMeans: in line 191 there is an iteration which also needs this pattern "persist

[GitHub] spark pull request #14440: [SPARK-16835][ML] add training data unpersist han...

2016-08-01 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14440 [SPARK-16835][ML] add training data unpersist handling when throw exception [SPARK-16835][ML] add training data `unpersist` handling when throw exception ## What changes were
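A sketch of the kind of fix this PR title describes (names are illustrative, not the PR's actual diff): wrap training in try/finally so the cached input is released even if training throws.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Persist the input only if the caller has not already cached it, and
// guarantee the matching unpersist runs on both the success and the
// exception path.
def fit[T, M](instances: RDD[T], train: RDD[T] => M): M = {
  val handlePersistence = instances.getStorageLevel == StorageLevel.NONE
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    train(instances)
  } finally {
    if (handlePersistence) instances.unpersist()
  }
}
```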

[GitHub] spark pull request #14440: [SPARK-16835][ML] add training data unpersist han...

2016-08-01 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/14440

[GitHub] spark issue #14440: [SPARK-16835][ML] add training data unpersist handling w...

2016-08-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14440 sounds reasonable...

[GitHub] spark pull request #14483: [SPARK-16880][ML][MLLib] make ann training data p...

2016-08-03 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14483 [SPARK-16880][ML][MLLib] make ann training data persisted if needed ## What changes were proposed in this pull request? To make sure the ANN layer's input training data is persisted
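LBFGS iterates over the training data many times, so the input should be cached before optimization starts. A hedged sketch of the pattern the PR title describes, mirroring other LBFGS-based trainers in MLlib (names illustrative, not the ANN code itself):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.optimization.{Gradient, LBFGS, SquaredL2Updater}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Cache the data only if the caller has not already done so, then release
// it once optimization is finished.
def trainSketch(data: RDD[(Double, Vector)],
                gradient: Gradient,
                initialWeights: Vector): Vector = {
  val handlePersistence = data.getStorageLevel == StorageLevel.NONE
  if (handlePersistence) data.persist(StorageLevel.MEMORY_AND_DISK)
  val weights = new LBFGS(gradient, new SquaredL2Updater())
    .optimize(data, initialWeights)
  if (handlePersistence) data.unpersist()
  weights
}
```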

[GitHub] spark issue #14483: [SPARK-16880][ML][MLLib] make ann training data persiste...

2016-08-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14483 @srowen yeah, the other algorithms using LBFGS all have this pattern; only ANN missed it.

[GitHub] spark issue #14156: [SPARK-16499][ML][MLLib] improve ApplyInPlace function i...

2016-08-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14156 cc @srowen thanks!

[GitHub] spark issue #14156: [SPARK-16499][ML][MLLib] improve ApplyInPlace function i...

2016-08-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14156 @srowen The `:=` operator on a Breeze `DenseMatrix` (BDM) simply copies one BDM into another, and it is widely used in the Breeze source. E.g., we can check the `DenseMatrix.copy` function in Breeze: it first use
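A small example of the Breeze `:=` behavior being discussed (requires the Breeze library on the classpath):

```scala
import breeze.linalg.DenseMatrix

// `:=` assigns the right-hand matrix's contents into the left-hand matrix
// in place; no new matrix is allocated.
val a = DenseMatrix.zeros[Double](2, 2)
val b = DenseMatrix((1.0, 2.0), (3.0, 4.0))
a := b
assert(a == b)     // a now holds b's values
assert(!(a eq b))  // but a is still its own object, not an alias of b
```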

[GitHub] spark issue #14156: [SPARK-16499][ML][MLLib] improve ApplyInPlace function i...

2016-08-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14156 yeah, currently it seems to add a little overhead (it does a copy), but I think it will take advantage of Breeze optimizations in the future, e.g., SIMD instructions or something?
