[GitHub] spark pull request #20707: [SPARK-21209][MLLLIB] Implement Incremental PCA a...
GitHub user sandecho closed the pull request at:

https://github.com/apache/spark/pull/20707

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20707: [SPARK-21209][MLLLIB] Implement Incremental PCA a...
GitHub user sandecho reopened a pull request:

https://github.com/apache/spark/pull/20707

[SPARK-21209][MLLLIB] Implement Incremental PCA algorithm

## What changes were proposed in this pull request?

A new feature, the Incremental Principal Component Analysis (IPCA) algorithm, is proposed. It divides the incoming data into batches and computes the PCA of each batch to generate the principal components of the entire data.

## How was this patch tested?

Unit Testing

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20707.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #20707

commit 6bb22961c0c9df1a1f22e9491894895b297f5288
Author: Sameer Agarwal
Date: 2018-01-11T23:23:17Z

Preparing development version 2.3.1-SNAPSHOT

commit 55695c7127cb2f357dfdf677cab4d21fc840aa3d
Author: WeichenXu
Date: 2018-01-12T00:20:30Z

[SPARK-23008][ML] OnehotEncoderEstimator python API

## What changes were proposed in this pull request?

OnehotEncoderEstimator python API.

## How was this patch tested?

doctest

Author: WeichenXu

Closes #20209 from WeichenXu123/ohe_py.

(cherry picked from commit b5042d75c2faa5f15bc1e160d75f06dfdd6eea37)
Signed-off-by: Joseph K. Bradley

commit 3ae3e1bb71aa88be1c963b4416986ef679d7c8a2
Author: ho3rexqj
Date: 2018-01-12T07:27:00Z

[SPARK-22986][CORE] Use a cache to avoid instantiating multiple instances of broadcast variable values

When resources happen to be constrained on an executor the first time a broadcast variable is instantiated it is persisted to disk by the BlockManager. Consequently, every subsequent call to TorrentBroadcast::readBroadcastBlock from other instances of that broadcast variable spawns another instance of the underlying value. That is, broadcast variables are spawned once per executor **unless** memory is constrained, in which case every instance of a broadcast variable is provided with a unique copy of the underlying value.

This patch fixes the above by explicitly caching the underlying values using weak references in a ReferenceMap.

Author: ho3rexqj

Closes #20183 from ho3rexqj/fix/cache-broadcast-values.

(cherry picked from commit cbe7c6fbf9dc2fc422b93b3644c40d449a869eea)
Signed-off-by: Wenchen Fan

commit d512d873b3f445845bd113272d7158388427f8a6
Author: WeichenXu
Date: 2018-01-12T09:27:02Z

[SPARK-23008][ML][FOLLOW-UP] mark OneHotEncoder python API deprecated

## What changes were proposed in this pull request?

mark OneHotEncoder python API deprecated

## How was this patch tested?

N/A

Author: WeichenXu

Closes #20241 from WeichenXu123/mark_ohe_deprecated.

(cherry picked from commit a7d98d53ceaf69cabaecc6c9113f17438c4e61f6)
Signed-off-by: Nick Pentreath

commit 6152da3893a05b3f8dc0f13895af9be9548e5895
Author: Marco Gaido
Date: 2018-01-12T10:04:44Z

[SPARK-23025][SQL] Support Null type in scala reflection

## What changes were proposed in this pull request?

Add support for `Null` type in the `schemaFor` method for Scala reflection.

## How was this patch tested?

Added UT

Author: Marco Gaido

Closes #20219 from mgaido91/SPARK-23025.

(cherry picked from commit 505086806997b4331d4a8c2fc5e08345d869a23c)
Signed-off-by: gatorsmile

commit db27a93652780f234f3c5fe750ef07bc5525d177
Author: Dongjoon Hyun
Date: 2018-01-12T18:18:42Z

[MINOR][BUILD] Fix Java linter errors

## What changes were proposed in this pull request?

This PR cleans up the java-lint errors (for v2.3.0-rc1 tag). Hopefully, this will be the final one.

```
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/unsafe/memory/HeapMemoryAllocator.java:[85] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/launcher/InProcessAppHandle.java:[20,8] (imports) UnusedImports: Unused import - java.io.IOException.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java:[41,9] (modifier) ModifierOrder: 'private' modifier out of order with the JLS suggestions.
[ERROR]
```
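The SPARK-22986 commit message above describes caching broadcast values through weak references so that repeated reads share one instance. As a rough illustration of that idea only, here is a minimal Python sketch. It is not Spark's code: `BroadcastValue`, `ValueCache`, and the `loader` callback are hypothetical names, Python's `weakref.WeakValueDictionary` stands in for the `ReferenceMap` the commit mentions, and thread safety is ignored.

```python
import weakref


class BroadcastValue:
    """Hypothetical stand-in for a deserialized broadcast value.

    A small wrapper class is used because plain lists and dicts
    cannot be the target of a weak reference in Python.
    """

    def __init__(self, payload):
        self.payload = payload


class ValueCache:
    """Hypothetical cache that hands out one shared instance per id.

    Entries are held weakly, so a value can be garbage-collected once
    no caller references it; the next request then reloads it.
    """

    def __init__(self, loader):
        self._loader = loader  # assumed callback: id -> BroadcastValue
        self._values = weakref.WeakValueDictionary()
        self.loads = 0  # counts how often the loader actually ran

    def get(self, broadcast_id):
        value = self._values.get(broadcast_id)
        if value is None:
            # Cache miss: load once and remember the instance weakly.
            value = self._loader(broadcast_id)
            self.loads += 1
            self._values[broadcast_id] = value
        return value


cache = ValueCache(lambda bid: BroadcastValue([bid] * 3))
first = cache.get(1)
second = cache.get(1)
```

While `first` is still referenced, the second call returns the same object and the loader runs only once, which is the sharing behaviour the commit message aims for.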
[GitHub] spark pull request #20707: [SPARK-21209][MLLLIB] Implement Incremental PCA a...
GitHub user sandecho closed the pull request at:

https://github.com/apache/spark/pull/20707
[GitHub] spark pull request #20707: [SPARK-21209][MLLLIB] Implement Incremental PCA a...
GitHub user sandecho opened a pull request:

https://github.com/apache/spark/pull/20707

[SPARK-21209][MLLLIB] Implement Incremental PCA algorithm

## What changes were proposed in this pull request?

A new feature, the Incremental Principal Component Analysis (IPCA) algorithm, is proposed. It divides the incoming data into batches and computes the PCA of each batch to generate the principal components of the entire data.

## How was this patch tested?

Unit Testing

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20707.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #20707

commit 6bb22961c0c9df1a1f22e9491894895b297f5288
Author: Sameer Agarwal
Date: 2018-01-11T23:23:17Z

Preparing development version 2.3.1-SNAPSHOT

commit 55695c7127cb2f357dfdf677cab4d21fc840aa3d
Author: WeichenXu
Date: 2018-01-12T00:20:30Z

[SPARK-23008][ML] OnehotEncoderEstimator python API

## What changes were proposed in this pull request?

OnehotEncoderEstimator python API.

## How was this patch tested?

doctest

Author: WeichenXu

Closes #20209 from WeichenXu123/ohe_py.

(cherry picked from commit b5042d75c2faa5f15bc1e160d75f06dfdd6eea37)
Signed-off-by: Joseph K. Bradley

commit 3ae3e1bb71aa88be1c963b4416986ef679d7c8a2
Author: ho3rexqj
Date: 2018-01-12T07:27:00Z

[SPARK-22986][CORE] Use a cache to avoid instantiating multiple instances of broadcast variable values

When resources happen to be constrained on an executor the first time a broadcast variable is instantiated it is persisted to disk by the BlockManager. Consequently, every subsequent call to TorrentBroadcast::readBroadcastBlock from other instances of that broadcast variable spawns another instance of the underlying value. That is, broadcast variables are spawned once per executor **unless** memory is constrained, in which case every instance of a broadcast variable is provided with a unique copy of the underlying value.

This patch fixes the above by explicitly caching the underlying values using weak references in a ReferenceMap.

Author: ho3rexqj

Closes #20183 from ho3rexqj/fix/cache-broadcast-values.

(cherry picked from commit cbe7c6fbf9dc2fc422b93b3644c40d449a869eea)
Signed-off-by: Wenchen Fan

commit d512d873b3f445845bd113272d7158388427f8a6
Author: WeichenXu
Date: 2018-01-12T09:27:02Z

[SPARK-23008][ML][FOLLOW-UP] mark OneHotEncoder python API deprecated

## What changes were proposed in this pull request?

mark OneHotEncoder python API deprecated

## How was this patch tested?

N/A

Author: WeichenXu

Closes #20241 from WeichenXu123/mark_ohe_deprecated.

(cherry picked from commit a7d98d53ceaf69cabaecc6c9113f17438c4e61f6)
Signed-off-by: Nick Pentreath

commit 6152da3893a05b3f8dc0f13895af9be9548e5895
Author: Marco Gaido
Date: 2018-01-12T10:04:44Z

[SPARK-23025][SQL] Support Null type in scala reflection

## What changes were proposed in this pull request?

Add support for `Null` type in the `schemaFor` method for Scala reflection.

## How was this patch tested?

Added UT

Author: Marco Gaido

Closes #20219 from mgaido91/SPARK-23025.

(cherry picked from commit 505086806997b4331d4a8c2fc5e08345d869a23c)
Signed-off-by: gatorsmile

commit db27a93652780f234f3c5fe750ef07bc5525d177
Author: Dongjoon Hyun
Date: 2018-01-12T18:18:42Z

[MINOR][BUILD] Fix Java linter errors

## What changes were proposed in this pull request?

This PR cleans up the java-lint errors (for v2.3.0-rc1 tag). Hopefully, this will be the final one.

```
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/unsafe/memory/HeapMemoryAllocator.java:[85] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/launcher/InProcessAppHandle.java:[20,8] (imports) UnusedImports: Unused import - java.io.IOException.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java:[41,9] (modifier) ModifierOrder:
```
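The pull request description above says only that IPCA processes the data batch by batch to obtain the principal components of the whole data set; the patch itself is not shown here. As a hedged illustration of the batch-wise idea, the following pure-Python sketch uses a different, simpler technique than whatever the PR implements: it merges per-batch summaries (count, mean, scatter matrix) with the standard pairwise update, then runs power iteration on the merged scatter matrix to get the first principal component. All function names are hypothetical, and the fixed all-ones start vector is an assumption that fails if it happens to be orthogonal to the dominant direction.

```python
import math


def batch_stats(batch):
    """Return (n, mean, scatter) for one non-empty batch of equal-length rows."""
    n = len(batch)
    d = len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    scatter = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in batch)
                for j in range(d)] for i in range(d)]
    return n, mean, scatter


def merge_stats(a, b):
    """Combine two (n, mean, scatter) summaries without revisiting the rows."""
    na, mean_a, sa = a
    nb, mean_b, sb = b
    n = na + nb
    d = len(mean_a)
    delta = [mean_b[j] - mean_a[j] for j in range(d)]
    mean = [mean_a[j] + delta[j] * nb / n for j in range(d)]
    # Pairwise update: S = Sa + Sb + (na * nb / n) * delta * delta^T
    scatter = [[sa[i][j] + sb[i][j] + delta[i] * delta[j] * na * nb / n
                for j in range(d)] for i in range(d)]
    return n, mean, scatter


def first_component(scatter, iterations=200):
    """Power iteration for the dominant eigenvector of the scatter matrix."""
    d = len(scatter)
    v = [1.0] * d  # assumed start vector, see the note above
    for _ in range(iterations):
        w = [sum(scatter[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def incremental_first_component(rows, batch_size):
    """Fold batches of rows into one summary, then extract the first component."""
    stats = None
    for start in range(0, len(rows), batch_size):
        current = batch_stats(rows[start:start + batch_size])
        stats = current if stats is None else merge_stats(stats, current)
    return first_component(stats[2])


# Points on the line y = 2x, processed three rows at a time.
rows = [[float(x), 2.0 * x] for x in range(10)]
component = incremental_first_component(rows, 3)
```

Because only the running summary is kept between batches, memory use depends on the number of features rather than the number of rows, which is the usual motivation for an incremental formulation.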