[GitHub] spark pull request #20707: [SPARK-21209][MLLIB] Implement Incremental PCA a...

2018-03-01 Thread sandecho
GitHub user sandecho closed the pull request at:

https://github.com/apache/spark/pull/20707


---




[GitHub] spark pull request #20707: [SPARK-21209][MLLIB] Implement Incremental PCA a...

2018-03-01 Thread sandecho
GitHub user sandecho reopened a pull request:

https://github.com/apache/spark/pull/20707

[SPARK-21209][MLLIB] Implement Incremental PCA algorithm

## What changes were proposed in this pull request?

A new feature, the Incremental Principal Component Analysis (IPCA) algorithm, 
is proposed. It splits the incoming data into batches and computes the PCA of 
each batch incrementally to produce the principal components of the entire 
dataset.
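
To make the batch-wise idea concrete, here is a minimal NumPy sketch of an 
incremental PCA update. It is an illustration only, not the MLlib 
implementation in this PR: it assumes the rows are already centered, keeps a 
fixed number of components k, and the function and variable names are made up.

```python
import numpy as np

def incremental_pca(batches, k):
    """Simplified incremental PCA sketch (assumes rows are pre-centered).

    Maintains only the top-k right singular vectors and singular values and
    updates them one batch at a time, so the full dataset never has to be
    held in memory at once.
    """
    components, singular_values = None, None
    for batch in batches:                      # each batch: (n_i, d) array
        if components is None:
            stacked = batch
        else:
            # Previous information compressed as diag(S) @ V^T, stacked on
            # top of the new batch of rows.
            stacked = np.vstack([singular_values[:, None] * components, batch])
        _, s, vt = np.linalg.svd(stacked, full_matrices=False)
        # Truncating to k components at every step makes this an
        # approximation of the exact PCA of all rows seen so far.
        components, singular_values = vt[:k], s[:k]
    return components, singular_values

# Toy usage: stream a 1000 x 20 matrix in batches of 100 rows.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 20))
data -= data.mean(axis=0)                      # pre-center once for the sketch
pcs, svals = incremental_pca((data[i:i + 100] for i in range(0, 1000, 100)), k=5)
print(pcs.shape)                               # (5, 20)
```

Because only diag(S) @ V^T and the current batch are ever stacked, memory stays 
proportional to the batch size rather than to the full dataset.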

## How was this patch tested?
Unit Testing



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20707.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20707


commit 6bb22961c0c9df1a1f22e9491894895b297f5288
Author: Sameer Agarwal 
Date:   2018-01-11T23:23:17Z

Preparing development version 2.3.1-SNAPSHOT

commit 55695c7127cb2f357dfdf677cab4d21fc840aa3d
Author: WeichenXu 
Date:   2018-01-12T00:20:30Z

[SPARK-23008][ML] OneHotEncoderEstimator python API

## What changes were proposed in this pull request?

OneHotEncoderEstimator Python API.

## How was this patch tested?

doctest

Author: WeichenXu 

Closes #20209 from WeichenXu123/ohe_py.

(cherry picked from commit b5042d75c2faa5f15bc1e160d75f06dfdd6eea37)
Signed-off-by: Joseph K. Bradley 
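
As a side note, a minimal PySpark usage sketch of the estimator that commit 
exposes might look like the following (Spark 2.3 API; the toy data and column 
names are made up for illustration):

```python
from pyspark.ml.feature import OneHotEncoderEstimator
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ohe-example").getOrCreate()

# Hypothetical toy data: a column of category indices (e.g. from StringIndexer).
df = spark.createDataFrame([(0.0,), (1.0,), (2.0,), (1.0,)], ["categoryIndex"])

encoder = OneHotEncoderEstimator(inputCols=["categoryIndex"],
                                 outputCols=["categoryVec"])
model = encoder.fit(df)          # learns the number of categories during fit
model.transform(df).show()
```

The estimator/model split (fit, then transform) is what distinguishes 
OneHotEncoderEstimator from the older transformer-only OneHotEncoder: the 
number of categories is learned once during fitting and applied consistently 
at transform time.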

commit 3ae3e1bb71aa88be1c963b4416986ef679d7c8a2
Author: ho3rexqj 
Date:   2018-01-12T07:27:00Z

[SPARK-22986][CORE] Use a cache to avoid instantiating multiple instances 
of broadcast variable values

When resources happen to be constrained on an executor the first time a 
broadcast variable is instantiated, it is persisted to disk by the BlockManager. 
Consequently, every subsequent call to TorrentBroadcast::readBroadcastBlock 
from other instances of that broadcast variable spawns another instance of the 
underlying value. That is, broadcast variables are instantiated once per 
executor **unless** memory is constrained, in which case every instance of a 
broadcast variable is provided with a unique copy of the underlying value.

This patch fixes the above by explicitly caching the underlying values 
using weak references in a ReferenceMap.

Author: ho3rexqj 

Closes #20183 from ho3rexqj/fix/cache-broadcast-values.

(cherry picked from commit cbe7c6fbf9dc2fc422b93b3644c40d449a869eea)
Signed-off-by: Wenchen Fan 
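
The Spark change itself is in Scala (it caches values in an Apache Commons 
ReferenceMap), but the general pattern of caching behind weak references so 
entries can still be garbage-collected can be sketched in Python as follows; 
the class and names here are purely illustrative:

```python
import weakref

class BroadcastValue:
    """Stand-in for a deserialized broadcast value (illustrative only)."""
    def __init__(self, payload):
        self.payload = payload

# Keys map to values via weak references: as long as some task still holds
# the value, every lookup returns the *same* object instead of rebuilding it;
# once no strong references remain, the GC may reclaim the entry and the next
# lookup rebuilds it.
_cache = weakref.WeakValueDictionary()

def get_broadcast_value(broadcast_id, load_fn):
    value = _cache.get(broadcast_id)
    if value is None:
        value = load_fn()            # e.g. read blocks and deserialize
        _cache[broadcast_id] = value
    return value

v1 = get_broadcast_value("bc_0", lambda: BroadcastValue(list(range(5))))
v2 = get_broadcast_value("bc_0", lambda: BroadcastValue(list(range(5))))
assert v1 is v2                      # same instance, not a fresh copy
```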

commit d512d873b3f445845bd113272d7158388427f8a6
Author: WeichenXu 
Date:   2018-01-12T09:27:02Z

[SPARK-23008][ML][FOLLOW-UP] mark OneHotEncoder python API deprecated

## What changes were proposed in this pull request?

mark OneHotEncoder python API deprecated

## How was this patch tested?

N/A

Author: WeichenXu 

Closes #20241 from WeichenXu123/mark_ohe_deprecated.

(cherry picked from commit a7d98d53ceaf69cabaecc6c9113f17438c4e61f6)
Signed-off-by: Nick Pentreath 
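
For illustration only, deprecating a Python API along these lines usually 
amounts to emitting a DeprecationWarning from the constructor; the class below 
is a generic sketch, not the actual pyspark code:

```python
import warnings

class OneHotEncoderLike:
    """Illustrative only: how a deprecated Python class typically warns users."""
    def __init__(self, *args, **kwargs):
        warnings.warn(
            "OneHotEncoderLike is deprecated; use the estimator-based API instead.",
            DeprecationWarning,
            stacklevel=2,
        )
```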

commit 6152da3893a05b3f8dc0f13895af9be9548e5895
Author: Marco Gaido 
Date:   2018-01-12T10:04:44Z

[SPARK-23025][SQL] Support Null type in scala reflection

## What changes were proposed in this pull request?

Add support for `Null` type in the `schemaFor` method for Scala reflection.

## How was this patch tested?

Added UT

Author: Marco Gaido 

Closes #20219 from mgaido91/SPARK-23025.

(cherry picked from commit 505086806997b4331d4a8c2fc5e08345d869a23c)
Signed-off-by: gatorsmile 

commit db27a93652780f234f3c5fe750ef07bc5525d177
Author: Dongjoon Hyun 
Date:   2018-01-12T18:18:42Z

[MINOR][BUILD] Fix Java linter errors

## What changes were proposed in this pull request?

This PR cleans up the java-lint errors (for v2.3.0-rc1 tag). Hopefully, 
this will be the final one.

```
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/unsafe/memory/HeapMemoryAllocator.java:[85] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/launcher/InProcessAppHandle.java:[20,8] (imports) UnusedImports: Unused import - java.io.IOException.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java:[41,9] (modifier) ModifierOrder: 'private' modifier out of order with the JLS suggestions.
[ERROR] 

[GitHub] spark pull request #20707: [SPARK-21209][MLLIB] Implement Incremental PCA a...

2018-03-01 Thread sandecho
GitHub user sandecho closed the pull request at:

https://github.com/apache/spark/pull/20707


---




[GitHub] spark pull request #20707: [SPARK-21209][MLLIB] Implement Incremental PCA a...

2018-03-01 Thread sandecho
GitHub user sandecho opened a pull request:

https://github.com/apache/spark/pull/20707

[SPARK-21209][MLLIB] Implement Incremental PCA algorithm

## What changes were proposed in this pull request?

A new feature, the Incremental Principal Component Analysis (IPCA) algorithm, 
is proposed. It splits the incoming data into batches and computes the PCA of 
each batch incrementally to produce the principal components of the entire 
dataset.

## How was this patch tested?
Unit Testing

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20707.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20707


commit 6bb22961c0c9df1a1f22e9491894895b297f5288
Author: Sameer Agarwal 
Date:   2018-01-11T23:23:17Z

Preparing development version 2.3.1-SNAPSHOT

commit 55695c7127cb2f357dfdf677cab4d21fc840aa3d
Author: WeichenXu 
Date:   2018-01-12T00:20:30Z

[SPARK-23008][ML] OneHotEncoderEstimator python API

## What changes were proposed in this pull request?

OneHotEncoderEstimator Python API.

## How was this patch tested?

doctest

Author: WeichenXu 

Closes #20209 from WeichenXu123/ohe_py.

(cherry picked from commit b5042d75c2faa5f15bc1e160d75f06dfdd6eea37)
Signed-off-by: Joseph K. Bradley 

commit 3ae3e1bb71aa88be1c963b4416986ef679d7c8a2
Author: ho3rexqj 
Date:   2018-01-12T07:27:00Z

[SPARK-22986][CORE] Use a cache to avoid instantiating multiple instances 
of broadcast variable values

When resources happen to be constrained on an executor the first time a 
broadcast variable is instantiated, it is persisted to disk by the BlockManager. 
Consequently, every subsequent call to TorrentBroadcast::readBroadcastBlock 
from other instances of that broadcast variable spawns another instance of the 
underlying value. That is, broadcast variables are instantiated once per 
executor **unless** memory is constrained, in which case every instance of a 
broadcast variable is provided with a unique copy of the underlying value.

This patch fixes the above by explicitly caching the underlying values 
using weak references in a ReferenceMap.

Author: ho3rexqj 

Closes #20183 from ho3rexqj/fix/cache-broadcast-values.

(cherry picked from commit cbe7c6fbf9dc2fc422b93b3644c40d449a869eea)
Signed-off-by: Wenchen Fan 

commit d512d873b3f445845bd113272d7158388427f8a6
Author: WeichenXu 
Date:   2018-01-12T09:27:02Z

[SPARK-23008][ML][FOLLOW-UP] mark OneHotEncoder python API deprecated

## What changes were proposed in this pull request?

mark OneHotEncoder python API deprecated

## How was this patch tested?

N/A

Author: WeichenXu 

Closes #20241 from WeichenXu123/mark_ohe_deprecated.

(cherry picked from commit a7d98d53ceaf69cabaecc6c9113f17438c4e61f6)
Signed-off-by: Nick Pentreath 

commit 6152da3893a05b3f8dc0f13895af9be9548e5895
Author: Marco Gaido 
Date:   2018-01-12T10:04:44Z

[SPARK-23025][SQL] Support Null type in scala reflection

## What changes were proposed in this pull request?

Add support for `Null` type in the `schemaFor` method for Scala reflection.

## How was this patch tested?

Added UT

Author: Marco Gaido 

Closes #20219 from mgaido91/SPARK-23025.

(cherry picked from commit 505086806997b4331d4a8c2fc5e08345d869a23c)
Signed-off-by: gatorsmile 

commit db27a93652780f234f3c5fe750ef07bc5525d177
Author: Dongjoon Hyun 
Date:   2018-01-12T18:18:42Z

[MINOR][BUILD] Fix Java linter errors

## What changes were proposed in this pull request?

This PR cleans up the java-lint errors (for v2.3.0-rc1 tag). Hopefully, 
this will be the final one.

```
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/unsafe/memory/HeapMemoryAllocator.java:[85] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/launcher/InProcessAppHandle.java:[20,8] (imports) UnusedImports: Unused import - java.io.IOException.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java:[41,9] (modifier) ModifierOrder: