[GitHub] spark pull request #23179: Fix the rat excludes on .policy.yml

2018-11-29 Thread justinuang
Github user justinuang closed the pull request at:

https://github.com/apache/spark/pull/23179





[GitHub] spark pull request #23179: Fix the rat excludes on .policy.yml

2018-11-29 Thread justinuang
GitHub user justinuang opened a pull request:

https://github.com/apache/spark/pull/23179

Fix the rat excludes on .policy.yml

## What changes were proposed in this pull request?

Fix the rat excludes on .policy.yml

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/palantir/spark juang/fix-rat-policy-yml

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23179.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23179


commit 78b34b40a7e034dd641418b804e6e2606b216ba4
Author: Robert Kruszewski 
Date:   2018-04-01T12:41:47Z

Fix publish after k8s rebase (#347)

commit 2788441fb6f945d1d945caa4675c97b8b2f5a472
Author: Patrick Woody 
Date:   2018-04-02T17:54:15Z

Revert "transformexpression with origin" (#350)

commit 4cc4dee11883bf1954181ec808f0f57a9ee55c55
Author: Patrick Woody 
Date:   2018-04-02T17:54:25Z

Add reminder for upstream ticket/PR to github template (#351)

commit 078066bdc9a77dd0c241fae544806d043cb0b167
Author: Robert Kruszewski 
Date:   2018-03-31T14:25:56Z

resolve conflicts

commit fe35b58a9e8b1bdde111b542371123907686ba97
Author: mcheah 
Date:   2018-04-02T21:37:23Z

Empty commit to clear Circle cache.

commit 1264fb5908d3eab2cccfaf9b22b6975c7afd20d4
Author: mcheah 
Date:   2018-04-03T00:29:36Z

Empty commit to tag 2.4.0-palantir.12 and trigger publish.

commit b7410ba819d4e3e37f59e8f5df0d47e78c92a362
Author: Robert Kruszewski 
Date:   2018-04-03T14:02:21Z

Fix circle checkout for tags (#352)

commit 6da0b8266906f3e1c804627c9a009a18ed102874
Author: Robert Kruszewski 
Date:   2018-04-03T17:32:23Z

Merge pull request #346 from palantir/rk/upstream

Update to upstream

commit 7b12f6367dbf5d5b1da06aa0cf204658de2ebbe7
Author: Bryan Cutler 
Date:   2018-04-02T16:53:37Z

[SPARK-15009][PYTHON][FOLLOWUP] Add default param checks for 
CountVectorizerModel

## What changes were proposed in this pull request?

Adding test for default params for `CountVectorizerModel` constructed from 
vocabulary.  This required that the param `maxDF` be added, which was done in 
SPARK-23615.
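
For context, a rough pyspark sketch of the case being tested (column names and vocabulary are illustrative, not taken from the patch):

    from pyspark.ml.feature import CountVectorizerModel

    # Build the model directly from a vocabulary and inspect its params;
    # the added test checks that defaults (including maxDF) are populated.
    cvm = CountVectorizerModel.from_vocabulary(["a", "b", "c"],
                                               inputCol="words",
                                               outputCol="features")
    print(cvm.explainParams())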

## How was this patch tested?

Added an explicit test for CountVectorizerModel in DefaultValuesTests.

Author: Bryan Cutler 

Closes #20942 from 
BryanCutler/pyspark-CountVectorizerModel-default-param-test-SPARK-15009.

commit 60e1bd62d72cc5fadbfc96ad6b1f3b84bd36335e
Author: David Vogelbacher 
Date:   2018-04-02T19:00:37Z

[SPARK-23825][K8S] Requesting memory + memory overhead for pod memory

## What changes were proposed in this pull request?

Kubernetes driver and executor pods should request `memory + 
memoryOverhead` as their resources instead of just `memory`, see 
https://issues.apache.org/jira/browse/SPARK-23825
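
A hedged pyspark illustration of the two settings involved (values are made up, and the exact overhead property name has varied across Spark/K8s versions):

    from pyspark.sql import SparkSession

    # With this change the executor pod should request 4g + 1g = 5g of memory,
    # not just spark.executor.memory.
    spark = (SparkSession.builder
             .config("spark.executor.memory", "4g")
             .config("spark.executor.memoryOverhead", "1g")
             .getOrCreate())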

## How was this patch tested?
Existing unit tests were adapted.

Author: David Vogelbacher 

Closes #20943 from dvogelbacher/spark-23825.

commit 08f64b4048072a97a92dca94ded78f2de46525f2
Author: Yinan Li 
Date:   2018-04-02T19:20:55Z

[SPARK-23285][K8S] Add a config property for specifying physical executor 
cores

## What changes were proposed in this pull request?

As mentioned in SPARK-23285, this PR introduces a new configuration 
property `spark.kubernetes.executor.cores` for specifying the physical CPU 
cores requested for each executor pod. This is to avoid changing the semantics 
of `spark.executor.cores` and `spark.task.cpus` and their role in task 
scheduling, task parallelism, dynamic resource allocation, etc. The new 
configuration property only determines the physical CPU cores available to an 
executor. An executor can still run multiple tasks simultaneously by using 
appropriate values for `spark.executor.cores` and `spark.task.cpus`.
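
A hedged pyspark sketch of how the new property sits alongside the existing ones (values are illustrative):

    from pyspark.sql import SparkSession

    # spark.kubernetes.executor.cores only sets the pod's physical CPU request;
    # task parallelism still comes from spark.executor.cores and spark.task.cpus.
    spark = (SparkSession.builder
             .config("spark.kubernetes.executor.cores", "0.5")
             .config("spark.executor.cores", "2")
             .config("spark.task.cpus", "1")
             .getOrCreate())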

## How was this patch tested?

Unit tests.

felixcheung srowen jiangxb1987 jerryshao mccheah foxish

Author: Yinan Li 
Author: Yinan Li 

Closes #20553 from liyinan926/master.

commit 8a307d1b4db5ed9e6634142002139945ff3a79bd
Author: Kazuaki Ishizaki 
Date:   2018-04-02T19:48:44Z

[SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder classes

## What changes were proposed in this pull request?

This PR implemented the following cleanups related to  `UnsafeWriter` class:
- Remove code duplication between `UnsafeRowWriter` and `UnsafeArrayWriter`
- Make `BufferHolder` class internal by delegating its accessor methods to 
`UnsafeWriter`
- Replace `UnsafeRow.setTotalSize(...)` with 
`UnsafeRowWriter.setTotalSize()`

## How was this patch tested?

Tested by existing UTs

Author: Kazuaki Ishizaki 

Closes #20850 from kiszk/SPARK-2371

[GitHub] spark issue #20877: [SPARK-23765][SQL] Supports custom line separator for js...

2018-11-27 Thread justinuang
Github user justinuang commented on the issue:

https://github.com/apache/spark/pull/20877
  
Sorry, I won't be able to take it over!





[GitHub] spark pull request #23051: [AE2.3-02][SPARK-23128] Add QueryStage and the fr...

2018-11-15 Thread justinuang
GitHub user justinuang opened a pull request:

https://github.com/apache/spark/pull/23051

[AE2.3-02][SPARK-23128] Add QueryStage and the framework for adaptive 
execution (auto setting the number of reducers)

## What changes were proposed in this pull request?

Add QueryStage and the framework for adaptive execution.

The main benefit from this PR is that the reducer count is set 
automatically based on a target file size.

We got this PR (branch ae-02) from 
https://github.com/Intel-bigdata/spark-adaptive/pull/43, which is based on 
branch ae-01, but I decided not to merge the ae-01 changes because they require 
invasive changes to spark-core and a protocol change to the external shuffle service.

This PR should be relatively safe to merge because most of the code changes 
are to adaptive query execution, which isn't turned on by default.
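
A hedged sketch of turning the feature on from pyspark; the property names are the ones used in the Intel adaptive-execution branch and are assumptions here:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # adaptive execution is off by default; this opts in
             .config("spark.sql.adaptive.enabled", "true")
             # target bytes per post-shuffle partition, used to pick the reducer count
             .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")
             .getOrCreate())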

## How was this patch tested?

Unit tests

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/palantir/spark juang/cherry-pick-ae-02

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23051.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23051


commit 7827060679786afc02e7e6e3d778c2fcb2c13db9
Author: Dan Sanduleac 
Date:   2018-03-24T00:40:20Z

v1-maven-build-with-version should cache by revision not buildNum since it 
needs to be common between different jobs

commit 06290c19132f0be5a3e5f6a32b3c4458beadc394
Author: Dan Sănduleac 
Date:   2018-03-24T20:19:19Z

Ignore flaky scala tests as well as hive tests (#335)

commit 2bc8fafe45711a64a56f4d031e75dc609c5314e6
Author: Dan Sănduleac 
Date:   2018-03-25T18:16:16Z

Treat classnames with only skipped tests as having taken 0 time (#336)

commit 2c8c96be2b9719cc998d113d5c7cabf6c51a2403
Author: Robert Kruszewski 
Date:   2018-03-26T09:38:09Z

Force okhttp logging interceptor(#337)

commit 4c99e6354198ec11b46ffc38014fdab6b55dcffd
Author: Dan Sănduleac 
Date:   2018-03-26T11:13:07Z

Handle nulls in k8s responses correctly (#334)

commit cf31e8342e5c0b771c2b5dcb3b9a86540adf1f92
Author: Dan Sănduleac 
Date:   2018-03-26T12:18:15Z

Store/restore ~/.m2 after versioned build (since pom.xml changes) (#339)

commit de656d21c658bad0b7f873e9da541b7cb303c5fa
Author: Dan Sănduleac 
Date:   2018-03-27T00:12:43Z

build-sbt directly, and don't restore build-maven where not necessary (#340)

commit d531f534734226dc65be04ec9e9714792afa983c
Author: Dan Sanduleac 
Date:   2018-03-27T11:59:34Z

empty commit

commit 1aeaf27ae65ff3f625235b48a3a0e75d0a3fbb11
Author: Dan Sănduleac 
Date:   2018-03-28T12:46:12Z

Faster deploy by parallelizing maven and skipping unnecessary second 'mvn 
package' (#342)

commit 44a14cdafe247f7094d7571e00cfd8e85bf0e397
Author: Jeremy Liu 
Date:   2018-03-28T19:54:33Z

Move RBackend to member variable

commit 5d88c9527b602728ccaf0a48d0106b2729d46a2a
Author: Dan Sănduleac 
Date:   2018-03-29T14:02:06Z

[SPARK-23795][LAUNCHER] Make AbstractLauncher#self() protected (#341)

commit 41415d4865b625da8516739e0e63acdb1137a3b0
Author: mcheah 
Date:   2018-03-07T01:59:03Z

Rebase to upstream's version of Kubernetes support.

commit 4ac24329b53e51cdc3990f634ed7a2249c8423e3
Author: mcheah 
Date:   2018-03-12T20:46:21Z

Replace manifest

commit 6d23bae6fcccb483128c9d70438653b0c239c8c6
Author: Ilan Filonenko 
Date:   2018-03-19T18:29:56Z

[SPARK-22839][K8S] Remove the use of init-container for downloading remote 
dependencies

Removal of the init-container for downloading remote dependencies. Built 
off of the work done by vanzin in an attempt to refactor driver/executor 
configuration elaborated in 
[this](https://issues.apache.org/jira/browse/SPARK-22839) ticket.

This patch was tested with unit and integration tests.

Author: Ilan Filonenko 

Closes #20669 from ifilonenko/remove-init-container.

commit 1d60e389e6b84b158a91e1a9cdeeb124949c4d07
Author: mcheah 
Date:   2018-03-29T18:51:02Z

Match entrypoint as well

commit 5774deb0022235455e84387a304fa6823f939f74
Author: amenck 
Date:   2018-03-29T19:56:56Z

Merge pull request #343 from jeremyjliu/jl/expose-r-backend

Move RBackend to member variable

commit 4e7f4f09512a5a30f72ed679fad594f87b12db75
Author: Dan Sănduleac 
Date:   2018-03-29T20:29:52Z

Properly remove hive from modules (#338)

commit 95cf5f7523f60cfdd7fdc9d76dfd2668c287785c
Author: mccheah 
Date:   2018-03-29T22:57:40Z

Merge pull request #324 from palantir/use-upstream-kubernetes

Rebase to upstream's version of Kubernetes support.

commit a7383de811ea01f60aeb642b6c192aecef14ff6a
Author: Robert Kruszewski 
Date:   2018-03-30T02:30:51Z

mapexpressions preserving origin

[GitHub] spark pull request #23051: [AE2.3-02][SPARK-23128] Add QueryStage and the fr...

2018-11-15 Thread justinuang
Github user justinuang closed the pull request at:

https://github.com/apache/spark/pull/23051





[GitHub] spark pull request #22968: Merge upstream

2018-11-07 Thread justinuang
Github user justinuang closed the pull request at:

https://github.com/apache/spark/pull/22968





[GitHub] spark pull request #22968: Merge upstream

2018-11-07 Thread justinuang
GitHub user justinuang opened a pull request:

https://github.com/apache/spark/pull/22968

Merge upstream

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/palantir/spark juang/merge-easy-upstream

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22968.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22968


commit 349a63c12acc919dbd25cc54fe75230b6f16224a
Author: Dan Sănduleac 
Date:   2018-03-23T19:41:50Z

Improve bin packing and reduce scala parallelism (#333)

commit 7827060679786afc02e7e6e3d778c2fcb2c13db9
Author: Dan Sanduleac 
Date:   2018-03-24T00:40:20Z

v1-maven-build-with-version should cache by revision not buildNum since it 
needs to be common between different jobs

commit 06290c19132f0be5a3e5f6a32b3c4458beadc394
Author: Dan Sănduleac 
Date:   2018-03-24T20:19:19Z

Ignore flaky scala tests as well as hive tests (#335)

commit 2bc8fafe45711a64a56f4d031e75dc609c5314e6
Author: Dan Sănduleac 
Date:   2018-03-25T18:16:16Z

Treat classnames with only skipped tests as having taken 0 time (#336)

commit 2c8c96be2b9719cc998d113d5c7cabf6c51a2403
Author: Robert Kruszewski 
Date:   2018-03-26T09:38:09Z

Force okhttp logging interceptor(#337)

commit 4c99e6354198ec11b46ffc38014fdab6b55dcffd
Author: Dan Sănduleac 
Date:   2018-03-26T11:13:07Z

Handle nulls in k8s responses correctly (#334)

commit cf31e8342e5c0b771c2b5dcb3b9a86540adf1f92
Author: Dan Sănduleac 
Date:   2018-03-26T12:18:15Z

Store/restore ~/.m2 after versioned build (since pom.xml changes) (#339)

commit de656d21c658bad0b7f873e9da541b7cb303c5fa
Author: Dan Sănduleac 
Date:   2018-03-27T00:12:43Z

build-sbt directly, and don't restore build-maven where not necessary (#340)

commit d531f534734226dc65be04ec9e9714792afa983c
Author: Dan Sanduleac 
Date:   2018-03-27T11:59:34Z

empty commit

commit 1aeaf27ae65ff3f625235b48a3a0e75d0a3fbb11
Author: Dan Sănduleac 
Date:   2018-03-28T12:46:12Z

Faster deploy by parallelizing maven and skipping unnecessary second 'mvn 
package' (#342)

commit 44a14cdafe247f7094d7571e00cfd8e85bf0e397
Author: Jeremy Liu 
Date:   2018-03-28T19:54:33Z

Move RBackend to member variable

commit 5d88c9527b602728ccaf0a48d0106b2729d46a2a
Author: Dan Sănduleac 
Date:   2018-03-29T14:02:06Z

[SPARK-23795][LAUNCHER] Make AbstractLauncher#self() protected (#341)

commit 41415d4865b625da8516739e0e63acdb1137a3b0
Author: mcheah 
Date:   2018-03-07T01:59:03Z

Rebase to upstream's version of Kubernetes support.

commit 4ac24329b53e51cdc3990f634ed7a2249c8423e3
Author: mcheah 
Date:   2018-03-12T20:46:21Z

Replace manifest

commit 6d23bae6fcccb483128c9d70438653b0c239c8c6
Author: Ilan Filonenko 
Date:   2018-03-19T18:29:56Z

[SPARK-22839][K8S] Remove the use of init-container for downloading remote 
dependencies

Removal of the init-container for downloading remote dependencies. Built 
off of the work done by vanzin in an attempt to refactor driver/executor 
configuration elaborated in 
[this](https://issues.apache.org/jira/browse/SPARK-22839) ticket.

This patch was tested with unit and integration tests.

Author: Ilan Filonenko 

Closes #20669 from ifilonenko/remove-init-container.

commit 1d60e389e6b84b158a91e1a9cdeeb124949c4d07
Author: mcheah 
Date:   2018-03-29T18:51:02Z

Match entrypoint as well

commit 5774deb0022235455e84387a304fa6823f939f74
Author: amenck 
Date:   2018-03-29T19:56:56Z

Merge pull request #343 from jeremyjliu/jl/expose-r-backend

Move RBackend to member variable

commit 4e7f4f09512a5a30f72ed679fad594f87b12db75
Author: Dan Sănduleac 
Date:   2018-03-29T20:29:52Z

Properly remove hive from modules (#338)

commit 95cf5f7523f60cfdd7fdc9d76dfd2668c287785c
Author: mccheah 
Date:   2018-03-29T22:57:40Z

Merge pull request #324 from palantir/use-upstream-kubernetes

Rebase to upstream's version of Kubernetes support.

commit a7383de811ea01f60aeb642b6c192aecef14ff6a
Author: Robert Kruszewski 
Date:   2018-03-30T02:30:51Z

mapexpressions preserving origin

commit 479cf4633bc415b33fd80fc969be885b0decc5cb
Author: Robert Kruszewski 
Date:   2018-03-30T02:34:47Z

correct place

commit bb10a57784784fa0f661540aa5cf3acb4dad7651
Author: mccheah 
Date:   2018-03-30T19:13:37Z

Merge pull request #344 from palantir/rk

[GitHub] spark pull request #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in...

2018-10-18 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22503#discussion_r226386187
  
--- Diff: sql/core/src/test/resources/test-data/cars-crlf.csv ---
@@ -0,0 +1,7 @@
+
+year,make,model,comment,blank
+"2012","Tesla","S","No comment",
+
+1997,Ford,E350,"Go get one now they are going fast",
+2015,Chevy,Volt
+
--- End diff --

Yea, if you open it with `vi -b 
sql/core/src/test/resources/test-data/cars-crlf.csv`, you can see the `^M` 
characters, which according to 
[this](https://stackoverflow.com/questions/3860519/see-line-breaks-and-carriage-returns-in-editor)
 represent a CRLF.





[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-10-17 Thread justinuang
Github user justinuang commented on the issue:

https://github.com/apache/spark/pull/22503
  
done!





[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-10-16 Thread justinuang
Github user justinuang commented on the issue:

https://github.com/apache/spark/pull/22503
  
So Hadoop's LineReader looks like it handles CR, LF, CRLF:


https://github.com/apache/hadoop/blob/f90c64e6242facf38c2baedeeda42e4a8293e642/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L36

Univocity handles CR, LF, CRLF (the logic is a bit convoluted but it looks 
like they have the same behavior in that if they see a CR, they will look for a 
LF next):


https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/input/LineSeparatorDetector.java

I do agree we should expose the option of `setLineSeparator`, but 
regardless of that, the default behavior of handling CR, LF, CRLF should be the 
same between single line and multiline mode.





[GitHub] spark pull request #22680: [SPARK-25493][SQL] Use auto-detection for CRLF in...

2018-10-09 Thread justinuang
Github user justinuang closed the pull request at:

https://github.com/apache/spark/pull/22680





[GitHub] spark pull request #22680: [SPARK-25493][SQL] Use auto-detection for CRLF in...

2018-10-09 Thread justinuang
GitHub user justinuang opened a pull request:

https://github.com/apache/spark/pull/22680

[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline 
mode

## Upstream SPARK-X ticket and PR link (if not applicable, explain)

Went through review, but upstream is not merging. Discussed offline with 
@vinooganesh that we will merge here first.

https://github.com/apache/spark/pull/22503/files

## What changes were proposed in this pull request?

CSVs with Windows-style CRLF ('\r\n') line endings don't work in multiline mode. They 
work fine in single-line mode because the line splitting is done by Hadoop, 
which can handle all the different types of line separators. This PR fixes it 
by enabling Univocity's line separator detection in multiline mode, which will 
detect '\r\n', '\r', or '\n' automatically, just as Hadoop does in single-line 
mode.
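
From the user's side, the intended effect can be sketched as follows (assumes an existing SparkSession `spark` and a hypothetical CRLF-terminated file):

    path = "/tmp/cars-crlf.csv"  # hypothetical file with \r\n line endings

    single = spark.read.csv(path, header=True)                 # split by Hadoop
    multi = spark.read.csv(path, header=True, multiLine=True)  # now detects \r\n too

    # Before this change the multiLine read would keep stray \r characters;
    # with line separator detection enabled both reads should agree.
    assert single.collect() == multi.collect()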

## How was this patch tested?

Unit test with a file with crlf line endings.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/palantir/spark 
palantirspark/fix-clrf-multiline

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22680.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22680


commit 8bc932a49a76d482510242a7d040fdf7e888c895
Author: Dan Sanduleac 
Date:   2018-03-18T15:15:59Z

Delete commented out code that's no longer applicable

commit cd4afe2e13a7140023cb50a8b9be798bb7b86e61
Author: Dan Sanduleac 
Date:   2018-03-18T15:16:22Z

Bump build-sbt cache to v1-build-sbt.. think old cache causes the OOM 
somehow

commit 047e65a0b916a756d2dc7b8106acfd94f530f07a
Author: Dan Sanduleac 
Date:   2018-03-18T23:31:39Z

Move all-project exclusion and global setting to DefaultSparkPlugin, nuke 
excludeDependencies

commit 53d6f5aa9f8627388a732ae6dd3ced0b6192fcf3
Author: Dan Sanduleac 
Date:   2018-03-18T23:32:26Z

Make enable() accept any DslEntry allowing enablePlugins etc not just 
Seq[Setting[_]]

commit e154185f7e4c94da78375de4f590f2fd20610e6b
Author: Dan Sanduleac 
Date:   2018-03-19T00:56:57Z

Exclude com.sun.jersey crap but only from copyJarsProjects (assembly, 
examples)

commit ecd06e96825220bdf8dbec3c5fa8725aae89eb7c
Author: Dan Sanduleac 
Date:   2018-03-19T00:59:03Z

revert unnecessary exclusions in hadoop-palantir/pom.xml

commit bde3a2af44e9def680615aa47d82bc4decf43a18
Author: Dan Sanduleac 
Date:   2018-03-19T01:04:09Z

ensure we update sbt before getting externalDependencyClasspath, prevent 
badly cached resolution results!!

commit cae7f8cc4381d01074815d9c9c29bf115b855cac
Author: Dan Sanduleac 
Date:   2018-03-19T12:25:13Z

delete unnecesary sbt cache restore in deploy

commit f4af82f99e0f14843c762a7f98d0f952cfb57d52
Author: Dan Sanduleac 
Date:   2018-03-19T12:29:44Z

make home-sbt cache depend on project update inputs

commit 35bebba53c5c9c7f9427334bab8a47bb9129be1c
Author: Dan Sanduleac 
Date:   2018-03-19T12:48:59Z

python / R tests also don't use SBT or maven

commit bec5c1eb9cd4489662b49770c0f77d0202987479
Author: Dan Sanduleac 
Date:   2018-03-19T12:51:09Z

fix v2-home-sbt cache, I guess it doesn't need escaping $

commit 82197f8198b44a3beea0430274e43e2d2b7509a5
Author: Dan Sanduleac 
Date:   2018-03-19T15:27:05Z

Log which tests (per project) didn't have timings

commit 00f28de1684632afcc43dd3e21781c584915509d
Author: Dan Sanduleac 
Date:   2018-03-19T15:36:46Z

Log which tests (per project) didn't have timings

commit 8b7fd7ff2be699e8aebcb014e1033e12e72b585e
Author: Dan Sanduleac 
Date:   2018-03-19T16:43:47Z

parallelize python tests, and feed the right versions into packaging tests 
(run-pip-tests)

commit 886f496483eaf673fda58a7fde932dfedd514bcd
Author: Dan Sanduleac 
Date:   2018-03-19T17:10:47Z

I guess we need to set up logging too

commit 0047e3165c973b11ea6f2e63c90fcb1888e93c3a
Author: Dan Sanduleac 
Date:   2018-03-19T18:26:10Z

disable cached resolution

commit 0e749a60656c7574529f5aee4d8785697f5af668
Author: Dan Sanduleac 
Date:   2018-03-19T23:48:43Z

run all python tests before giving up, don't stop early

commit 302b8351c44882c84ce1eb5fd6fb0454b4e1c276
Author: Dan Sanduleac 
Date:   2018-03-20T00:20:29Z

Explicitly calling `update` seems to be very slow without cached 
resolution. If we just call externalDependencyClasspath though that might be 
enough

commit 636bef690d799d3e60dcabfaef783d2636f84878
Author: Dan Sanduleac 
Date:   2018-03-22T01:15:28Z

downgrade yer numpys. newer numpy breaks a bunch of tests because of 
different array formatting & more

commit 7aaf1c6d3abdc3093f73e5f25b8f9bbc5d60b756
Author: Dan Sanduleac 
Date:   2018-03-22T01:25:50Z

Remove commented out miniconda installation since it's now in base, add 
comment about numpy


[GitHub] spark pull request #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in...

2018-10-02 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22503#discussion_r222053706
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -212,6 +212,8 @@ class CSVOptions(
 settings.setEmptyValue(emptyValueInRead)
 settings.setMaxCharsPerColumn(maxCharsPerColumn)
 
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
+settings.setLineSeparatorDetectionEnabled(multiLine)
--- End diff --

done





[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-09-28 Thread justinuang
Github user justinuang commented on the issue:

https://github.com/apache/spark/pull/22503
  
What does it take to get this to be merged in?





[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-09-26 Thread justinuang
Github user justinuang commented on the issue:

https://github.com/apache/spark/pull/22503
  
Sounds good, thanks guys =)





[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-09-25 Thread justinuang
Github user justinuang commented on the issue:

https://github.com/apache/spark/pull/22503
  
It looks like a flake? Can someone retrigger it?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96511/console





[GitHub] spark pull request #22503: [SPARK-25493] [SQL] Fix multiline crlf

2018-09-20 Thread justinuang
GitHub user justinuang opened a pull request:

https://github.com/apache/spark/pull/22503

[SPARK-25493] [SQL] Fix multiline crlf

## What changes were proposed in this pull request?

CSVs with Windows-style CRLF (carriage return + line feed) line endings don't work in 
multiline mode. They work fine in single-line mode because the line splitting 
is done by Hadoop, which can handle all the different types of line separators. 
This fixes it by enabling Univocity's line separator detection.

## How was this patch tested?

Unit test with a file with crlf line endings.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/justinuang/spark fix-clrf-multiline

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22503.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22503


commit 5ce9de9f789ce108f6afb65e38bab44acc77a4e8
Author: Justin Uang 
Date:   2018-09-20T20:41:35Z

Fix multiline crlf







[GitHub] spark issue #19591: [SPARK-11035][core] Add in-process Spark app launcher.

2017-10-30 Thread justinuang
Github user justinuang commented on the issue:

https://github.com/apache/spark/pull/19591
  
Really looking forward to this PR! For our use case, it will reduce our 
spark launch times by ~4 seconds.





[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-08-15 Thread justinuang
Github user justinuang commented on the issue:

https://github.com/apache/spark/pull/15009
  
That would be incredible. Launching a new jvm and loading all of hadoop 
takes about 4 seconds extra each time, versus reusing the launcher jvm, which 
is really significant for us since we launch a lot of jobs and users have to 
wait on this.





[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-08-11 Thread justinuang
Github user justinuang commented on the issue:

https://github.com/apache/spark/pull/15009
  
@kishorvpatil this will be quite useful for us! To avoid the 3s cost of 
spinning up a new jvm just for yarn-cluster





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-10-05 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r41220284
  
--- Diff: python/setup.py ---
@@ -0,0 +1,18 @@
+#!/usr/bin/env python
+
+from setuptools import setup
+
+exec(compile(open("pyspark/pyspark_version.py").read(), 
+   "pyspark/pyspark_version.py", 'exec'))
+VERSION = __version__
+
+setup(name='pyspark',
+version=VERSION,
+description='Apache Spark Python API',
+author='Spark Developers',
+author_email='d...@spark.apache.org',
+url='https://github.com/apache/spark/tree/master/python',
+packages=['pyspark', 'pyspark.mllib', 'pyspark.ml', 'pyspark.sql', 
'pyspark.streaming'],
+install_requires=['numpy>=1.7', 'py4j==0.8.2.1', 'pandas'],
--- End diff --

pyspark.sql does depend on pandas right? toPandas()?





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-10-05 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r41221064
  
--- Diff: python/pyspark/pyspark_version.py ---
@@ -0,0 +1,17 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+__version__ = '1.5.0'
--- End diff --

How is the version number specified for the scala side now?





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-29 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-144187766
  
Thanks for the reminder!





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-29 Thread justinuang
Github user justinuang closed the pull request at:

https://github.com/apache/spark/pull/8662





[GitHub] spark pull request: [SPARK-10685] [SPARK-8632] [SQL] [PYSPARK] Pyt...

2015-09-21 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8833#discussion_r40048503
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1414,7 +1414,7 @@ def __init__(self, func, returnType, name=None):
 def _create_judf(self, name):
 f, returnType = self.func, self.returnType  # put them in closure 
`func`
 func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), 
it)
-ser = AutoBatchedSerializer(PickleSerializer())
+ser = BatchedSerializer(PickleSerializer(), 100)
--- End diff --

Good point, I was still thinking about my first attempt which involved a 
blocking queue.





[GitHub] spark pull request: [SPARK-10685] [SPARK-8632] [SQL] [PYSPARK] Pyt...

2015-09-21 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8833#issuecomment-142161491
  
lgtm! So this avoids deadlock by getting rid of the blocking queue (duh!) 
and then assumes the OS buffer will rate limit how much gets buffered on the 
writer side?

Looking forward to getting this fix in.





[GitHub] spark pull request: [SPARK-10685] [SPARK-8632] [SQL] [PYSPARK] Pyt...

2015-09-20 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8833#discussion_r39933648
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1414,7 +1414,7 @@ def __init__(self, func, returnType, name=None):
 def _create_judf(self, name):
 f, returnType = self.func, self.returnType  # put them in closure 
`func`
 func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), 
it)
-ser = AutoBatchedSerializer(PickleSerializer())
+ser = BatchedSerializer(PickleSerializer(), 100)
--- End diff --

Can we pull this out into a constant? And do the same for the matching value on 
the Python side, with a comment on each saying that they have to be equal? It's 
very dangerous if this value goes out of sync.
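
Something along these lines is what the comment asks for (the constant name and its location are made up):

    # pyspark/serializers.py (illustrative)
    UDF_BATCH_SIZE = 100  # must match the batch size hard-coded on the JVM side

    # pyspark/sql/functions.py (illustrative)
    from pyspark.serializers import BatchedSerializer, PickleSerializer, UDF_BATCH_SIZE

    ser = BatchedSerializer(PickleSerializer(), UDF_BATCH_SIZE)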





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-141121225
  
@rxin what do you mean by local iterators =) I feel like i'm missing some 
context that you guys have





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-141117878
  
The solution with the iterator wrapper was my first approach that I 
prototyped 
(http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html).
 It's dangerous because there is buffering at many levels, in which case we can 
run into a deadlock situation.

- NEW: the ForkingIterator LinkedBlockingDeque
- batching the rows before pickling them
- os buffers on both sides
- pyspark.serializers.BatchedSerializer

We can avoid deadlock by being very disciplined. For example, we can 
instead have the ForkingIterator always check whether the 
LinkedBlockingDeque is full and, if so (a rough sketch follows the two lists below):

Java
- flush the java pickling buffer
- send a flush command to the python process
- os.flush the java side

Python
- flush BatchedSerializer
- os.flush()
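
A purely illustrative Python-side sketch of that discipline (none of these names exist in pyspark; FLUSH is a hypothetical control code sent by the JVM):

    def maybe_flush(cmd, pending_rows, serializer, outfile):
        if cmd == FLUSH:
            # emit whatever the batched serializer is still holding...
            serializer.dump_stream(pending_rows, outfile)
            del pending_rows[:]
            # ...and push it past the OS-level buffer back to the JVM
            outfile.flush()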

I'm not sure that this performance regression for a single UDF is going to 
hit many people. For one, most upstreams are not a range() call, which doesn't 
have to go back to disk and deserialize. My personal opinion is that the 
blocking performance shouldn't be the reason we reject this approach; if we 
reject it, it should be because it adds complexity.

If we want a quick fix that is safe, I would be in favor of passing the 
row, which indeed is slower, but better than deadlocking or calculating 
upstream twice. It's just that the current system is unacceptable.

Maybe we can also consider going with a complete architecture shift that 
goes with a batching system, but uses thrift to serialize the scala types to a 
language agnostic format, and also handle the blocking RPC. Then we can have 
PySpark and SparkR using the same simple UDF architecture. The main drawback is 
that I'm not sure how we're going to support broadcast variables or 
aggregators, but should those even be supported with UDFs?





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-141232211
  
I'm not sure there is a solution that satisfies all the requirements. I can 
say that this approach addresses 1,2,4 by design.

Would you guys support a 1.6.0 UDF implementation that uses thrift for the 
RPC and serialization? In general, I think the custom-rolled socket, 
serialization, and cleanup approach is pretty scary. They're already solved 
problems, and then we can support multiple language bindings at the DataFrame 
level, where I think it's a lot easier to implement. We could even support 
broadcast variables by allowing language bindings to store bytes in the UDF 
that will be passed back to them. I don't think we need to support accumulators 
right?





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-09-16 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8318#issuecomment-140871937
  
Thanks! Sorry for being demanding, was just hoping to get this into 1.6.0!





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-140920743
  
@davies how do I have a private class in python?

In addition, is it possible that the failing unit test is flaky? I ran

./run-tests --python-executables=python

and it succeeds locally.





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-140936346
  
Hey davies, I think the performance regression for a single UDF may be 
because there were multiple threads per task that could potentially be taking 
up CPU time. I highly doubt that the actual IO over loopback actually adds 
much time, compared to the time of deserializing and serializing the individual 
items in the row.

The other approach of passing the entire row can potentially be okay, and 
it doesn't add a lot of changes to PythonRDD and Python UDFs, but I'm afraid 
that the cost of serializing the entire row can be prohibitive. After all, 
isn't serialization from in-memory jvm types to the pickled representations the 
most expensive part? What if I have a giant row of 100 columns, and I only want 
to do a UDF on one column? Do I need to serialize the entire row to pickle?





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-09-16 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8318#issuecomment-140866466
  
What is this blocking on?





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-15 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-140413982
  
Jenkins, retest this please





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-15 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-140416126
  
@rxin or @davies why is this automatically not retriggering when i push a 
new commit? Also, looks like the "retest this please" only works with 
committers.





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-14 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-140223207
  
Looks like your intuition was right. The second time through it's slightly faster, 
so I ran the loop twice and took the second run's numbers.

Here are the updated numbers

With fix
Number of udfs: 0 - 0.0953350067139
Number of udfs: 1 - 1.73201990128
Number of udfs: 2 - 3.41883206367
Number of udfs: 3 - 5.24572992325
Number of udfs: 4 - 6.83000802994
Number of udfs: 5 - 8.59465384483

Without fix
Number of udfs: 0 - 0.0891687870026
Number of udfs: 1 - 1.53674888611
Number of udfs: 2 - 4.44895505905
Number of udfs: 3 - 10.0561971664
Number of udfs: 4 - 21.5314221382
Number of udfs: 5 - 43.887141943 

It does look like there's a tiny performance drop for 1 UDF. My guess is 
that it's slightly slower because the initial approach was slightly cheating 
with CPU time: it had 3 threads that could do computation at once. However, 
that breaks the RDD abstraction that each partition should only get one 
thread to do CPU work.





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-14 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-140181530
  
Sorry for the delay, here is the code I ran and here are the results

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import time

mult = udf(lambda x: 2 * x, IntegerType())

for i in range(0, 6):
    df = sqlContext.range(100).withColumnRenamed("id", "f")
    for j in range(i):
        df = df.select(mult(df.f).alias('f'))

    start = time.time()
    df.count()  # make sure the Python UDF is evaluated
    used = time.time() - start
    print "Number of udfs: {} - {}".format(i, used)

The results are as expected. The Python overhead is about 1.5 seconds, but 
you can see how the time grows exponentially without the fix, since the cost of 
calculating the upstream twice includes the expensive Python operations themselves.

With fix
Number of udfs: 0 - 0.091050863266
Number of udfs: 1 - 1.72215199471
Number of udfs: 2 - 3.32698297501
Number of udfs: 3 - 5.64863801003
Number of udfs: 4 - 7.06328701973
Number of udfs: 5 - 9.22025489807

Without fix
Number of udfs: 0 - 1.00539588928
Number of udfs: 1 - 3.12671899796
Number of udfs: 2 - 5.91188406944
Number of udfs: 3 - 11.124516964
Number of udfs: 4 - 24.3277280331
Number of udfs: 5 - 47.621573925






[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-10 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-139331466
  
Hey davies, I don't have any numbers. Are there any benchmarks that we can 
just rerun?





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-10 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-139332688
  
Is there an example of another benchmark? I'm not sure where they're stored 
for python





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-09 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-139023500
  
Should the build have started by now?





[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-08 Thread justinuang
GitHub user justinuang opened a pull request:

https://github.com/apache/spark/pull/8662

[SPARK-8632] [SQL] [PYSPARK] Poor Python UDF performance because of R…

…DD caching

- I wanted to reuse most of the logic from PythonRDD, so I pulled out
  two methods, writeHeaderToStream and readPythonProcessSocket
- The worker.py now has a switch where it reads an int that tells it either
  to go into normal pyspark RDD mode, which is meant for a streaming
  two-thread workflow, or into pyspark UDF mode, which is meant to be called
  synchronously (a rough sketch of this switch follows below)
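
A purely illustrative sketch of that switch (the mode constants and helper functions are hypothetical; only read_int is a real pyspark helper):

    from pyspark.serializers import read_int

    RDD_MODE, UDF_MODE = 0, 1  # hypothetical values for the mode flag

    def main(infile, outfile):
        mode = read_int(infile)  # first int on the stream selects the mode
        if mode == RDD_MODE:
            run_streaming_rdd_loop(infile, outfile)    # hypothetical helper
        else:
            run_synchronous_udf_loop(infile, outfile)  # hypothetical helper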

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/justinuang/spark feature/pyspark_udf

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/8662.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #8662


commit af5254b0fd4a11696f248d148c650f157496af6e
Author: Justin Uang <ju...@palantir.com>
Date:   2015-09-08T04:23:14Z

[SPARK-8632] [SQL] [PYSPARK] Poor Python UDF performance because of RDD 
caching

- I wanted to reuse most of the logic from PythonRDD, so I pulled out
  two methods, writeHeaderToStream and readPythonProcessSocket
- The worker.py now has a switch where it reads an int that tells it either
  to go into normal pyspark RDD mode, which is meant for a streaming
  two-thread workflow, or into pyspark UDF mode, which is meant to be called
  synchronously







[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-08 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8662#issuecomment-138758861
  
@davies @JoshRosen @rxin 





[GitHub] spark pull request: [SPARK-10447][WIP][PYSPARK] upgrade pyspark to...

2015-09-07 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8615#issuecomment-138326562
  
I think we are missing some of the references to 0.8.2.1

git grep py4j-

LICENSE:For Py4J (python/lib/py4j-0.8.2.1-src.zip)
bin/pyspark:export PYTHONPATH="$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
bin/pyspark2.cmd:set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.9-src.zip;%PYTHONPATH%
core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala:  pythonPath += Seq(sparkHome, "python", "lib", "py4j-0.9-src.zip").mkString(File.separator)
core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala:    thread.setName("py4j-gateway-init")
python/docs/Makefile:export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.9-src.zip)
sbin/spark-config.sh:export PYTHONPATH="$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"
yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:    val py4jFile = new File(pyLibPath, "py4j-0.9-src.zip")
yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:      "py4j-0.9-src.zip not found; cannot run pyspark application in YARN mode.")

yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:    s"$sparkHome/python/lib/py4j-0.9-src.zip",






[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r37563567
  
--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,31 @@
   Finer-grained cache persistence levels.
 
 
+import os
+import sys
+
+import xml.etree.ElementTree as ET
+
+if (os.environ.get("SPARK_HOME", "not found") == "not found"):
--- End diff --

I agree with @alope107. In addition, if people are using spark-submit, then 
this isn't necessary, right? spark-submit sets up SPARK_HOME automatically.

Are people launching python apps frequently without using spark-submit?





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r37524370
  
--- Diff: python/setup.py ---
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+
+from setuptools import setup
+
+exec(compile(open("pyspark/pyspark_version.py").read(),
+   "pyspark/pyspark_version.py", 'exec'))
+VERSION = __version__
+
+setup(name='pyspark',
+version=VERSION,
+description='Apache Spark Python API',
+author='Spark Developers',
+author_email='d...@spark.apache.org',
+url='https://github.com/apache/spark/tree/master/python',
+packages=['pyspark', 'pyspark.mllib', 'pyspark.ml', 'pyspark.sql', 'pyspark.streaming'],
+data_files=[('pyspark', ['pyspark/pyspark_version.py'])],
--- End diff --

Why do we need to treat pyspark_version.py differently and have it under 
data_files?





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r37524804
  
--- Diff: python/pyspark/pyspark_version.py ---
@@ -0,0 +1,17 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+__version__ = '1.5.0'
--- End diff --

An alternative, but trickier, idea would be to have mvn's pom.xml version 
be the authoritative one, and have the build process add or modify that file to 
match the version (maybe using mvn resource filtering?). This would break being 
able to just pip install -e python in development mode, since people would have 
to remember to run the mvn command to sync the file over, but at least there is 
no risk of them going out of sync in the build.
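As a rough sketch of what that sync step could look like (the file names and the pom.xml lookup are assumptions for illustration, not anything Spark currently does):

import xml.etree.ElementTree as ET

POM_NS = "{http://maven.apache.org/POM/4.0.0}"

def sync_version_from_pom(pom_path="pom.xml", out_path="pyspark/pyspark_version.py"):
    # Read <version> from the root of pom.xml and regenerate the version module.
    root = ET.parse(pom_path).getroot()
    version = root.find(POM_NS + "version").text
    with open(out_path, "w") as f:
        f.write("__version__ = '%s'\n" % version)
    return version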





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/8318#issuecomment-132998065
  
@holdenk , thanks for working on this! Do we have plans to set up PyPI 
publishing?





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r37570377
  
--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,31 @@
   Finer-grained cache persistence levels.
 
 
+import os
+import sys
+
+import xml.etree.ElementTree as ET
+
+if (os.environ.get("SPARK_HOME", "not found") == "not found"):
--- End diff --

I don't really understand this part of the code, so it would be nice to get 
some core devs to chime in, but it looks like

./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

does a lot of logic that seems important, especially when deploying against 
YARN.





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r37570006
  
--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,31 @@
   Finer-grained cache persistence levels.
 
 
+import os
+import sys
+
+import xml.etree.ElementTree as ET
+
+if (os.environ.get("SPARK_HOME", "not found") == "not found"):
--- End diff --

That's possible via the following:

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook' spark-1.4.0-bin-hadoop2.4/bin/pyspark

Not completely discoverable, but it works =)





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r37570574
  
--- Diff: python/pyspark/pyspark_version.py ---
@@ -0,0 +1,17 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+__version__ = '1.5.0'
--- End diff --

We still need to build an sdist and wheel, so we can just make sure that 
whatever process we use adds that file in. Not sure if it's really worth the 
complexity at this moment, but my team does something internally where our 
python and java code both get semantic versions based off of the latest tag and 
the git hash.
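For example, a scheme of that sort might be sketched as follows (the exact version format here is invented for illustration, not what Spark or any particular team uses):

import subprocess

def version_from_git():
    # e.g. "v1.5.0-12-g78b34b4" -> "1.5.0.dev12+g78b34b4"; the scheme is made up.
    described = subprocess.check_output(
        ["git", "describe", "--tags", "--long"]).decode().strip()
    tag, commits_since, sha = described.rsplit("-", 2)
    tag = tag.lstrip("v")
    if commits_since == "0":
        return tag
    return "%s.dev%s+%s" % (tag, commits_since, sha)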





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-19 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r37459103
  
--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,33 @@
   Finer-grained cache persistence levels.
 
 
+import os
+import sys
+
+import xml.etree.ElementTree as ET
+
+if (os.environ.get("SPARK_HOME", "not found") == "not found"):
+    raise ImportError("Environment variable SPARK_HOME is undefined.")
+
+spark_home = os.environ['SPARK_HOME']
+pom_xml_file_path = spark_home + '/pom.xml'
+
+try:
+    tree = ET.parse(pom_xml_file_path)
+    root = tree.getroot()
+    version_tag = root[4].text
+    snapshot_version = version_tag[:5]
+except:
+    raise ImportError("Could not read the spark version, because pom.xml file" +
+                      " is not found in SPARK_HOME(%s) directory." % (spark_home))
+
+from pyspark.pyspark_version import __version__
+if (snapshot_version != __version__):
+    raise ImportError("Incompatible version of Spark(%s) and PySpark(%s)." %
+                      (snapshot_version, __version__))
+
+sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python/lib/py4j-0.8.1-src.zip"))
--- End diff --

We don't need this anymore, presumably if they pip installed the package, 
then py4j will already be installed in site-packages.
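That is, something along these lines in setup.py would be enough (the pinned py4j version and the trimmed-down package list are illustrative only):

from setuptools import setup

setup(
    name='pyspark',
    version='1.5.0',
    packages=['pyspark'],
    # Declaring py4j here lets pip resolve it into site-packages, so pyspark no
    # longer needs to push a bundled py4j zip onto sys.path at import time.
    install_requires=['py4j==0.8.2.1'],
)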





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-19 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r37458924
  
--- Diff: python/setup.py ---
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+
+from setuptools import setup
+
+exec(compile(open("pyspark/pyspark_version.py").read(),
+   "pyspark/pyspark_version.py", 'exec'))
+VERSION = __version__
+
+setup(name = 'pyspark',
+   version = VERSION,
--- End diff --

why are we using three spaces for indentation?





[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-19 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/8318#discussion_r37459213
  
--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,33 @@
   Finer-grained cache persistence levels.
 
 
+import os
+import sys
+
+import xml.etree.ElementTree as ET
+
+if (os.environ.get("SPARK_HOME", "not found") == "not found"):
+    raise ImportError("Environment variable SPARK_HOME is undefined.")
+
+spark_home = os.environ['SPARK_HOME']
+pom_xml_file_path = spark_home + '/pom.xml'
--- End diff --

os.path.join





[GitHub] spark pull request: [SPARK-7899][PYSPARK] Fix Python 3 pyspark/sql...

2015-05-27 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6439#discussion_r31166415
  
--- Diff: python/run-tests ---
@@ -57,54 +57,54 @@ function run_test() {
 
 function run_core_tests() {
 echo "Run core tests ..."
-run_test pyspark/rdd.py
-run_test pyspark/context.py
-run_test pyspark/conf.py
-PYSPARK_DOC_TEST=1 run_test pyspark/broadcast.py
--- End diff --

Why did you remove the PYSPARK_DOC_TEST env var?





[GitHub] spark pull request: [SPARK-7899][PYSPARK] Fix Python 3 pyspark/sql...

2015-05-27 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6439#discussion_r31169697
  
--- Diff: python/run-tests ---
@@ -57,54 +57,54 @@ function run_test() {
 
 function run_core_tests() {
 echo "Run core tests ..."
-run_test pyspark/rdd.py
-run_test pyspark/context.py
-run_test pyspark/conf.py
-PYSPARK_DOC_TEST=1 run_test pyspark/broadcast.py
--- End diff --

alright, sounds good!





[GitHub] spark pull request: [SPARK-7329][MLLIB] simplify ParamGridBuilder ...

2015-05-03 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/5873#issuecomment-98541714
  
You can consider using set equality for the test, but other than that, it 
looks good! Thanks!
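For instance, an order-insensitive check could look roughly like this, assuming the param maps' keys and values are hashable (which the real Param objects may not guarantee without __hash__/__eq__):

def same_param_maps(output, expected):
    # Compare the two lists of param maps ignoring order.
    return ({frozenset(m.items()) for m in output}
            == {frozenset(m.items()) for m in expected})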





[GitHub] spark pull request: [SPARK-7022][PySpark][ML] Add ML.Tuning.ParamG...

2015-05-02 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/5601#issuecomment-98386185
  
Yea, you should try rebasing. It looks like you're not the only one running 
into this.


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31639/console





[GitHub] spark pull request: [SPARK-7022][PySpark][ML] Add ML.Tuning.ParamG...

2015-04-30 Thread justinuang
Github user justinuang commented on a diff in the pull request:

https://github.com/apache/spark/pull/5601#discussion_r29482941
  
--- Diff: python/pyspark/ml/tuning.py ---
@@ -0,0 +1,94 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+__all__ = ['ParamGridBuilder']
+
+
+class ParamGridBuilder(object):
+    """
+    Builder for a param grid used in grid search-based model selection.
+
+    >>> from classification import LogisticRegression
+    >>> lr = LogisticRegression()
+    >>> output = ParamGridBuilder().baseOn({lr.labelCol: 'l'}) \
+            .baseOn([lr.predictionCol, 'p']) \
+            .addGrid(lr.regParam, [1.0, 2.0, 3.0]) \
+            .addGrid(lr.maxIter, [1, 5]) \
+            .addGrid(lr.featuresCol, ['f']) \
+            .build()
+    >>> expected = [ \
+        {lr.regParam: 1.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, \
+        {lr.regParam: 2.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, \
+        {lr.regParam: 3.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, \
+        {lr.regParam: 1.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}, \
+        {lr.regParam: 2.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}, \
+        {lr.regParam: 3.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}]
+    >>> fail_count = 0
+    >>> for e in expected:
+    ...     if e not in output:
+    ...         fail_count += 1
+    >>> if len(expected) != len(output):
+    ...     fail_count += 1
+    >>> fail_count
+    0
+    """
+
+    def __init__(self):
+        self._param_grid = {}
+
+    def addGrid(self, param, values):
+        """
+        Sets the given parameters in this grid to fixed values.
+        """
+        self._param_grid[param] = values
+
+        return self
+
+    def baseOn(self, *args):
+        """
+        Sets the given parameters in this grid to fixed values.
+        Accepts either a parameter dictionary or a list of (parameter, value) pairs.
+        """
+        if isinstance(args[0], dict):
+            self.baseOn(*args[0].items())
+        else:
+            for (param, value) in args:
+                self.addGrid(param, [value])
+
+        return self
+
+    def build(self):
+        """
+        Builds and returns all combinations of parameters specified
+        by the param grid.
+        """
+        param_maps = [{}]
+        for (param, values) in self._param_grid.items():
--- End diff --

Consider doing this:

[dict(zip(self._param_grid.keys(), prod)) for prod in itertools.product(*self._param_grid.values())]

to avoid the overhead of lots of dictionary copies.
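Spelled out as a self-contained sketch, with the import it needs:

import itertools

def build(param_grid):
    # Expand {param: [v1, v2, ...]} into one dict per combination in a single
    # pass, instead of copying partially-built dicts for every value.
    keys = list(param_grid.keys())
    return [dict(zip(keys, values))
            for values in itertools.product(*(param_grid[k] for k in keys))]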





[GitHub] spark pull request: [SPARK-2213][SQL] Sort Merge Join

2015-02-15 Thread justinuang
Github user justinuang commented on the pull request:

https://github.com/apache/spark/pull/3173#issuecomment-74416823
  
Hi, this looks great! Is there a reason why sort-based join is not in spark 
core, only in spark SQL?
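For context, the core idea of a sort-merge join over already-sorted inputs, sketched in plain Python with nothing Spark-specific:

def sort_merge_join(left, right):
    # Join two lists of (key, value) pairs, each already sorted by key.
    # Yields (key, left_value, right_value) for every matching pair.
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, emit the cross product,
            # then advance past both runs.
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for li in range(i, i_end):
                for rj in range(j, j_end):
                    yield lk, left[li][1], right[rj][1]
            i, j = i_end, j_end

# e.g. list(sort_merge_join([(1, 'a'), (2, 'b')], [(2, 'x'), (3, 'y')]))
# -> [(2, 'b', 'x')]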

