[GitHub] spark pull request #23179: Fix the rat excludes on .policy.yml
Github user justinuang closed the pull request at: https://github.com/apache/spark/pull/23179
[GitHub] spark pull request #23179: Fix the rat excludes on .policy.yml
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/23179

Fix the rat excludes on .policy.yml

## What changes were proposed in this pull request?

Fix the rat excludes on .policy.yml

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/palantir/spark juang/fix-rat-policy-yml

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23179.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23179

commit 78b34b40a7e034dd641418b804e6e2606b216ba4
Author: Robert Kruszewski  Date: 2018-04-01T12:41:47Z
Fix publish after k8s rebase (#347)

commit 2788441fb6f945d1d945caa4675c97b8b2f5a472
Author: Patrick Woody  Date: 2018-04-02T17:54:15Z
Revert "transformexpression with origin" (#350)

commit 4cc4dee11883bf1954181ec808f0f57a9ee55c55
Author: Patrick Woody  Date: 2018-04-02T17:54:25Z
Add reminder for upstream ticket/PR to github template (#351)

commit 078066bdc9a77dd0c241fae544806d043cb0b167
Author: Robert Kruszewski  Date: 2018-03-31T14:25:56Z
resolve conflicts

commit fe35b58a9e8b1bdde111b542371123907686ba97
Author: mcheah  Date: 2018-04-02T21:37:23Z
Empty commit to clear Circle cache.

commit 1264fb5908d3eab2cccfaf9b22b6975c7afd20d4
Author: mcheah  Date: 2018-04-03T00:29:36Z
Empty commit to tag 2.4.0-palantir.12 and trigger publish.

commit b7410ba819d4e3e37f59e8f5df0d47e78c92a362
Author: Robert Kruszewski  Date: 2018-04-03T14:02:21Z
Fix circle checkout for tags (#352)

commit 6da0b8266906f3e1c804627c9a009a18ed102874
Author: Robert Kruszewski  Date: 2018-04-03T17:32:23Z
Merge pull request #346 from palantir/rk/upstream: Update to upstream

commit 7b12f6367dbf5d5b1da06aa0cf204658de2ebbe7
Author: Bryan Cutler  Date: 2018-04-02T16:53:37Z
[SPARK-15009][PYTHON][FOLLOWUP] Add default param checks for CountVectorizerModel
Adding a test for default params for `CountVectorizerModel` constructed from vocabulary. This required that the param `maxDF` be added, which was done in SPARK-23615. Tested with an explicit test for CountVectorizerModel in DefaultValuesTests. Closes #20942 from BryanCutler/pyspark-CountVectorizerModel-default-param-test-SPARK-15009.

commit 60e1bd62d72cc5fadbfc96ad6b1f3b84bd36335e
Author: David Vogelbacher  Date: 2018-04-02T19:00:37Z
[SPARK-23825][K8S] Requesting memory + memory overhead for pod memory
Kubernetes driver and executor pods should request `memory + memoryOverhead` as their resources instead of just `memory`; see https://issues.apache.org/jira/browse/SPARK-23825. Existing unit tests were adapted. Closes #20943 from dvogelbacher/spark-23825.

commit 08f64b4048072a97a92dca94ded78f2de46525f2
Author: Yinan Li  Date: 2018-04-02T19:20:55Z
[SPARK-23285][K8S] Add a config property for specifying physical executor cores
As mentioned in SPARK-23285, this introduces a new configuration property `spark.kubernetes.executor.cores` for specifying the physical CPU cores requested for each executor pod. This avoids changing the semantics of `spark.executor.cores` and `spark.task.cpus` and their role in task scheduling, task parallelism, dynamic resource allocation, etc. The new property only determines the physical CPU cores available to an executor; an executor can still run multiple tasks simultaneously by using appropriate values for `spark.executor.cores` and `spark.task.cpus`. Tested with unit tests. felixcheung srowen jiangxb1987 jerryshao mccheah foxish Closes #20553 from liyinan926/master.

commit 8a307d1b4db5ed9e6634142002139945ff3a79bd
Author: Kazuaki Ishizaki  Date: 2018-04-02T19:48:44Z
[SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder classes
This implements the following cleanups related to the `UnsafeWriter` class:
- Remove code duplication between `UnsafeRowWriter` and `UnsafeArrayWriter`
- Make the `BufferHolder` class internal by delegating its accessor methods to `UnsafeWriter`
- Replace `UnsafeRow.setTotalSize(...)` with `UnsafeRowWriter.setTotalSize()`
Tested by existing UTs. Closes #20850 from kiszk/SPARK-23713.
[GitHub] spark issue #20877: [SPARK-23765][SQL] Supports custom line separator for js...
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/20877

Sorry, I won't be able to take it over!
[GitHub] spark pull request #23051: [AE2.3-02][SPARK-23128] Add QueryStage and the fr...
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/23051

[AE2.3-02][SPARK-23128] Add QueryStage and the framework for adaptive execution (auto setting the number of reducers)

## What changes were proposed in this pull request?

Add QueryStage and the framework for adaptive execution. The main benefit from this PR is that the reducer count is set automatically based on a target file size. We got this PR (branch ae-02) from https://github.com/Intel-bigdata/spark-adaptive/pull/43, which is based on branch ae-01, but I decided not to merge those in because they require invasive changes to spark-core and a protocol change to the external shuffle service. This PR should be relatively safe to merge because most of the code changes are to adaptive query execution, which isn't turned on by default. (A hedged configuration sketch follows the commit list below.)

## How was this patch tested?

Unit tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/palantir/spark juang/cherry-pick-ae-02

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23051.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23051

commit 7827060679786afc02e7e6e3d778c2fcb2c13db9
Author: Dan Sanduleac  Date: 2018-03-24T00:40:20Z
v1-maven-build-with-version should cache by revision not buildNum since it needs to be common between different jobs

commit 06290c19132f0be5a3e5f6a32b3c4458beadc394
Author: Dan Sănduleac  Date: 2018-03-24T20:19:19Z
Ignore flaky scala tests as well as hive tests (#335)

commit 2bc8fafe45711a64a56f4d031e75dc609c5314e6
Author: Dan Sănduleac  Date: 2018-03-25T18:16:16Z
Treat classnames with only skipped tests as having taken 0 time (#336)

commit 2c8c96be2b9719cc998d113d5c7cabf6c51a2403
Author: Robert Kruszewski  Date: 2018-03-26T09:38:09Z
Force okhttp logging interceptor (#337)

commit 4c99e6354198ec11b46ffc38014fdab6b55dcffd
Author: Dan Sănduleac  Date: 2018-03-26T11:13:07Z
Handle nulls in k8s responses correctly (#334)

commit cf31e8342e5c0b771c2b5dcb3b9a86540adf1f92
Author: Dan Sănduleac  Date: 2018-03-26T12:18:15Z
Store/restore ~/.m2 after versioned build (since pom.xml changes) (#339)

commit de656d21c658bad0b7f873e9da541b7cb303c5fa
Author: Dan Sănduleac  Date: 2018-03-27T00:12:43Z
build-sbt directly, and don't restore build-maven where not necessary (#340)

commit d531f534734226dc65be04ec9e9714792afa983c
Author: Dan Sanduleac  Date: 2018-03-27T11:59:34Z
empty commit

commit 1aeaf27ae65ff3f625235b48a3a0e75d0a3fbb11
Author: Dan Sănduleac  Date: 2018-03-28T12:46:12Z
Faster deploy by parallelizing maven and skipping unnecessary second 'mvn package' (#342)

commit 44a14cdafe247f7094d7571e00cfd8e85bf0e397
Author: Jeremy Liu  Date: 2018-03-28T19:54:33Z
Move RBackend to member variable

commit 5d88c9527b602728ccaf0a48d0106b2729d46a2a
Author: Dan Sănduleac  Date: 2018-03-29T14:02:06Z
[SPARK-23795][LAUNCHER] Make AbstractLauncher#self() protected (#341)

commit 41415d4865b625da8516739e0e63acdb1137a3b0
Author: mcheah  Date: 2018-03-07T01:59:03Z
Rebase to upstream's version of Kubernetes support.

commit 4ac24329b53e51cdc3990f634ed7a2249c8423e3
Author: mcheah  Date: 2018-03-12T20:46:21Z
Replace manifest

commit 6d23bae6fcccb483128c9d70438653b0c239c8c6
Author: Ilan Filonenko  Date: 2018-03-19T18:29:56Z
[SPARK-22839][K8S] Remove the use of init-container for downloading remote dependencies
Removal of the init-container for downloading remote dependencies. Built off of the work done by vanzin in an attempt to refactor driver/executor configuration, elaborated in [this](https://issues.apache.org/jira/browse/SPARK-22839) ticket. This patch was tested with unit and integration tests. Closes #20669 from ifilonenko/remove-init-container.

commit 1d60e389e6b84b158a91e1a9cdeeb124949c4d07
Author: mcheah  Date: 2018-03-29T18:51:02Z
Match entrypoint as well

commit 5774deb0022235455e84387a304fa6823f939f74
Author: amenck  Date: 2018-03-29T19:56:56Z
Merge pull request #343 from jeremyjliu/jl/expose-r-backend: Move RBackend to member variable

commit 4e7f4f09512a5a30f72ed679fad594f87b12db75
Author: Dan Sănduleac  Date: 2018-03-29T20:29:52Z
Properly remove hive from modules (#338)

commit 95cf5f7523f60cfdd7fdc9d76dfd2668c287785c
Author: mccheah  Date: 2018-03-29T22:57:40Z
Merge pull request #324 from palantir/use-upstream-kubernetes: Rebase to upstream's version of Kubernetes support.

commit a7383de811ea01f60aeb642b6c192aecef14ff6a
Author: Robert Kruszewski  Date: 2018-03-30T02:30:51Z
mapexpressions preserving origin
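The PR description above says the reducer count is derived from a target size once adaptive execution is enabled. A minimal pyspark sketch of how that would be exercised; the Spark 2.x-era property names `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.targetPostShuffleInputSize` and the 64 MB target value are assumptions here, not taken from this PR:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("adaptive-execution-sketch")
    # Turn on the adaptive query execution framework described in the PR.
    .config("spark.sql.adaptive.enabled", "true")
    # The post-shuffle reducer count is derived from this per-reducer target
    # input size (64 MB here) rather than a fixed partition count.
    .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
            str(64 * 1024 * 1024))
    .getOrCreate()
)

df = spark.range(0, 1000000)
# The number of post-shuffle partitions for this aggregation is chosen at
# runtime instead of taking spark.sql.shuffle.partitions verbatim.
counts = df.groupBy((df.id % 100).alias("bucket")).count()
counts.collect()
```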
[GitHub] spark pull request #23051: [AE2.3-02][SPARK-23128] Add QueryStage and the fr...
Github user justinuang closed the pull request at: https://github.com/apache/spark/pull/23051
[GitHub] spark pull request #22968: Merge upstream
Github user justinuang closed the pull request at: https://github.com/apache/spark/pull/22968
[GitHub] spark pull request #22968: Merge upstream
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/22968

Merge upstream

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/palantir/spark juang/merge-easy-upstream

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22968.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22968

commit 349a63c12acc919dbd25cc54fe75230b6f16224a
Author: Dan Sănduleac  Date: 2018-03-23T19:41:50Z
Improve bin packing and reduce scala parallelism (#333)

commit 7827060679786afc02e7e6e3d778c2fcb2c13db9
Author: Dan Sanduleac  Date: 2018-03-24T00:40:20Z
v1-maven-build-with-version should cache by revision not buildNum since it needs to be common between different jobs

commit 06290c19132f0be5a3e5f6a32b3c4458beadc394
Author: Dan Sănduleac  Date: 2018-03-24T20:19:19Z
Ignore flaky scala tests as well as hive tests (#335)

commit 2bc8fafe45711a64a56f4d031e75dc609c5314e6
Author: Dan Sănduleac  Date: 2018-03-25T18:16:16Z
Treat classnames with only skipped tests as having taken 0 time (#336)

commit 2c8c96be2b9719cc998d113d5c7cabf6c51a2403
Author: Robert Kruszewski  Date: 2018-03-26T09:38:09Z
Force okhttp logging interceptor (#337)

commit 4c99e6354198ec11b46ffc38014fdab6b55dcffd
Author: Dan Sănduleac  Date: 2018-03-26T11:13:07Z
Handle nulls in k8s responses correctly (#334)

commit cf31e8342e5c0b771c2b5dcb3b9a86540adf1f92
Author: Dan Sănduleac  Date: 2018-03-26T12:18:15Z
Store/restore ~/.m2 after versioned build (since pom.xml changes) (#339)

commit de656d21c658bad0b7f873e9da541b7cb303c5fa
Author: Dan Sănduleac  Date: 2018-03-27T00:12:43Z
build-sbt directly, and don't restore build-maven where not necessary (#340)

commit d531f534734226dc65be04ec9e9714792afa983c
Author: Dan Sanduleac  Date: 2018-03-27T11:59:34Z
empty commit

commit 1aeaf27ae65ff3f625235b48a3a0e75d0a3fbb11
Author: Dan Sănduleac  Date: 2018-03-28T12:46:12Z
Faster deploy by parallelizing maven and skipping unnecessary second 'mvn package' (#342)

commit 44a14cdafe247f7094d7571e00cfd8e85bf0e397
Author: Jeremy Liu  Date: 2018-03-28T19:54:33Z
Move RBackend to member variable

commit 5d88c9527b602728ccaf0a48d0106b2729d46a2a
Author: Dan Sănduleac  Date: 2018-03-29T14:02:06Z
[SPARK-23795][LAUNCHER] Make AbstractLauncher#self() protected (#341)

commit 41415d4865b625da8516739e0e63acdb1137a3b0
Author: mcheah  Date: 2018-03-07T01:59:03Z
Rebase to upstream's version of Kubernetes support.

commit 4ac24329b53e51cdc3990f634ed7a2249c8423e3
Author: mcheah  Date: 2018-03-12T20:46:21Z
Replace manifest

commit 6d23bae6fcccb483128c9d70438653b0c239c8c6
Author: Ilan Filonenko  Date: 2018-03-19T18:29:56Z
[SPARK-22839][K8S] Remove the use of init-container for downloading remote dependencies
Removal of the init-container for downloading remote dependencies. Built off of the work done by vanzin in an attempt to refactor driver/executor configuration, elaborated in [this](https://issues.apache.org/jira/browse/SPARK-22839) ticket. This patch was tested with unit and integration tests. Closes #20669 from ifilonenko/remove-init-container.

commit 1d60e389e6b84b158a91e1a9cdeeb124949c4d07
Author: mcheah  Date: 2018-03-29T18:51:02Z
Match entrypoint as well

commit 5774deb0022235455e84387a304fa6823f939f74
Author: amenck  Date: 2018-03-29T19:56:56Z
Merge pull request #343 from jeremyjliu/jl/expose-r-backend: Move RBackend to member variable

commit 4e7f4f09512a5a30f72ed679fad594f87b12db75
Author: Dan Sănduleac  Date: 2018-03-29T20:29:52Z
Properly remove hive from modules (#338)

commit 95cf5f7523f60cfdd7fdc9d76dfd2668c287785c
Author: mccheah  Date: 2018-03-29T22:57:40Z
Merge pull request #324 from palantir/use-upstream-kubernetes: Rebase to upstream's version of Kubernetes support.

commit a7383de811ea01f60aeb642b6c192aecef14ff6a
Author: Robert Kruszewski  Date: 2018-03-30T02:30:51Z
mapexpressions preserving origin

commit 479cf4633bc415b33fd80fc969be885b0decc5cb
Author: Robert Kruszewski  Date: 2018-03-30T02:34:47Z
correct place

commit bb10a57784784fa0f661540aa5cf3acb4dad7651
Author: mccheah  Date: 2018-03-30T19:13:37Z
Merge pull request
[GitHub] spark pull request #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/22503#discussion_r226386187

--- Diff: sql/core/src/test/resources/test-data/cars-crlf.csv ---
@@ -0,0 +1,7 @@
+
+year,make,model,comment,blank
+"2012","Tesla","S","No comment",
+
+1997,Ford,E350,"Go get one now they are going fast",
+2015,Chevy,Volt
+
--- End diff --

Yea, if you open it with `vi -b sql/core/src/test/resources/test-data/cars-crlf.csv`, you can see the `^M` characters, which according to [this](https://stackoverflow.com/questions/3860519/see-line-breaks-and-carriage-returns-in-editor) represent a CRLF.
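For readers without vi handy, a small standalone check (not part of the PR) that counts the line-ending styles in the test file; the path assumes you run it from a Spark checkout containing the file added by this diff:

```python
# Read the file in binary mode so Python does no newline translation.
with open("sql/core/src/test/resources/test-data/cars-crlf.csv", "rb") as f:
    data = f.read()

crlf = data.count(b"\r\n")
print("CRLF (\\r\\n) line endings:", crlf)
# Bare LFs are total LFs minus those that were part of a CRLF pair.
print("bare LF (\\n) line endings:", data.count(b"\n") - crlf)
```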
[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/22503

done!
[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/22503

So Hadoop's LineReader handles CR, LF, and CRLF: https://github.com/apache/hadoop/blob/f90c64e6242facf38c2baedeeda42e4a8293e642/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L36

Univocity also handles CR, LF, and CRLF (the logic is a bit convoluted, but they have the same behavior in that if they see a CR, they will look for an LF next): https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/input/LineSeparatorDetector.java

I do agree we should expose a `setLineSeparator` option, but regardless of that, the default handling of CR, LF, and CRLF should be the same between single line and multiline mode.
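A minimal sketch of the scenario under discussion, assuming a pyspark shell where `spark` is already defined and a CRLF-terminated CSV at the test-data path from this PR:

```python
df = (
    spark.read
    .option("header", "true")
    # In multiline mode, record splitting is done by Univocity instead of
    # Hadoop's LineReader, which is where the CRLF handling diverged.
    .option("multiLine", "true")
    .csv("sql/core/src/test/resources/test-data/cars-crlf.csv")
)
# Before the fix, CRLF endings break record parsing in multiline mode;
# after it, this matches the single line mode result.
df.show()
```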
[GitHub] spark pull request #22680: [SPARK-25493][SQL] Use auto-detection for CRLF in...
Github user justinuang closed the pull request at: https://github.com/apache/spark/pull/22680
[GitHub] spark pull request #22680: [SPARK-25493][SQL] Use auto-detection for CRLF in...
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/22680

[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode

## Upstream SPARK-X ticket and PR link (if not applicable, explain)

Went through review, but upstream is not merging. Discussed offline with @vinooganesh that we will merge here first. https://github.com/apache/spark/pull/22503/files

## What changes were proposed in this pull request?

CSVs with Windows-style CRLF ('\r\n') line endings don't work in multiline mode. They work fine in single line mode because the line separation is done by Hadoop, which can handle all the different types of line separators. This PR fixes that by enabling Univocity's line separator detection in multiline mode, which detects '\r\n', '\r', or '\n' automatically, as Hadoop does in single line mode.

## How was this patch tested?

Unit test with a file with CRLF line endings.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/palantir/spark palantirspark/fix-clrf-multiline

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22680.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22680

commit 8bc932a49a76d482510242a7d040fdf7e888c895
Author: Dan Sanduleac  Date: 2018-03-18T15:15:59Z
Delete commented out code that's no longer applicable

commit cd4afe2e13a7140023cb50a8b9be798bb7b86e61
Author: Dan Sanduleac  Date: 2018-03-18T15:16:22Z
Bump build-sbt cache to v1-build-sbt.. think old cache causes the OOM somehow

commit 047e65a0b916a756d2dc7b8106acfd94f530f07a
Author: Dan Sanduleac  Date: 2018-03-18T23:31:39Z
Move all-project exclusion and global setting to DefaultSparkPlugin, nuke excludeDependencies

commit 53d6f5aa9f8627388a732ae6dd3ced0b6192fcf3
Author: Dan Sanduleac  Date: 2018-03-18T23:32:26Z
Make enable() accept any DslEntry allowing enablePlugins etc not just Seq[Setting[_]]

commit e154185f7e4c94da78375de4f590f2fd20610e6b
Author: Dan Sanduleac  Date: 2018-03-19T00:56:57Z
Exclude com.sun.jersey crap but only from copyJarsProjects (assembly, examples)

commit ecd06e96825220bdf8dbec3c5fa8725aae89eb7c
Author: Dan Sanduleac  Date: 2018-03-19T00:59:03Z
revert unnecessary exclusions in hadoop-palantir/pom.xml

commit bde3a2af44e9def680615aa47d82bc4decf43a18
Author: Dan Sanduleac  Date: 2018-03-19T01:04:09Z
ensure we update sbt before getting externalDependencyClasspath, prevent badly cached resolution results!!

commit cae7f8cc4381d01074815d9c9c29bf115b855cac
Author: Dan Sanduleac  Date: 2018-03-19T12:25:13Z
delete unnecesary sbt cache restore in deploy

commit f4af82f99e0f14843c762a7f98d0f952cfb57d52
Author: Dan Sanduleac  Date: 2018-03-19T12:29:44Z
make home-sbt cache depend on project update inputs

commit 35bebba53c5c9c7f9427334bab8a47bb9129be1c
Author: Dan Sanduleac  Date: 2018-03-19T12:48:59Z
python / R tests also don't use SBT or maven

commit bec5c1eb9cd4489662b49770c0f77d0202987479
Author: Dan Sanduleac  Date: 2018-03-19T12:51:09Z
fix v2-home-sbt cache, I guess it doesn't need escaping $

commit 82197f8198b44a3beea0430274e43e2d2b7509a5
Author: Dan Sanduleac  Date: 2018-03-19T15:27:05Z
Log which tests (per project) didn't have timings

commit 00f28de1684632afcc43dd3e21781c584915509d
Author: Dan Sanduleac  Date: 2018-03-19T15:36:46Z
Log which tests (per project) didn't have timings

commit 8b7fd7ff2be699e8aebcb014e1033e12e72b585e
Author: Dan Sanduleac  Date: 2018-03-19T16:43:47Z
parallelize python tests, and feed the right versions into packaging tests (run-pip-tests)

commit 886f496483eaf673fda58a7fde932dfedd514bcd
Author: Dan Sanduleac  Date: 2018-03-19T17:10:47Z
I guess we need to set up logging too

commit 0047e3165c973b11ea6f2e63c90fcb1888e93c3a
Author: Dan Sanduleac  Date: 2018-03-19T18:26:10Z
disable cached resolution

commit 0e749a60656c7574529f5aee4d8785697f5af668
Author: Dan Sanduleac  Date: 2018-03-19T23:48:43Z
run all python tests before giving up, don't stop early

commit 302b8351c44882c84ce1eb5fd6fb0454b4e1c276
Author: Dan Sanduleac  Date: 2018-03-20T00:20:29Z
Explicitly calling `update` seems to be very slow without cached resolution. If we just call externalDependencyClasspath though that might be enough

commit 636bef690d799d3e60dcabfaef783d2636f84878
Author: Dan Sanduleac  Date: 2018-03-22T01:15:28Z
downgrade yer numpys. newer numpy breaks a bunch of tests because of different array formatting & more

commit 7aaf1c6d3abdc3093f73e5f25b8f9bbc5d60b756
Author: Dan Sanduleac  Date: 2018-03-22T01:25:50Z
Remove commented
[GitHub] spark pull request #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/22503#discussion_r222053706

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -212,6 +212,8 @@ class CSVOptions(
     settings.setEmptyValue(emptyValueInRead)
     settings.setMaxCharsPerColumn(maxCharsPerColumn)
     settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
+    settings.setLineSeparatorDetectionEnabled(multiLine)
--- End diff --

done
[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/22503

What does it take to get this to be merged in?
[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/22503

Sounds good, thanks guys =)
[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/22503

It looks like a flake? Can someone retrigger it? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96511/console
[GitHub] spark pull request #22503: [SPARK-25493] [SQL] Fix multiline crlf
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/22503

[SPARK-25493] [SQL] Fix multiline crlf

## What changes were proposed in this pull request?

CSVs with Windows-style CRLF (carriage return + line feed) line endings don't work in multiline mode. They work fine in single line mode because the line separation is done by Hadoop, which can handle all the different types of line separators. This fixes it by enabling Univocity's line separator detection.

## How was this patch tested?

Unit test with a file with CRLF line endings.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/justinuang/spark fix-clrf-multiline

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22503.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22503

commit 5ce9de9f789ce108f6afb65e38bab44acc77a4e8
Author: Justin Uang  Date: 2018-09-20T20:41:35Z
Fix multiline crlf
[GitHub] spark issue #19591: [SPARK-11035][core] Add in-process Spark app launcher.
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/19591

Really looking forward to this PR! For our use case, it will reduce our Spark launch times by ~4 seconds.
[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/15009

That would be incredible. Launching a new JVM and loading all of Hadoop adds about 4 seconds each time versus reusing the launcher JVM, which is really significant for us since we launch a lot of jobs and users have to wait on this.
[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/15009

@kishorvpatil this will be quite useful for us, to avoid the ~3s cost of spinning up a new JVM just for yarn-cluster mode.
[GitHub] spark pull request: [SPARK-9301] [SQL] Add collect_set aggregate f...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8592#issuecomment-152189895

@rxin I would really like to see this merged in as well. I agree with @nburoojy that collect_as_list and collect_as_set will always have issues if the size of one list/set gets too big, but that doesn't mean this isn't useful. I didn't quite follow the other parts about the API refactor, but if that isn't a huge issue, it would be nice to merge something like this in soon. If people are careful, this can be extremely useful, especially when it's impossible to write a UDAF that has a mergeValue and mergeCombiner function, or when you just want to restructure how the table is laid out. In addition, until pyspark gets UDAFs, this will be a good substitute for most cases. Right now, to get around this in pyspark, I'm using Hive's collect_list, but it's annoying because I have to register a temp table and use a SQL query instead of the DataFrame API.
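The temp-table workaround described in the last sentence, sketched against the Spark 1.5-era API; this assumes `sqlContext` is a HiveContext, since collect_list here is a Hive UDAF:

```python
df = sqlContext.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
# No DataFrame-API collect_list yet, so route through a temp table and SQL.
df.registerTempTable("kv")
grouped = sqlContext.sql(
    "SELECT key, collect_list(value) AS vals FROM kv GROUP BY key")
grouped.show()
```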
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r41221064

--- Diff: python/pyspark/pyspark_version.py ---
@@ -0,0 +1,17 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+__version__ = '1.5.0'
--- End diff --

How is the version number specified for the Scala side now?
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r41220284

--- Diff: python/setup.py ---
@@ -0,0 +1,18 @@
+#!/usr/bin/env python
+
+from setuptools import setup
+
+exec(compile(open("pyspark/pyspark_version.py").read(),
+             "pyspark/pyspark_version.py", 'exec'))
+VERSION = __version__
+
+setup(name='pyspark',
+      version=VERSION,
+      description='Apache Spark Python API',
+      author='Spark Developers',
+      author_email='d...@spark.apache.org',
+      url='https://github.com/apache/spark/tree/master/python',
+      packages=['pyspark', 'pyspark.mllib', 'pyspark.ml', 'pyspark.sql', 'pyspark.streaming'],
+      install_requires=['numpy>=1.7', 'py4j==0.8.2.1', 'pandas'],
--- End diff --

pyspark.sql does depend on pandas, right? toPandas()?
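The dependency in question, illustrated: `toPandas()` is the pyspark.sql entry point that needs pandas importable. This assumes a pyspark shell where `sqlContext` is already defined:

```python
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
# Raises ImportError at call time if pandas is not installed; this is the
# call the install_requires entry above exists for.
pdf = df.toPandas()
print(type(pdf))  # pandas.core.frame.DataFrame
```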
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang closed the pull request at: https://github.com/apache/spark/pull/8662
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-144187766

Thanks for the reminder!
[GitHub] spark pull request: [SPARK-10685] [SPARK-8632] [SQL] [PYSPARK] Pyt...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8833#issuecomment-142161491

lgtm! So this avoids deadlock by getting rid of the blocking queue (duh!) and then assumes the OS buffer will rate limit how much gets buffered on the writer side? Looking forward to getting this fix in.
[GitHub] spark pull request: [SPARK-10685] [SPARK-8632] [SQL] [PYSPARK] Pyt...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8833#discussion_r40048503

--- Diff: python/pyspark/sql/functions.py ---
@@ -1414,7 +1414,7 @@ def __init__(self, func, returnType, name=None):
     def _create_judf(self, name):
         f, returnType = self.func, self.returnType  # put them in closure `func`
         func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
-        ser = AutoBatchedSerializer(PickleSerializer())
+        ser = BatchedSerializer(PickleSerializer(), 100)
--- End diff --

Good point, I was still thinking about my first attempt, which involved a blocking queue.
[GitHub] spark pull request: [SPARK-10685] [SPARK-8632] [SQL] [PYSPARK] Pyt...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8833#discussion_r39933648

--- Diff: python/pyspark/sql/functions.py ---
@@ -1414,7 +1414,7 @@ def __init__(self, func, returnType, name=None):
     def _create_judf(self, name):
         f, returnType = self.func, self.returnType  # put them in closure `func`
         func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
-        ser = AutoBatchedSerializer(PickleSerializer())
+        ser = BatchedSerializer(PickleSerializer(), 100)
--- End diff --

Can we pull this out into a constant, do the same for the value on the Python side, and put a comment on each saying that they have to stay equal? It's very dangerous if this value goes out of sync.
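A minimal sketch of the suggestion, with a hypothetical constant name and module placement (the JVM-side counterpart is only referenced in the comment, not shown):

```python
from pyspark.serializers import BatchedSerializer, PickleSerializer

# NOTE: must stay equal to the matching batch-size constant used by the
# JVM-side writer; the two values describe opposite ends of the same stream.
UDF_BATCH_SIZE = 100

ser = BatchedSerializer(PickleSerializer(), UDF_BATCH_SIZE)
```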
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141232211

I'm not sure there is a solution that satisfies all the requirements. I can say that this approach addresses 1, 2, and 4 by design. Would you guys support a 1.6.0 UDF implementation that uses Thrift for the RPC and serialization? In general, I find the custom-rolled socket, serialization, and cleanup approach pretty scary. These are already solved problems, and then we could support multiple language bindings at the DataFrame level, where I think it's a lot easier to implement. We could even support broadcast variables by allowing language bindings to store bytes in the UDF that will be passed back to them. I don't think we need to support accumulators, right?
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141121225

@rxin what do you mean by local iterators =) I feel like I'm missing some context that you guys have
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141117878

The solution with the iterator wrapper was my first approach, the one I prototyped (http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html). It's dangerous because there is buffering at many levels, which can lead to a deadlock:

- NEW: the ForkingIterator LinkedBlockingDeque
- batching the rows before pickling them
- OS buffers on both sides
- pyspark.serializers.BatchedSerializer

We can avoid deadlock by being very disciplined. For example, we can have the ForkingIterator always check whether the LinkedBlockingDeque is full, and if so:

Java:
- flush the Java pickling buffer
- send a flush command to the Python process
- flush the OS buffer on the Java side

Python:
- flush BatchedSerializer
- os.flush()

I'm not sure this performance regression for one UDF is going to hit many people. For one, most upstreams are not a range() call, which doesn't have to go back to disk and deserialize. My personal opinion is that the blocking performance shouldn't be the reason we reject this approach, but rather that it adds complexity. If we want a quick fix that is safe, I would be in favor of passing the row, which indeed is slower, but better than deadlocking or calculating the upstream twice. It's just that the current system is unacceptable. Maybe we can also consider a complete architecture shift to a batching system that uses Thrift to serialize the Scala types to a language-agnostic format and also handles the blocking RPC. Then PySpark and SparkR could share the same simple UDF architecture. The main drawback is that I'm not sure how we would support broadcast variables or aggregators, but should those even be supported with UDFs?
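The buffering hazard described above is the classic two-pipe deadlock. A self-contained illustration in plain Python (not Spark code, assuming a Unix-like system with `cat`): if both ends of a pair of pipes write large payloads without anyone draining, both OS pipe buffers fill and the processes block on each other; draining from a separate thread breaks the cycle.

```python
import subprocess
import threading

# `cat` echoes stdin to stdout, like a worker streaming results back.
child = subprocess.Popen(["cat"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

output = []

def drain():
    # Concurrently draining stdout keeps the child's output pipe from filling.
    output.append(child.stdout.read())

reader = threading.Thread(target=drain)
reader.start()

payload = b"x" * (1 << 24)  # 16 MiB, far larger than any OS pipe buffer
child.stdin.write(payload)  # without the reader thread, this can block forever
child.stdin.close()
reader.join()
assert output[0] == payload
```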
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140936346

Hey davies, I think the performance regression for a single UDF may be because there were multiple threads per task that could potentially be taking up CPU time. I highly doubt that the actual IO over loopback adds much time compared to deserializing and serializing the individual items in the row. The other approach of passing the entire row can potentially be okay, and it doesn't add a lot of changes to PythonRDD and Python UDFs, but I'm afraid that the cost of serializing the entire row can be prohibitive. After all, isn't serialization from in-memory JVM types to the pickled representation the most expensive part? What if I have a giant row of 100 columns, and I only want to run a UDF on one column? Do I need to pickle the entire row?
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140920743

@davies how do I have a private class in python? In addition, is it possible that the failing unit test is flaky? I ran ./run-tests --python-executables=python and it succeeds locally.
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8318#issuecomment-140871937

Thanks! Sorry for being demanding, was just hoping to get this into 1.6.0!
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8318#issuecomment-140866466

What is this blocking on?
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140416126

@rxin or @davies why is this not automatically retriggering when I push a new commit? Also, it looks like "retest this please" only works for committers.
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140413982

Jenkins, retest this please
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140223207

Looks like your intuition was right. The second run is slightly faster, so I ran the loop twice and took the second run's numbers. Here are the updated numbers.

With fix:
Number of udfs: 0 - 0.0953350067139
Number of udfs: 1 - 1.73201990128
Number of udfs: 2 - 3.41883206367
Number of udfs: 3 - 5.24572992325
Number of udfs: 4 - 6.83000802994
Number of udfs: 5 - 8.59465384483

Without fix:
Number of udfs: 0 - 0.0891687870026
Number of udfs: 1 - 1.53674888611
Number of udfs: 2 - 4.44895505905
Number of udfs: 3 - 10.0561971664
Number of udfs: 4 - 21.5314221382
Number of udfs: 5 - 43.887141943

It does look like there's a tiny performance drop for one UDF. My guess is that the initial approach was slightly cheating on CPU time: it had 3 threads that could do computation at once. However, that breaks the RDD abstraction that each partition should only get one thread to do CPU work.
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140181530

Sorry for the delay; here is the code I ran and the results:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType
    import time

    mult = udf(lambda x: 2 * x, IntegerType())

    for i in range(0, 6):
        df = sqlContext.range(100).withColumnRenamed("id", "f")
        for j in range(i):
            df = df.select(mult(df.f).alias('f'))
        start = time.time()
        df.count()  # make sure the Python UDF is evaluated
        used = time.time() - start
        print "Number of udfs: {} - {}".format(i, used)

The results are as expected. The Python overhead is about 1.5 seconds, and you can see how the time grows exponentially without the fix, since calculating the upstream twice repeats the expensive Python operations themselves.

With fix:
Number of udfs: 0 - 0.091050863266
Number of udfs: 1 - 1.72215199471
Number of udfs: 2 - 3.32698297501
Number of udfs: 3 - 5.64863801003
Number of udfs: 4 - 7.06328701973
Number of udfs: 5 - 9.22025489807

Without fix:
Number of udfs: 0 - 1.00539588928
Number of udfs: 1 - 3.12671899796
Number of udfs: 2 - 5.91188406944
Number of udfs: 3 - 11.124516964
Number of udfs: 4 - 24.3277280331
Number of udfs: 5 - 47.621573925
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139332688

Is there an example of another benchmark? I'm not sure where they're stored for python
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139331466

Hey davies, I don't have any numbers. Are there any benchmarks that we can just rerun?
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139023500

Should the build have started by now?
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-138758861

@davies @JoshRosen @rxin
[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/8662

[SPARK-8632] [SQL] [PYSPARK] Poor Python UDF performance because of RDD caching

- I wanted to reuse most of the logic from PythonRDD, so I pulled out two methods, writeHeaderToStream and readPythonProcessSocket
- worker.py now has a switch where it reads an int that either tells it to go into normal pyspark RDD mode, which is meant for a streaming two-thread workflow, or into pyspark UDF mode, which is meant to be called synchronously

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/justinuang/spark feature/pyspark_udf

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8662.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8662

commit af5254b0fd4a11696f248d148c650f157496af6e
Author: Justin Uang  Date: 2015-09-08T04:23:14Z
[SPARK-8632] [SQL] [PYSPARK] Poor Python UDF performance because of RDD caching
- I wanted to reuse most of the logic from PythonRDD, so I pulled out two methods, writeHeaderToStream and readPythonProcessSocket
- The worker.py now has a switch where it reads an int that either tells it to go into normal pyspark RDD mode, which is meant for a streaming two thread workflow, and pyspark UDF mode, which is meant to be called synchronously
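The mode switch described in the second bullet, sketched with hypothetical tag values and stubbed-out loops; the real worker.py protocol differs in detail:

```python
import struct

RDD_MODE, UDF_MODE = 0, 1  # hypothetical mode tags written by the JVM

def read_int(stream):
    # 4-byte big-endian int, matching java.io.DataOutputStream.writeInt.
    return struct.unpack(">i", stream.read(4))[0]

def streaming_rdd_loop(infile, outfile):
    pass  # the existing two-thread streaming path would live here

def synchronous_udf_loop(infile, outfile):
    pass  # read a batch of pickled rows, apply the UDF, write results back

def main(infile, outfile):
    mode = read_int(infile)
    if mode == RDD_MODE:
        streaming_rdd_loop(infile, outfile)
    elif mode == UDF_MODE:
        synchronous_udf_loop(infile, outfile)
    else:
        raise ValueError("unknown worker mode: %d" % mode)
```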
[GitHub] spark pull request: [SPARK-10447][WIP][PYSPARK] upgrade pyspark to...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8615#issuecomment-138326562

I think we are missing some of the references to 0.8.2.1:

git grep py4j-
LICENSE:For Py4J (python/lib/py4j-0.8.2.1-src.zip)
bin/pyspark:export PYTHONPATH="$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
bin/pyspark2.cmd:set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.9-src.zip;%PYTHONPATH%
core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala: pythonPath += Seq(sparkHome, "python", "lib", "py4j-0.9-src.zip").mkString(File.separator)
core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala: thread.setName("py4j-gateway-init")
python/docs/Makefile:export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.9-src.zip)
sbin/spark-config.sh:export PYTHONPATH="$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"
yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala: val py4jFile = new File(pyLibPath, "py4j-0.9-src.zip")
yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala: "py4j-0.9-src.zip not found; cannot run pyspark application in YARN mode.")
yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala: s"$sparkHome/python/lib/py4j-0.9-src.zip",
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37570574

--- Diff: python/pyspark/pyspark_version.py ---
@@ -0,0 +1,17 @@
+# (Apache License 2.0 header)
+__version__ = '1.5.0'
--- End diff --

We still need to build an sdist and wheel, so we can just make sure that whatever process we use adds that file in. Not sure if it's really worth the complexity at this moment, but my team does something internally such that our Python and Java code both get semantic versions based off the latest tag and the git hash.
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37570377

--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,31 @@
 Finer-grained cache persistence levels.
 """
+import os
+import sys
+
+import xml.etree.ElementTree as ET
+
+if (os.environ.get("SPARK_HOME", "not found") == "not found"):
--- End diff --

I don't really understand this part of the code, so it would be nice to get some core devs to chime in, but it looks like ./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala contains a lot of logic that seems important, especially when deploying against YARN.
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37570006

--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,31 @@
 Finer-grained cache persistence levels.
 """
+import os
+import sys
+
+import xml.etree.ElementTree as ET
+
+if (os.environ.get("SPARK_HOME", "not found") == "not found"):
--- End diff --

That's possible via the following:

    PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook' spark-1.4.0-bin-hadoop2.4/bin/pyspark

Not completely discoverable, but it works =)
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37563567

--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,31 @@
 Finer-grained cache persistence levels.
 """
+import os
+import sys
+
+import xml.etree.ElementTree as ET
+
+if (os.environ.get("SPARK_HOME", "not found") == "not found"):
--- End diff --

I agree with @alope107. In addition, if people are using spark-submit, then this isn't necessary, right? spark-submit sets up SPARK_HOME automatically. Are people frequently launching Python apps without using spark-submit?
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8318#issuecomment-132998065

@holdenk, thanks for working on this! Do we have plans to set up PyPI publishing?
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37524804

--- Diff: python/pyspark/pyspark_version.py ---
@@ -0,0 +1,17 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+__version__ = '1.5.0'
--- End diff --

An alternative, but trickier, idea would be to make mvn's pom.xml version the authoritative one and have the build process add or modify this file to match it (maybe using mvn resource filtering?). This would break being able to just "pip install -e python" in development mode, since people would have to remember to run the mvn command to sync the file over, but at least there would be no risk of the two going out of sync in the build.
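A build-time sync along those lines might look roughly like this (hypothetical, not part of this PR; it assumes the project-level <version> element in Spark's root pom.xml):

    # Regenerate pyspark_version.py from the authoritative pom.xml version;
    # this could be wired into the mvn build so the two can't drift apart.
    import xml.etree.ElementTree as ET

    POM_NS = "{http://maven.apache.org/POM/4.0.0}"

    def sync_version(pom_path="pom.xml",
                     out_path="python/pyspark/pyspark_version.py"):
        root = ET.parse(pom_path).getroot()
        version = root.find(POM_NS + "version").text  # project-level <version>
        with open(out_path, "w") as f:
            f.write("__version__ = '%s'\n" % version)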
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37524370

--- Diff: python/setup.py ---
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+
+from setuptools import setup
+
+exec(compile(open("pyspark/pyspark_version.py").read(),
+             "pyspark/pyspark_version.py", 'exec'))
+VERSION = __version__
+
+setup(name='pyspark',
+      version=VERSION,
+      description='Apache Spark Python API',
+      author='Spark Developers',
+      author_email='d...@spark.apache.org',
+      url='https://github.com/apache/spark/tree/master/python',
+      packages=['pyspark', 'pyspark.mllib', 'pyspark.ml', 'pyspark.sql', 'pyspark.streaming'],
+      data_files=[('pyspark', ['pyspark/pyspark_version.py'])],
--- End diff --

Why do we need to treat pyspark_version.py differently and have it under data_files?
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37459213

--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,33 @@
 Finer-grained cache persistence levels.
 """
+import os
+import sys
+
+import xml.etree.ElementTree as ET
+
+if (os.environ.get("SPARK_HOME", "not found") == "not found"):
+    raise ImportError("Environment variable SPARK_HOME is undefined.")
+
+spark_home = os.environ['SPARK_HOME']
+pom_xml_file_path = spark_home + '/pom.xml'
--- End diff --

Use os.path.join here instead of concatenating path strings.
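i.e., something like:

    import os

    spark_home = os.environ['SPARK_HOME']
    pom_xml_file_path = os.path.join(spark_home, 'pom.xml')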
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37459103

--- Diff: python/pyspark/__init__.py ---
@@ -36,6 +36,33 @@
 Finer-grained cache persistence levels.
 """
+import os
+import sys
+
+import xml.etree.ElementTree as ET
+
+if (os.environ.get("SPARK_HOME", "not found") == "not found"):
+    raise ImportError("Environment variable SPARK_HOME is undefined.")
+
+spark_home = os.environ['SPARK_HOME']
+pom_xml_file_path = spark_home + '/pom.xml'
+
+try:
+    tree = ET.parse(pom_xml_file_path)
+    root = tree.getroot()
+    version_tag = root[4].text
+    snapshot_version = version_tag[:5]
+except:
+    raise ImportError("Could not read the spark version, because pom.xml file" +
+                      " is not found in SPARK_HOME(%s) directory." % (spark_home))
+
+from pyspark.pyspark_version import __version__
+if (snapshot_version != __version__):
+    raise ImportError("Incompatible version of Spark(%s) and PySpark(%s)." %
+                      (snapshot_version, __version__))
+
+sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python/lib/py4j-0.8.1-src.zip"))
--- End diff --

We don't need this anymore; presumably, if they pip installed the package, then py4j will already be installed in site-packages.
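For illustration, the pip route would mean declaring py4j in setup.py so it gets installed from PyPI alongside pyspark (a hypothetical sketch; the pin should track whatever py4j version Spark bundles):

    from setuptools import setup

    setup(name='pyspark',
          version='1.5.0',  # illustrative; the PR derives this from pyspark_version.py
          packages=['pyspark'],
          install_requires=['py4j==0.8.2.1'])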
[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37458924

--- Diff: python/setup.py ---
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+
+from setuptools import setup
+
+exec(compile(open("pyspark/pyspark_version.py").read(),
+             "pyspark/pyspark_version.py", 'exec'))
+VERSION = __version__
+
+setup(name = 'pyspark',
+   version = VERSION,
--- End diff --

Why are we using three spaces for indentation?
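For reference, PEP 8 would call for four spaces (or alignment with the opening delimiter) and no spaces around '=' in keyword arguments, e.g.:

    setup(name='pyspark',
          version=VERSION)  # VERSION as computed earlier in this setup.py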
[GitHub] spark pull request: [SPARK-7899][PYSPARK] Fix Python 3 pyspark/sql...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/6439#discussion_r31169697

--- Diff: python/run-tests ---
@@ -57,54 +57,54 @@ function run_test() {

 function run_core_tests() {
     echo "Run core tests ..."
-    run_test "pyspark/rdd.py"
-    run_test "pyspark/context.py"
-    run_test "pyspark/conf.py"
-    PYSPARK_DOC_TEST=1 run_test "pyspark/broadcast.py"
--- End diff --

Alright, sounds good!
[GitHub] spark pull request: [SPARK-7899][PYSPARK] Fix Python 3 pyspark/sql...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/6439#discussion_r31166415

--- Diff: python/run-tests ---
@@ -57,54 +57,54 @@ function run_test() {

 function run_core_tests() {
     echo "Run core tests ..."
-    run_test "pyspark/rdd.py"
-    run_test "pyspark/context.py"
-    run_test "pyspark/conf.py"
-    PYSPARK_DOC_TEST=1 run_test "pyspark/broadcast.py"
--- End diff --

Why did you remove the PYSPARK_DOC_TEST env var?
[GitHub] spark pull request: [SPARK-7329][MLLIB] simplify ParamGridBuilder ...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/5873#issuecomment-98541714

You can consider using set equality for the test, but other than that, it looks good! Thanks!
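A sketch of that set-equality check for the doctest in question (hypothetical; the param maps are dicts, which aren't hashable, so each one is converted to a frozenset of its items first):

    def as_set(param_maps):
        # Turn a list of param dicts into a set of hashable frozensets.
        return set(frozenset(m.items()) for m in param_maps)

    assert as_set(output) == as_set(expected)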
[GitHub] spark pull request: [SPARK-7022][PySpark][ML] Add ML.Tuning.ParamG...
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/5601#issuecomment-98386185

Yeah, you should try rebasing. It looks like you're not the only one running into this: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31639/console
[GitHub] spark pull request: [SPARK-7022][PySpark][ML] Add ML.Tuning.ParamG...
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/5601#discussion_r29482941

--- Diff: python/pyspark/ml/tuning.py ---
@@ -0,0 +1,94 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+__all__ = ['ParamGridBuilder']
+
+
+class ParamGridBuilder(object):
+    """
+    Builder for a param grid used in grid search-based model selection.
+
+    >>> from classification import LogisticRegression
+    >>> lr = LogisticRegression()
+    >>> output = ParamGridBuilder().baseOn({lr.labelCol: 'l'}) \
+            .baseOn([lr.predictionCol, 'p']) \
+            .addGrid(lr.regParam, [1.0, 2.0, 3.0]) \
+            .addGrid(lr.maxIter, [1, 5]) \
+            .addGrid(lr.featuresCol, ['f']) \
+            .build()
+    >>> expected = [ \
+        {lr.regParam: 1.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, \
+        {lr.regParam: 2.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, \
+        {lr.regParam: 3.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, \
+        {lr.regParam: 1.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}, \
+        {lr.regParam: 2.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}, \
+        {lr.regParam: 3.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}]
+    >>> fail_count = 0
+    >>> for e in expected:
+    ...     if e not in output:
+    ...         fail_count += 1
+    >>> if len(expected) != len(output):
+    ...     fail_count += 1
+    >>> fail_count
+    0
+    """
+
+    def __init__(self):
+        self._param_grid = {}
+
+    def addGrid(self, param, values):
+        """
+        Sets the given parameters in this grid to fixed values.
+        """
+        self._param_grid[param] = values
+
+        return self
+
+    def baseOn(self, *args):
+        """
+        Sets the given parameters in this grid to fixed values.
+        Accepts either a parameter dictionary or a list of (parameter, value) pairs.
+        """
+        if isinstance(args[0], dict):
+            self.baseOn(*args[0].items())
+        else:
+            for (param, value) in args:
+                self.addGrid(param, [value])
+
+        return self
+
+    def build(self):
+        """
+        Builds and returns all combinations of parameters specified
+        by the param grid.
+        """
+        param_maps = [{}]
+        for (param, values) in self._param_grid.items():
--- End diff --

Consider doing this:

    [dict(zip(self._param_grid.keys(), prod))
     for prod in itertools.product(*self._param_grid.values())]

to avoid the overhead of lots of dictionary copies.
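Spelled out, that suggestion as a full build() body would look roughly like this (a sketch against the _param_grid dict above, with itertools imported at module level):

    import itertools

    def build(self):
        """
        Builds and returns all combinations of parameters specified
        by the param grid, one dict per point in the cross product.
        """
        keys = self._param_grid.keys()
        grid_values = self._param_grid.values()
        return [dict(zip(keys, prod))
                for prod in itertools.product(*grid_values)]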
[GitHub] spark pull request: [SPARK-2213][SQL] Sort Merge Join
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/3173#issuecomment-74416823

Hi, this looks great! Is there a reason why sort-based join is only in Spark SQL and not in Spark core?