[GitHub] spark pull request #16540: Nullability udfs
Github user damnMeddlingKid closed the pull request at: https://github.com/apache/spark/pull/16540
[GitHub] spark pull request #16540: Nullability udfs
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/16540

Nullability udfs

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/damnMeddlingKid/spark nullability_udfs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16540.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16540

commit 5083b08f511b852c66bc006c41656f0560a893d6
Author: Franklyn D'souza
Date: 2017-01-10T21:46:10Z

python changes

commit fbe780f84226cf72de801ad7931fa1b4d1acd2e5
Author: Franklyn D'souza
Date: 2017-01-10T22:49:44Z

introduce nullability for scala udfs

commit a00544c8709dce65d3514546348bdd0459dc25a3
Author: Franklyn D'souza
Date: 2017-01-10T23:05:34Z

add nullability to scala python udfs

commit d85230263156684adfabecef20293477dd30e957
Author: Franklyn D'souza
Date: 2017-01-10T23:21:25Z

check for none in wrapped function
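The template above was left unfilled, but the commit titles ("check for none in wrapped function") hint at the intent: None-safe handling around the Python UDF. A minimal, hypothetical sketch of that kind of wrapper (not the actual patch) might look like this:

```
# Hypothetical sketch of a None-checking UDF wrapper; illustrative only,
# not the code in this PR.
def null_safe(f):
    """Skip calling f when any argument is None, returning None instead."""
    def wrapped(*args):
        if any(a is None for a in args):
            return None
        return f(*args)
    return wrapped

# Example: a UDF body that would otherwise raise on None input.
to_upper = null_safe(lambda s: s.upper())
assert to_upper(None) is None
assert to_upper("spark") == "SPARK"
```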
[GitHub] spark pull request #14164: [SPARK-16629] Allow comparisons between UDTs and ...
Github user damnMeddlingKid closed the pull request at: https://github.com/apache/spark/pull/14164
[GitHub] spark issue #14164: [SPARK-16629] Allow comparisons between UDTs and Datatyp...
Github user damnMeddlingKid commented on the issue: https://github.com/apache/spark/pull/14164

I've tested this successfully with int and timestamp types, but it doesn't seem to work with DecimalType. Anyone know what could be wrong?
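For readers following along, the failing comparison presumably resembles the sketch below; `df`, the column name, and the Decimal-backed UDT are hypothetical stand-ins, and the DecimalType failure is exactly the open question.

```
# Hypothetical repro sketch: filtering on a UDT-backed column against a
# decimal literal. Assumes an existing DataFrame `df` whose column
# `udt_price` is backed by a UDT with sqlType DecimalType(10, 2).
from decimal import Decimal

threshold = Decimal("19.99")
# Reportedly works when the UDT's sqlType is an int or timestamp type,
# but not when it is DecimalType:
filtered = df.filter(df["udt_price"] > threshold)
```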
[GitHub] spark pull request #14164: Allow comparisons between UDTs and Datatypes
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/14164

Allow comparisons between UDTs and Datatypes

## What changes were proposed in this pull request?

Currently UDTs cannot be compared to Datatypes even if their sqlTypes match. This leads to errors like this:

```
In [12]: thresholded = df.filter(df['udt_time'] > threshold)
---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
/Users/franklyndsouza/dev/starscream/bin/starscream in ()
----> 1 thresholded = df.filter(df['tick_tock_est'] > threshold)

AnalysisException: u"cannot resolve '(`tick_tock_est` > TIMESTAMP('2015-10-20 01:00:00.0'))' due to data type mismatch: '(`tick_tock_est` > TIMESTAMP('2015-10-20 01:00:00.0'))' requires (boolean or tinyint or smallint or int or bigint or float or double or decimal or timestamp or date or string or binary) type, not pythonuserdefined"
```

This PR adds some comparisons that allow UDTs to be correctly compared to a Datatype.

## How was this patch tested?

Built locally and tested in the pyspark repl.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/damnMeddlingKid/spark fix-df-filtering

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14164.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14164

commit d0d31ca18c49fd24476d8b7291cb16d5f346ee6e
Author: Franklyn D'souza
Date: 2016-07-12T22:17:25Z

allow comparisons between UDTs and Datatypes
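Conceptually, the fix requires the analyzer's type check to see through a UDT to its underlying sqlType. A rough Python illustration of that idea follows; the real change lives in Catalyst's type checking (Scala), and this helper is purely hypothetical.

```
# Purely illustrative: treat a UDT as compatible with a plain DataType
# when its underlying sqlType matches. Not the actual Catalyst change.
def comparable(left_type, right_type):
    def unwrap(dt):
        # pyspark UserDefinedType subclasses expose their storage type
        # via sqlType(); plain DataTypes pass through unchanged.
        return dt.sqlType() if hasattr(dt, "sqlType") else dt
    return unwrap(left_type) == unwrap(right_type)
```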
[GitHub] spark issue #13717: [SPARK-15811] [SQL] fix the Python UDF in Scala 2.10
Github user damnMeddlingKid commented on the issue: https://github.com/apache/spark/pull/13717

Just to take a step back: is the suite lacking coverage for this feature? This sort of thing should have been caught in the unit tests.
[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...
Github user damnMeddlingKid closed the pull request at: https://github.com/apache/spark/pull/11333
[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...
Github user damnMeddlingKid commented on the pull request: https://github.com/apache/spark/pull/11333#issuecomment-187964146

Hoping to get this into 1.6.1.
[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/11333

[SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.

## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations. This was previously causing `AnalysisException: u"unresolved operator 'Union;'"` when trying to unionAll two dataframes with UDT columns as below.

```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
c = a.unionAll(b)
```

## How was this patch tested?

Tested using two unit tests in sql/test.py and the DataFrameSuite.

Additional information here: https://issues.apache.org/jira/browse/SPARK-13410

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/damnMeddlingKid/spark udt-union-patch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11333.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #11333

commit c14d1ba953ecbfa141887f801445ffa8ab280dee
Author: Franklyn D'souza
Date: 2016-02-23T21:48:20Z

support unionAll for dataframes with UDT columns
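The "equality operators" the PR describes amount to instances of the same UDT class comparing equal as dataTypes. A generic sketch of that pattern, illustrative rather than the patch itself:

```
# Illustrative: two instances of the same UDT class should count as the
# same dataType, so schema equality holds during Union resolution.
class PointUDT(object):
    """Stand-in for a UserDefinedType subclass; the name is hypothetical."""

    def __eq__(self, other):
        # Equal iff the other side is an instance of the same UDT class.
        return isinstance(other, type(self))

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash(type(self))

assert PointUDT() == PointUDT()  # distinct instances, equal dataTypes
```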
[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...
Github user damnMeddlingKid closed the pull request at: https://github.com/apache/spark/pull/11330
[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/11330

[SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.

## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations. This was previously causing `AnalysisException: u"unresolved operator 'Union;'"` when trying to unionAll two dataframes with UDT columns as below.

```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
c = a.unionAll(b)
```

## How was this patch tested?

Tested using two unit tests in sql/test.py and the DataFrameSuite.

Additional information here: https://issues.apache.org/jira/browse/SPARK-13410

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/damnMeddlingKid/spark udt-union-all

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11330.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #11330

commit 6f0f1d9e04a8db47e2f6f8fcfe9dea9de0f633da
Author: Cheng Lian
Date: 2016-01-25T23:05:05Z

[SPARK-12934][SQL] Count-min sketch serialization

This PR adds serialization support for `CountMinSketch`. A version number is added to version the serialized binary format.

Author: Cheng Lian

Closes #10893 from liancheng/cms-serialization.

commit be375fcbd200fb0e210b8edcfceb5a1bcdbba94b
Author: Wenchen Fan
Date: 2016-01-26T00:23:59Z

[SPARK-12879] [SQL] improve the unsafe row writing framework

As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should document it better and make it easier to use.

This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, do not always point the row to the buffer at the end; we only need to update the size of the row. If all fields are of primitive type, we can even skip the row-size update. We can then apply this technique in more places easily.

A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:

**old version**

```
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:           Avg Time(ms)    Avg Rate(M/s)    Relative Rate
---------------------------------------------------------------------------
single long                       2616.04           102.61           1.00 X
single nullable long              3032.54            88.52           0.86 X
primitive types                   9121.05            29.43           0.29 X
nullable primitive types         12410.60            21.63           0.21 X
```

**new version**

```
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:           Avg Time(ms)    Avg Rate(M/s)    Relative Rate
---------------------------------------------------------------------------
single long                       1533.34           175.07           1.00 X
single nullable long              2306.73           116.37           0.66 X
primitive types                   8403.93            31.94           0.18 X
nullable primitive types         12448.39            21.56           0.12 X
```

For a single non-nullable long (the best case), this gives about a 1.7x speed-up; even nullable, it is still a 1.3x speed-up. For the other cases the boost is smaller, as the saved operations are only a small proportion of the whole process. The benchmark code is included in this PR.

Author: Wenchen Fan

Closes #10809 from cloud-fan/unsafe-projection.

commit 109061f7ad27225669cbe609ec38756b31d4e1b9
Author: Wenchen Fan
Date: 2016-01-26T01:58:11Z

[SPARK-12936][SQL] Initial bloom filter implementation

This PR adds an initial implementation of a bloom filter in the newly added sketch module. The implementation is based on the [`BloomFilter` class in guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java). Some differences from the design doc:

* expose `bitSize` instead of `sizeInBytes` to us
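The commit message above is truncated; for background, the core bloom-filter technique it refers to can be sketched in a few lines of Python. The sizing and hashing choices here are illustrative, not Spark's or Guava's:

```
import hashlib

class BloomFilter(object):
    """Minimal illustrative bloom filter; not Spark's implementation."""

    def __init__(self, bit_size, num_hashes):
        self.bit_size = bit_size
        self.num_hashes = num_hashes
        self.bits = bytearray((bit_size + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from one md5 digest.
        digest = hashlib.md5(str(item).encode("utf8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.bit_size for i in range(self.num_hashes)]

    def put(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # No false negatives; the false-positive rate depends on sizing.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(bit_size=1 << 16, num_hashes=5)
bf.put("spark")
assert bf.might_contain("spark")
```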
[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...
Github user damnMeddlingKid commented on the pull request: https://github.com/apache/spark/pull/11279#issuecomment-187749779

@rxin any chance this will make it into 1.6.1?
[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...
Github user damnMeddlingKid commented on the pull request: https://github.com/apache/spark/pull/11279#issuecomment-186738811

Yeah, I think it's just the order of the output. I've made the ordering more explicit now; I've run these tests on my local machine and they pass.
[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...
Github user damnMeddlingKid commented on the pull request: https://github.com/apache/spark/pull/11279#issuecomment-186467863

That should be it.
[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/11279

[SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.

## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.

## How was this patch tested?

Tested using two unit tests in test.py and the DataFrameSuite. These tests fail without this patch with `AnalysisException: u"unresolved operator 'Union;'"`.

Additional information here: https://issues.apache.org/jira/browse/SPARK-13410

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/damnMeddlingKid/spark udt-union-all

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11279.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #11279

commit fc8ea19bb4deebfc74bedc1d4092d9c9dd9ace00
Author: Franklyn D'souza
Date: 2016-02-19T04:40:04Z

support union all for UDT

commit 2642f68ad67bf6d7110d0da9f19daad295695fd1
Author: Franklyn D'souza
Date: 2016-02-19T21:46:12Z

test unionAll for udt dfs
[GitHub] spark pull request: Kafka streaming
Github user damnMeddlingKid closed the pull request at: https://github.com/apache/spark/pull/10136
[GitHub] spark pull request: Kafka streaming
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/10136

Kafka streaming

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Shopify/spark kafka_streaming

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10136.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10136

commit 854319e589c89b2b6b4a9d02916f6f748fc5680a
Author: Fernando Otero (ZeoS)
Date: 2015-01-08T20:42:54Z

SPARK-5148 [MLlib] Make usersOut/productsOut storagelevel in ALS configurable

Author: Fernando Otero (ZeoS)

Closes #3953 from zeitos/storageLevel and squashes the following commits:

0f070b9 [Fernando Otero (ZeoS)] fix imports
6869e80 [Fernando Otero (ZeoS)] fix comment length
90c9f7e [Fernando Otero (ZeoS)] fix comment length
18a992e [Fernando Otero (ZeoS)] changing storage level

commit d9cad94b1df0200207ba03fb0168373ccc3a8597
Author: Kousuke Saruta
Date: 2015-01-08T21:43:09Z

[SPARK-4973][CORE] Local directory in the driver of client-mode continues remaining even if application finished when external shuffle is enabled

When the external shuffle service is enabled, local directories in the driver of client-mode remain even after the application has finished. I think local directories for drivers should be deleted.

Author: Kousuke Saruta

Closes #3811 from sarutak/SPARK-4973 and squashes the following commits:

ad944ab [Kousuke Saruta] Fixed DiskBlockManager to cleanup local directory if it's the driver
43770da [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973
88feecd [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973
d99718e [Kousuke Saruta] Fixed SparkSubmit.scala and DiskBlockManager.scala in order to delete local directories of the driver of local-mode when external shuffle service is enabled

commit b14068bf7b2dff450101d48a59e79761e3ca4eb2
Author: RJ Nowling
Date: 2015-01-08T23:03:43Z

[SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to PySpark MLlib

This is a follow-up to PR 3680: https://github.com/apache/spark/pull/3680

Author: RJ Nowling

Closes #3955 from rnowling/spark4891 and squashes the following commits:

1236a01 [RJ Nowling] Fix Python style issues
7a01a78 [RJ Nowling] Fix Python style issues
174beab [RJ Nowling] [SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to PySpark MLlib

commit 5a1b7a9c8a77b6d1ef5553490d0ccf291dfac06f
Author: Marcelo Vanzin
Date: 2015-01-09T01:15:13Z

[SPARK-4048] Enhance and extend hadoop-provided profile.

This change does a few things to make the hadoop-provided profile more useful:

- Create new profiles for other libraries / services that might be provided by the infrastructure
- Simplify and fix the poms so that the profiles are only activated while building assemblies.
- Fix tests so that they're able to run when the profiles are activated
- Add a new env variable to be used by distributions that use these profiles to provide the runtime classpath for Spark jobs and daemons.

Author: Marcelo Vanzin

Closes #2982 from vanzin/SPARK-4048 and squashes the following commits:

82eb688 [Marcelo Vanzin] Add a comment.
eb228c0 [Marcelo Vanzin] Fix borked merge.
4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes.
371ebee [Marcelo Vanzin] Review feedback.
52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
322f882 [Marcelo Vanzin] Fix merge fail.
f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9640503 [Marcelo Vanzin] Cleanup child process log message.
115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom).
e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile.
7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles.
1be73d4 [Marcelo Vanzin] Restore flume-provided profile.
d1399ed [Marcelo Vanzin] Restore jetty dependency.
82a54b9 [Marcelo Vanzin] Remove unused profile.
5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles.
1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver.
f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provi
[GitHub] spark pull request: New spark
Github user damnMeddlingKid closed the pull request at: https://github.com/apache/spark/pull/9342
[GitHub] spark pull request: New spark
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/9342

New spark

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Shopify/spark new_spark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9342.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9342

commit 60b922795d0d6a5e0db96c11416804153e307810
Author: Zhang, Liye
Date: 2015-01-08T18:40:26Z

[SPARK-4989][CORE] avoid wrong eventlog conf cause cluster down in standalone mode

When enabling the eventlog in standalone mode, a wrong configuration can bring the standalone cluster down (the Master restarts and loses its connection to the workers). How to reproduce: just give an invalid value to "spark.eventLog.dir", for example: spark.eventLog.dir=hdfs://tmp/logdir1, hdfs://tmp/logdir2. This throws an IllegalArgumentException, which causes the Master to restart and leaves the whole cluster unavailable.

Author: Zhang, Liye

Closes #3824 from liyezhang556520/wrongConf4Cluster and squashes the following commits:

3c24d98 [Zhang, Liye] revert change with logwarning and exception for FileNotFoundException
3c1ac2e [Zhang, Liye] change var to val
a49c52f [Zhang, Liye] revert wrong modification
12eee85 [Zhang, Liye] add more message in log and on webUI
5c1fa33 [Zhang, Liye] cache exceptions when eventlog with wrong conf

commit a9940b5a04c905698f17940669a161fcd414284f
Author: Kousuke Saruta
Date: 2015-01-08T19:35:56Z

[Minor] Fix the value represented by spark.executor.id for consistency.

The property `spark.executor.id` can represent both `driver` and `` for one driver. It's inconsistent. This issue is minor, so I didn't file this in JIRA.

Author: Kousuke Saruta

Closes #3812 from sarutak/fix-driver-identifier and squashes the following commits:

d885498 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-driver-identifier
4275663 [Kousuke Saruta] Fixed the value represented by spark.executor.id of local mode

commit b4fb97df2cbdd743656e000fefe471406619220c
Author: WangTaoTheTonic
Date: 2015-01-08T19:45:42Z

[SPARK-5130][Deploy] Take yarn-cluster as cluster mode in spark-submit

https://issues.apache.org/jira/browse/SPARK-5130

Author: WangTaoTheTonic

Closes #3929 from WangTaoTheTonic/SPARK-5130 and squashes the following commits:

c490648 [WangTaoTheTonic] take yarn-cluster as cluster mode in spark-submit

commit 31d67152c2cbbe2e076003b3ff0d0a7e2f801549
Author: Eric Moyer
Date: 2015-01-08T19:55:23Z

Document that groupByKey will OOM for large keys

This pull request is my own work and I license it under Spark's open-source license. This contribution is an improvement to the documentation. I documented that the maximum number of values per key for groupByKey is limited by available RAM (see [Datablox][datablox link] and [the spark mailing list][list link]). Just saying that better performance is available is not sufficient: sometimes you need to do a group-by because your operation needs all the items available in order to complete. This warning explains the problem.

[datablox link]: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
[list link]: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11466.html

Author: Eric Moyer

Closes #3936 from RadixSeven/better-group-by-docs and squashes the following commits:

5b6f4e9 [Eric Moyer] groupByKey docs naming updates
238e81b [Eric Moyer] Doc that groupByKey will OOM for large keys

commit 854319e589c89b2b6b4a9d02916f6f748fc5680a
Author: Fernando Otero (ZeoS)
Date: 2015-01-08T20:42:54Z

SPARK-5148 [MLlib] Make usersOut/productsOut storagelevel in ALS configurable

Author: Fernando Otero (ZeoS)

Closes #3953 from zeitos/storageLevel and squashes the following commits:

0f070b9 [Fernando Otero (ZeoS)] fix imports
6869e80 [Fernando Otero (ZeoS)] fix comment length
90c9f7e [Fernando Otero (ZeoS)] fix comment length
18a992e [Fernando Otero (ZeoS)] changing storage level

commit d9cad94b1df0200207ba03fb0168373ccc3a8597
Author: Kousuke Saruta
Date: 2015-01-08T21:43:09Z

[SPARK-4973][CORE] Local directory in the driver of client-mode continues remaining even if application finished when external shuffle is enabled

When the external shuffle service is enabled, local directories in the driver of client-mode remain even after the application has finished. I
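The groupByKey warning above is usually paired with the map-side-combining alternative; a minimal PySpark illustration, assuming an existing SparkContext `sc`:

```
# Per-key counts. groupByKey ships every value for a key to one executor
# (and can OOM on hot keys); reduceByKey combines map-side first, so only
# partial sums cross the network.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

risky = pairs.groupByKey().mapValues(lambda vals: sum(vals))  # all values held per key
safe = pairs.reduceByKey(lambda x, y: x + y)                  # partial sums only

assert sorted(safe.collect()) == [("a", 2), ("b", 1)]
```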
[GitHub] spark pull request: Update spark
Github user damnMeddlingKid closed the pull request at: https://github.com/apache/spark/pull/9341
[GitHub] spark pull request: Update spark
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/9341

Update spark

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Shopify/spark update_spark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9341.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9341

commit 4639372eb9325f466b78d074bfc24d8d5a93322e
Author: Alex Angelini
Date: 2015-10-19T17:07:39Z

[SPARK-9643] Upgrade pyrolite to 4.9

Includes https://github.com/irmen/Pyrolite/pull/23, which fixes datetimes with timezones.

JoshRosen https://issues.apache.org/jira/browse/SPARK-9643

Author: Alex Angelini

Closes #7950 from angelini/upgrade_pyrolite_up.

commit 7b4cd3da570c098da5adef82d394c84d3df8d602
Author: Holden Karau
Date: 2015-10-20T17:52:49Z

[SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9

Upgrade to Py4j 0.9

Author: Holden Karau
Author: Holden Karau

Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.

commit 56c6e7846e00c7deacf8349a93e517c7ed496ee5
Author: Nick Evans
Date: 2015-10-27T08:29:06Z

[SPARK-11270][STREAMING] Add improved equality testing for TopicAndPartition from the Kafka Streaming API

jerryshao tdas I know this is kind of minor, and I know you are all busy, but this brings this class in line with the `OffsetRange` class, and makes tests a little more concise. Instead of doing something like:

```
assert topic_and_partition_instance._topic == "foo"
assert topic_and_partition_instance._partition == 0
```

You can do something like:

```
assert topic_and_partition_instance == TopicAndPartition("foo", 0)
```

Before:

```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
False
```

After:

```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
True
```

I couldn't find any tests. Am I missing something?

Author: Nick Evans

Closes #9236 from manygrams/topic_and_partition_equality.
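For reference, the equality behavior that commit adds is the standard value-object pattern in Python; a generic sketch, not the actual pyspark class:

```
class TopicAndPartitionLike(object):
    """Illustrative value type with structural equality, mirroring the
    behavior described above; not the real pyspark class."""

    def __init__(self, topic, partition):
        self._topic = topic
        self._partition = partition

    def __eq__(self, other):
        return (isinstance(other, TopicAndPartitionLike)
                and self._topic == other._topic
                and self._partition == other._partition)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self._topic, self._partition))

assert TopicAndPartitionLike("foo", 0) == TopicAndPartitionLike("foo", 0)
```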
[GitHub] spark pull request: Packserv
Github user damnMeddlingKid closed the pull request at: https://github.com/apache/spark/pull/9151
[GitHub] spark pull request: Packserv
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/9151

Packserv

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Shopify/spark packserv

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9151.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9151

commit 60fde12bc4e824c1447db69f92387f35e9b67331
Author: hushan
Date: 2015-01-07T20:09:12Z

[SPARK-5132][Core] Correct stage Attempt Id key in stageInfofromJson

SPARK-5132:
stageInfoToJson: Stage Attempt Id
stageInfoFromJson: Attempt Id

Author: hushan

Closes #3932 from suyanNone/json-stage and squashes the following commits:

41419ab [hushan] Correct stage Attempt Id key in stageInfofromJson

commit 65c9e1022521053e130220802bbfddd1dba0733e
Author: zsxwing
Date: 2015-01-08T07:01:30Z

[SPARK-5126][Core] Verify Spark urls before creating Actors so that invalid urls can crash the process.

Because `actorSelection` will return `deadLetters` for an invalid path, the Worker keeps quiet for an invalid master url. It's better to log an error so that people can find such problems quickly. This PR checks the url before sending it to `actorSelection`, and throws and logs a SparkException for an invalid url.

Author: zsxwing

Closes #3927 from zsxwing/SPARK-5126 and squashes the following commits:

9d429ee [zsxwing] Create a utility method in Utils to parse Spark url; verify urls before creating Actors so that invalid urls can crash the process.
8286e51 [zsxwing] Check the url before sending to Akka and log the error if the url is invalid

commit 536b82f9cb5535e57393eee401ebddad524aee26
Author: Shuo Xiang
Date: 2015-01-08T07:22:37Z

[SPARK-5116][MLlib] Add extractor for SparseVector and DenseVector

Add extractors for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we may use:

```
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
```

with the extractor it is:

```
vec match {
  case DenseVector(values) =>
    ...
  case SparseVector(size, indices, values) =>
    ...
}
```

Author: Shuo Xiang

Closes #3919 from coderxiang/extractor and squashes the following commits:

359e8d5 [Shuo Xiang] merge master
ca5fc3e [Shuo Xiang] merge master
0b1e190 [Shuo Xiang] use extractor for vectors in RowMatrix.scala
e961805 [Shuo Xiang] use extractor for vectors in StandardScaler.scala
c2bbdaf [Shuo Xiang] use extractor for vectors in IDF.scala
8433922 [Shuo Xiang] use extractor for vectors in NaiveBayes.scala and Normalizer.scala
d83c7ca [Shuo Xiang] use extractor for vectors in Vectors.scala
5523dad [Shuo Xiang] Add extractor for SparseVector and DenseVector

commit 0114e817977782e2e9ae6eeb3d2719f5aa76148b
Author: Sandy Ryza
Date: 2015-01-08T17:25:43Z

SPARK-5087. [YARN] Merge yarn.Client and yarn.ClientBase

Author: Sandy Ryza

Closes #3896 from sryza/sandy-spark-5087 and squashes the following commits:

65611d0 [Sandy Ryza] Review feedback
3294176 [Sandy Ryza] SPARK-5087. [YARN] Merge yarn.Client and yarn.ClientBase

commit 46dca8c79d6de431a8088f1346ddd500d91a7203
Author: Takeshi Yamamuro
Date: 2015-01-08T17:55:12Z

[SPARK-4917] Add a function to convert into a graph with canonical edges in GraphOps

Convert bi-directional edges into uni-directional ones instead of 'canonicalOrientation' in GraphLoader.edgeListFile. This function is useful when a graph is loaded as-is and then transformed into one with canonical edges. It rewrites the vertex ids of edges so that srcIds are bigger than dstIds, and merges the duplicated edges.

Author: Takeshi Yamamuro

Closes #3760 from maropu/ConvertToCanonicalEdgesSpike and squashes the following commits:

7f8b580 [Takeshi Yamamuro] Add a function to convert into a graph with canonical edges in GraphOps

commit 60b922795d0d6a5e0db96c11416804153e307810
Author: Zhang, Liye
Date: 2015-01-08T18:40:26Z

[SPARK-4989][CORE] avoid wrong eventlog conf cause cluster down in standalone mode

When enabling the eventlog in standalone mode, a wrong configuration can bring the standalone cluster down (the Master restarts and loses its connection to the workers). How to reproduce: ju