[GitHub] spark issue #14815: [SPARK-17244] Catalyst should not pushdown non-determini...
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/14815 cc @hvanhovell @gatorsmile --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user Sherry302 commented on the issue: https://github.com/apache/spark/pull/14659 Hi, @srowen Could you please review this PR? Thanks.
[GitHub] spark pull request #14815: [SPARK-17244] Catalyst should not pushdown non-de...
GitHub user sameeragarwal opened a pull request: https://github.com/apache/spark/pull/14815

[SPARK-17244] Catalyst should not pushdown non-deterministic join conditions

## What changes were proposed in this pull request?

Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling such pushdown.

## How was this patch tested?

A new test in `FilterPushdownSuite` that checks Catalyst's behavior for both deterministic and non-deterministic join conditions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark constraint-inputfile

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14815.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14815

commit 95150970d7e5a71d9271a209a8ee453ce20f8097
Author: Sameer Agarwal
Date: 2016-08-24T19:37:33Z

    Joins should not pushdown non-deterministic conditions

commit 6728fc31bab1fd53e1005f892496ec61b6d22cd0
Author: Sameer Agarwal
Date: 2016-08-25T20:45:32Z

    unit test
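To see why pushing a stateful, non-deterministic predicate through a join can change results, here is a minimal standalone sketch. This is plain Python illustrating the idea, not Catalyst code; the counter-based predicate stands in for any stateful expression (e.g. one backed by a random-number generator).

```python
import itertools

def make_stateful_pred():
    # A non-deterministic, stateful predicate: keeps every other row it sees.
    counter = itertools.count()
    return lambda row: next(counter) % 2 == 0

left = [1, 2, 3, 4]
right = [2, 3, 4, 5]

# Predicate evaluated after the join (the semantically correct placement):
pred = make_stateful_pred()
joined = [x for x in left if x in right]      # equality join
after = [x for x in joined if pred(x)]

# Predicate pushed down below the join (the bug being fixed):
pred = make_stateful_pred()
filtered_left = [x for x in left if pred(x)]  # predicate now sees 4 rows, not 3
pushed = [x for x in filtered_left if x in right]

print(after, pushed)
```

Because the predicate's output depends on how many rows it has seen and in what order, moving it below the join feeds it a different row stream and yields a different answer, which is why the optimizer must not relocate such conditions.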
[GitHub] spark issue #14637: [SPARK-16967] move mesos to module
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14637 **[Test build #64435 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64435/consoleFull)** for PR 14637 at commit [`09f3197`](https://github.com/apache/spark/commit/09f3197e7cac9a45315bf5bdaed57c97bcd0e46d).
[GitHub] spark issue #14814: [SPARK-17242][Document]Update links of external dstream ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14814 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64434/ Test PASSed.
[GitHub] spark issue #14814: [SPARK-17242][Document]Update links of external dstream ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14814 Merged build finished. Test PASSed.
[GitHub] spark issue #14814: [SPARK-17242][Document]Update links of external dstream ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14814 **[Test build #64434 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64434/consoleFull)** for PR 14814 at commit [`17bb37e`](https://github.com/apache/spark/commit/17bb37e529b69823858d3e5edb0891a1ba6c9205).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14814: [SPARK-17242][Document]Update links of external dstream ...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/14814 /cc @rxin
[GitHub] spark issue #14637: [SPARK-16967] move mesos to module
Github user mgummelt commented on the issue: https://github.com/apache/spark/pull/14637 retest this please
[GitHub] spark pull request #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/8880#discussion_r76318724

--- Diff: core/src/main/scala/org/apache/spark/security/CryptoStreamUtils.scala ---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.security
+
+import java.io.{InputStream, OutputStream}
+import java.util.Properties
+import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}
+
+import org.apache.commons.crypto.random._
+import org.apache.commons.crypto.stream._
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.SparkConf
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.internal.config._
+
+/**
+ * A util class for manipulating IO encryption and decryption streams.
+ */
+private[spark] object CryptoStreamUtils {
+  /**
+   * Constants and variables for spark IO encryption
+   */
+  val SPARK_IO_TOKEN = new Text("SPARK_IO_TOKEN")
+
+  // The initialization vector length in bytes.
+  val IV_LENGTH_IN_BYTES = 16
+  // The prefix of IO encryption related configurations in Spark configuration.
+  val SPARK_IO_ENCRYPTION_COMMONS_CONFIG_PREFIX = "spark.io.encryption.commons.config."
+  // The prefix for the configurations passing to Apache Commons Crypto library.
+  val COMMONS_CRYPTO_CONF_PREFIX = "commons.crypto."
+
+  /**
+   * Helper method to wrap [[OutputStream]] with [[CryptoOutputStream]] for encryption.
+   */
+  def createCryptoOutputStream(
+      os: OutputStream,
+      sparkConf: SparkConf): OutputStream = {
+    val properties = toCryptoConf(sparkConf, SPARK_IO_ENCRYPTION_COMMONS_CONFIG_PREFIX,
+      COMMONS_CRYPTO_CONF_PREFIX)
+    val iv = createInitializationVector(properties)
+    os.write(iv)
+    val credentials = SparkHadoopUtil.get.getCurrentUserCredentials()
+    val key = credentials.getSecretKey(SPARK_IO_TOKEN)
+    val transformationStr = sparkConf.get(IO_CRYPTO_CIPHER_TRANSFORMATION)
+    new CryptoOutputStream(transformationStr, properties, os,
+      new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
+  }
+
+  /**
+   * Helper method to wrap [[InputStream]] with [[CryptoInputStream]] for decryption.
+   */
+  def createCryptoInputStream(
+      is: InputStream,
+      sparkConf: SparkConf): InputStream = {
+    val properties = toCryptoConf(sparkConf, SPARK_IO_ENCRYPTION_COMMONS_CONFIG_PREFIX,
+      COMMONS_CRYPTO_CONF_PREFIX)
+    val iv = new Array[Byte](IV_LENGTH_IN_BYTES)
+    is.read(iv, 0, iv.length)
+    val credentials = SparkHadoopUtil.get.getCurrentUserCredentials()
+    val key = credentials.getSecretKey(SPARK_IO_TOKEN)
+    val transformationStr = sparkConf.get(IO_CRYPTO_CIPHER_TRANSFORMATION)
+    new CryptoInputStream(transformationStr, properties, is,
+      new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
+  }
+
+  /**
+   * Get Commons-crypto configurations from Spark configurations identified by prefix.
+   */
+  def toCryptoConf(
+      conf: SparkConf,
+      sparkPrefix: String,
+      cryptoPrefix: String): Properties = {
--- End diff --

nit: you don't need `sparkPrefix` and `cryptoPrefix` any more.
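The prefix remapping that `toCryptoConf` performs (collect the Spark-side keys, re-key them under the Commons Crypto prefix) can be sketched in a few lines. This is an illustrative standalone Python version, not Spark's implementation; the function name and sample config key are assumptions for the example.

```python
def to_crypto_conf(conf,
                   spark_prefix="spark.io.encryption.commons.config.",
                   crypto_prefix="commons.crypto."):
    """Collect entries under spark_prefix and re-key them under crypto_prefix."""
    props = {}
    for key, value in conf.items():
        if key.startswith(spark_prefix):
            props[crypto_prefix + key[len(spark_prefix):]] = value
    return props

conf = {
    "spark.io.encryption.commons.config.secure.random.classes": "OS_RANDOM",
    "spark.master": "local[*]",  # unrelated keys are ignored
}
print(to_crypto_conf(conf))
```

The reviewer's nit follows from this shape: once both prefixes are fixed constants, they no longer need to be parameters.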
[GitHub] spark issue #14637: [SPARK-16967] move mesos to module
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14637 Merged build finished. Test FAILed.
[GitHub] spark issue #14637: [SPARK-16967] move mesos to module
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14637 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64430/ Test FAILed.
[GitHub] spark issue #14637: [SPARK-16967] move mesos to module
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14637 **[Test build #64430 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64430/consoleFull)** for PR 14637 at commit [`09f3197`](https://github.com/apache/spark/commit/09f3197e7cac9a45315bf5bdaed57c97bcd0e46d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14814: [SPARK-17242][Document]Update links of external dstream ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14814 **[Test build #64434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64434/consoleFull)** for PR 14814 at commit [`17bb37e`](https://github.com/apache/spark/commit/17bb37e529b69823858d3e5edb0891a1ba6c9205).
[GitHub] spark pull request #14814: [SPARK-17242][Document]Update links of external d...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/14814

[SPARK-17242][Document]Update links of external dstream projects

## What changes were proposed in this pull request?

Updated links of external dstream projects.

## How was this patch tested?

Just document changes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zsxwing/spark dstream-link

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14814.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14814

commit 17bb37e529b69823858d3e5edb0891a1ba6c9205
Author: Shixiong Zhu
Date: 2016-08-25T20:16:34Z

    Update links of external dstream projects
[GitHub] spark issue #14813: [SPARK-17240][core] Make SparkConf serializable again.
Github user mgummelt commented on the issue: https://github.com/apache/spark/pull/14813 thanks! LGTM
[GitHub] spark issue #14813: [SPARK-17240][core] Make SparkConf serializable again.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14813 **[Test build #64433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64433/consoleFull)** for PR 14813 at commit [`45cf302`](https://github.com/apache/spark/commit/45cf3028e778f9685224612829814a108932242c).
[GitHub] spark issue #14637: [SPARK-16967] move mesos to module
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/14637 LGTM now.
[GitHub] spark pull request #14777: [SPARK-17205] Literal.sql should handle Infinity ...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/14777#discussion_r76313519

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala ---
@@ -251,8 +251,21 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression with
     case (v: Short, ShortType) => v + "S"
     case (v: Long, LongType) => v + "L"
     // Float type doesn't have a suffix
-    case (v: Float, FloatType) => s"CAST($v AS ${FloatType.sql})"
-    case (v: Double, DoubleType) => v + "D"
+    case (v: Float, FloatType) =>
+      val castedValue = v match {
+        case _ if v.isNaN => "'NaN'"
+        case Float.PositiveInfinity => "'Infinity'"
+        case Float.NegativeInfinity => "'-Infinity'"
+        case _ => v
+      }
+      s"CAST($castedValue AS ${FloatType.sql})"
+    case (v: Double, DoubleType) =>
+      v match {
+        case _ if v.isNaN => s"CAST('NaN' AS ${DoubleType.sql})"
+        case Double.PositiveInfinity => s"CAST('Infinity' AS ${DoubleType.sql})"
+        case Double.NegativeInfinity => s"CAST('-Infinity' AS ${DoubleType.sql})"
+        case _ => v + "D"
+      }
     case (v: Decimal, t: DecimalType) => s"CAST($v AS ${t.sql})"
--- End diff --

According to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-FloatingPointTypes:

> Floating point literals are assumed to be DOUBLE. Scientific notation is not yet supported.

However, the professed lack of support for scientific notation seems to be contradicted by https://issues.apache.org/jira/browse/HIVE-2536 and manual tests. Here's a test query which demonstrates the precision issues in decimal literals:

```
SELECT
  CAST(-0.06688467811848818630 as DECIMAL(38, 36)),
  CAST(-6.688467811848818630E-18 AS DECIMAL(38, 36))
```

In Hive, these both behave equivalently: both forms of the number are interpreted as doubles, so we lose precision and both cases wind up as `0.06688467811848818` (with the final three digits lost).

In Spark 2.0, the first expanded form is parsed as a decimal literal, while the scientific-notation form is parsed as a double, so the expanded form correctly preserves the decimal while the scientific notation causes precision loss (as in Hive). I think there are two possible fixes here: we could either emit the fully expanded form or update Spark's parser to treat scientific-notation floating-point literals as decimals. From a consistency standpoint, I'm in favor of the latter approach because I don't think it makes sense for `1.1` and `1.1e0` to be treated differently.

Given all of this, I think that it would certainly be _safe_ to emit fully expanded forms of the decimal, but I'm not sure this is the optimal fix because it doesn't resolve the inconsistencies between Spark and Hive, and it results in really ugly, hard-to-read expressions.
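The precision loss described above is a property of IEEE-754 doubles, not of Hive specifically, and can be reproduced outside of SQL. A small standalone Python sketch (not Spark or Hive code) with a 20-significant-digit literal:

```python
from decimal import Decimal

literal = "-0.06688467811848818630"  # 20 significant digits

as_double = float(literal)     # IEEE-754 double: ~15-17 significant digits survive
as_decimal = Decimal(literal)  # arbitrary-precision: all digits survive

print(repr(as_double))   # trailing digits are lost
print(as_decimal)        # round-trips exactly
```

This is why parsing a literal as a double before casting to `DECIMAL(38, 36)` discards the final digits, whereas parsing it as a decimal literal preserves them.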
[GitHub] spark pull request #14813: [SPARK-17240][core] Make SparkConf serializable a...
GitHub user vanzin opened a pull request: https://github.com/apache/spark/pull/14813

[SPARK-17240][core] Make SparkConf serializable again.

Make the config reader transient, and initialize it lazily so that serialization works with both Java and Kryo (and hopefully any other custom serializer). Added a unit test to make sure SparkConf remains serializable and the reader works with both built-in serializers.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-17240

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14813.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14813

commit 45cf3028e778f9685224612829814a108932242c
Author: Marcelo Vanzin
Date: 2016-08-25T19:49:52Z

    [SPARK-17240][core] Make SparkConf serializable again. Make the config reader transient, and initialize it lazily so that serialization works with both java and kryo (and hopefully any other custom serializer). Added unit test to make sure SparkConf remains serializable and the reader works with both built-in serializers.
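The transient-plus-lazy pattern the patch describes (exclude the reader from serialization, rebuild it on first use after deserialization) can be sketched with Python's pickle. The class and field names here are illustrative stand-ins, not Spark's actual code:

```python
import pickle

class Conf:
    def __init__(self):
        self.settings = {"spark.app.name": "demo"}
        self._reader = None  # "transient": never serialized, rebuilt lazily

    @property
    def reader(self):
        # Lazy init: works after construction and after deserialization alike.
        if self._reader is None:
            self._reader = dict(self.settings)
        return self._reader

    def __getstate__(self):
        # Drop the transient field so every serializer sees a plain dict.
        state = self.__dict__.copy()
        state["_reader"] = None
        return state

conf = Conf()
_ = conf.reader  # force initialization before serializing
restored = pickle.loads(pickle.dumps(conf))
print(restored.reader["spark.app.name"])
```

The same idea applies regardless of serializer: because the transient field is always reset to a sentinel before writing, any round-trip (Java, Kryo, or here pickle) produces an object whose reader re-initializes itself on first access.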
[GitHub] spark issue #14812: [SPARK-17237][SQL] Remove unnecessary backticks in a piv...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14812 **[Test build #64432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64432/consoleFull)** for PR 14812 at commit [`530d5c0`](https://github.com/apache/spark/commit/530d5c03b414d9743944d532ecb9e9bd1c0bf5a5).
[GitHub] spark pull request #14812: [SPARK-17237][SQL] Remove unnecessary backticks i...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/14812

[SPARK-17237][SQL] Remove unnecessary backticks in a pivot result schema

## What changes were proposed in this pull request?

A schema of pivot results has nested backticks (e.g. \`3_count(\`c\`)\`). Since `Dataset#resolve` cannot handle the nested backticks, these column references fail after pivoting. This PR removes the unnecessary backticks.

## How was this patch tested?

Added a test in `DataFrameAggregateSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-17237

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14812.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14812

commit 530d5c03b414d9743944d532ecb9e9bd1c0bf5a5
Author: Takeshi YAMAMURO
Date: 2016-08-25T19:17:50Z

    Fix a bug to handle missing data after pivoting
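The nested-backtick problem comes from quoting a column name that already contains a quoted inner identifier. A toy illustration in Python (the `quote` helper is hypothetical, mimicking backtick-escaping for identifiers, and is not the actual Spark code):

```python
def quote(name):
    # Quote an identifier, escaping any embedded backticks by doubling them.
    return "`" + name.replace("`", "``") + "`"

agg_column = "count(`c`)"        # already contains a quoted inner identifier
bad = quote("3_" + agg_column)   # re-quoting nests backticks: `3_count(``c``)`
good = "3_" + agg_column         # keep the composed name as-is

print(bad)
print(good)
```

A resolver that does not understand the doubled-backtick escape cannot match `bad` back to the column, which is the failure mode the PR describes; the fix is to stop re-quoting the already-composed name.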
[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14176 @davies I guess there is still benefit to make it public? If the user knows that their workload would always run faster with single-level, e.g., many distinct keys. I thought about `spark.sql.codegen.aggregate.map.fast.enable` or `spark.sql.codegen.aggregate.map.codegen.enable`, but none of them captures the fact that the biggest distinction is the two-level design.
[GitHub] spark issue #14811: [SPARK-17231][CORE] Avoid building debug or trace log me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14811 **[Test build #64431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64431/consoleFull)** for PR 14811 at commit [`e44d943`](https://github.com/apache/spark/commit/e44d94316e1641ea7db34efab5f0d669090d2599).
[GitHub] spark issue #14798: [SPARK-17231][CORE] Avoid building debug or trace log me...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14798 @zsxwing PR #14811 is a backport of this PR to `branch-2.0`.
[GitHub] spark pull request #14811: [SPARK-17231][CORE] Avoid building debug or trace...
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/14811

[SPARK-17231][CORE] Avoid building debug or trace log messages unless

This is simply a backport of #14798 to `branch-2.0`. This backport omits the change to `ExternalShuffleBlockHandler.java`. In `branch-2.0`, that file does not contain the log message that was patched in `master`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-17231-logging_perf_improvements-2.0_backport

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14811.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14811

commit e44d94316e1641ea7db34efab5f0d669090d2599
Author: Michael Allman
Date: 2016-08-25T19:06:45Z

    [SPARK-17231][CORE] Avoid building debug or trace log messages unless the respective log level is enabled
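The optimization being backported is the classic guard against eagerly building log strings that a disabled level would discard anyway. A standalone Python sketch of the same idea (illustrative, not the Spark patch itself), counting how often the costly message computation actually runs:

```python
import logging

logging.basicConfig(level=logging.INFO)  # DEBUG is disabled
log = logging.getLogger("demo")

calls = {"n": 0}

def expensive_repr():
    # Stands in for a costly message computation (string building, traversal, ...).
    calls["n"] += 1
    return "big-state"

# Unguarded: the message is built even though DEBUG is disabled.
log.debug("state: " + expensive_repr())

# Guarded: the costly work is skipped entirely when DEBUG is off.
if log.isEnabledFor(logging.DEBUG):
    log.debug("state: " + expensive_repr())

print(calls["n"])  # only the unguarded call paid the cost
```

In Scala, `Logging.logDebug(msg: => String)` achieves the same effect with a by-name parameter, so the message expression is only evaluated when the level check passes.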
[GitHub] spark pull request #14774: [SparkR][BUILD]:ignore cran-check.out under R fol...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14774
[GitHub] spark pull request #14777: [SPARK-17205] Literal.sql should handle Infinity ...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/14777#discussion_r76305691

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala ---
@@ -251,8 +251,21 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression with
     case (v: Short, ShortType) => v + "S"
     case (v: Long, LongType) => v + "L"
     // Float type doesn't have a suffix
-    case (v: Float, FloatType) => s"CAST($v AS ${FloatType.sql})"
-    case (v: Double, DoubleType) => v + "D"
+    case (v: Float, FloatType) =>
+      val castedValue = v match {
+        case _ if v.isNaN => "'NaN'"
+        case Float.PositiveInfinity => "'Infinity'"
+        case Float.NegativeInfinity => "'-Infinity'"
+        case _ => v
+      }
+      s"CAST($castedValue AS ${FloatType.sql})"
+    case (v: Double, DoubleType) =>
+      v match {
+        case _ if v.isNaN => s"CAST('NaN' AS ${DoubleType.sql})"
+        case Double.PositiveInfinity => s"CAST('Infinity' AS ${DoubleType.sql})"
+        case Double.NegativeInfinity => s"CAST('-Infinity' AS ${DoubleType.sql})"
+        case _ => v + "D"
+      }
     case (v: Decimal, t: DecimalType) => s"CAST($v AS ${t.sql})"
--- End diff --

Actually, let me go ahead and quickly confirm whether Hive will support full expansion...
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r76305079

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -237,21 +237,27 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
       new Path(metastoreRelation.catalogTable.storage.locationUri.get),
       partitionSpec)

-    val inferredSchema = if (fileType.equals("parquet")) {
-      val inferredSchema =
-        defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
-      inferredSchema.map { inferred =>
-        ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred)
-      }.getOrElse(metastoreSchema)
-    } else {
-      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles()).get
+    val schema = fileType match {
+      case "parquet" =>
+        val inferredSchema =
+          defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
+
+        // For Parquet, get correct schema by merging Metastore schema data types
--- End diff --

I think we have a test. @liancheng should have more info. But, one clarification is that this merging is based on column name (we do take care of the case sensitivity issue, though). So, if you want to give a column a different name, I think that is not doable.
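The name-based merge yhuai describes can be sketched as follows (a simplified illustration, not Spark's actual `mergeMetastoreParquetSchema`; `Field` is a stand-in for `StructField`): the metastore supplies the column names and order, and a Parquet-inferred type is adopted only when a column with the same case-insensitive name exists on both sides, which is why a rename cannot survive the merge.

```scala
// Simplified name-based schema merge: metastore names/order win, inferred
// types are taken for matching (case-insensitive) names.
case class Field(name: String, dataType: String)

def mergeByName(metastore: Seq[Field], inferred: Seq[Field]): Seq[Field] = {
  val inferredByName = inferred.map(f => f.name.toLowerCase -> f.dataType).toMap
  metastore.map { m =>
    m.copy(dataType = inferredByName.getOrElse(m.name.toLowerCase, m.dataType))
  }
}

val merged = mergeByName(
  Seq(Field("ID", "int"), Field("name", "string")),
  Seq(Field("id", "bigint")))
println(merged)  // List(Field(ID,bigint), Field(name,string))
```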
[GitHub] spark pull request #14798: [SPARK-17231][CORE] Avoid building debug or trace...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14798
[GitHub] spark issue #14798: [SPARK-17231][CORE] Avoid building debug or trace log me...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14798 Will do
[GitHub] spark issue #14798: [SPARK-17231][CORE] Avoid building debug or trace log me...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/14798 @mallman It has some conflicts with 2.0. Could you submit another PR for branch 2.0, please? Thanks!
[GitHub] spark issue #14798: [SPARK-17231][CORE] Avoid building debug or trace log me...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/14798 LGTM. I was just thinking to work on this yesterday! Thanks, merging to master and 2.0.
[GitHub] spark issue #14637: [SPARK-16967] move mesos to module
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14637 **[Test build #64430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64430/consoleFull)** for PR 14637 at commit [`09f3197`](https://github.com/apache/spark/commit/09f3197e7cac9a45315bf5bdaed57c97bcd0e46d).
[GitHub] spark pull request #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8880#discussion_r76298874

--- Diff: yarn/src/test/scala/org/apache/spark/security/IOEncryptionSuite.scala ---
@@ -0,0 +1,332 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.security
--- End diff --

This is still in the "yarn" module. Weren't you going to move it to "core"? (As in the physical location of the file, not the scala package name.)
[GitHub] spark pull request #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8880#discussion_r76298505

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -413,6 +414,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
     }
     if (master == "yarn" && deployMode == "client") System.setProperty("SPARK_YARN_MODE", "true")

+    if (_conf.get(IO_ENCRYPTION_ENABLED) && !SparkHadoopUtil.get.isYarnMode()) {
+      throw new SparkException("IO encryption is only supported in YARN mode, please disable it " +
+        "by setting spark.io.encryption.enabled to false")
--- End diff --

nit: use `${IO_ENCRYPTION_ENABLED.key}` instead of the hardcoded key name.
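vanzin's nit can be illustrated with a minimal sketch (`ConfigEntry` here is a stand-in for Spark's internal config machinery, and `checkIoEncryption` a hypothetical helper): interpolating the entry's `key` field keeps the error message in sync if the key is ever renamed.

```scala
// Minimal sketch of the suggested fix: reference the config entry's key
// instead of hardcoding the string "spark.io.encryption.enabled".
case class ConfigEntry[T](key: String)
val IO_ENCRYPTION_ENABLED = ConfigEntry[Boolean]("spark.io.encryption.enabled")

def checkIoEncryption(enabled: Boolean, isYarnMode: Boolean): Unit = {
  if (enabled && !isYarnMode) {
    throw new IllegalArgumentException(
      s"IO encryption is only supported in YARN mode, please disable it " +
        s"by setting ${IO_ENCRYPTION_ENABLED.key} to false")
  }
}

checkIoEncryption(enabled = false, isYarnMode = false)  // no-op: encryption off
```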
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14239 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64427/ Test PASSed.
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14239 Merged build finished. Test PASSed.
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14239 **[Test build #64427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64427/consoleFull)** for PR 14239 at commit [`190d7fa`](https://github.com/apache/spark/commit/190d7fa8e8e2b0795e12eebba568be7428647f68). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #13852: [SPARK-16200][SQL] Rename AggregateFunction#suppo...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/13852#discussion_r76292731

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala ---
@@ -45,7 +45,7 @@ abstract class Collect extends ImperativeAggregate {

   override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)

-  override def supportsPartial: Boolean = false
+  override def forceSortAggregate: Boolean = true
--- End diff --

yea. Either way, it seems partial aggregation becomes meaningless in the future.
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14710 Merged build finished. Test FAILed.
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14710 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64429/ Test FAILed.
[GitHub] spark pull request #14637: [SPARK-16967] move mesos to module
Github user mgummelt commented on a diff in the pull request: https://github.com/apache/spark/pull/14637#discussion_r76292137

--- Diff: dev/create-release/release-build.sh ---
@@ -186,12 +186,13 @@ if [[ "$1" == "package" ]]; then
   # We increment the Zinc port each time to avoid OOM's and other craziness if multiple builds
   # share the same Zinc server.
-  make_binary_release "hadoop2.3" "-Psparkr -Phadoop-2.3 -Phive -Phive-thriftserver -Pyarn" "3033" &
-  make_binary_release "hadoop2.4" "-Psparkr -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn" "3034" &
-  make_binary_release "hadoop2.6" "-Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn" "3035" &
-  make_binary_release "hadoop2.7" "-Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn" "3036" &
-  make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn" "3037" &
-  make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn" "3038" &
+  FLAGS="-Psparkr -Phadoop-2.3 -Phive -Phive-thriftserver -Pyarn -Pmesos"
+  make_binary_release "hadoop2.3" "$FLAGS" "3033" &
+  make_binary_release "hadoop2.4" "$FLAGS" "3034" &
--- End diff --

ah, yea. fixing...
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14710 **[Test build #64429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64429/consoleFull)** for PR 14710 at commit [`380291b`](https://github.com/apache/spark/commit/380291b7122aaf1fab461a07d72f0c285696c967). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12004 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64428/ Test FAILed.
[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12004 Merged build finished. Test FAILed.
[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12004 **[Test build #64428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64428/consoleFull)** for PR 12004 at commit [`b25d497`](https://github.com/apache/spark/commit/b25d49701b4015b49efc6c89734301525d803524). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14637: [SPARK-16967] move mesos to module
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/14637#discussion_r76291436

--- Diff: dev/create-release/release-build.sh ---
@@ -186,12 +186,13 @@ if [[ "$1" == "package" ]]; then
   # We increment the Zinc port each time to avoid OOM's and other craziness if multiple builds
   # share the same Zinc server.
-  make_binary_release "hadoop2.3" "-Psparkr -Phadoop-2.3 -Phive -Phive-thriftserver -Pyarn" "3033" &
-  make_binary_release "hadoop2.4" "-Psparkr -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn" "3034" &
-  make_binary_release "hadoop2.6" "-Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn" "3035" &
-  make_binary_release "hadoop2.7" "-Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn" "3036" &
-  make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn" "3037" &
-  make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn" "3038" &
+  FLAGS="-Psparkr -Phadoop-2.3 -Phive -Phive-thriftserver -Pyarn -Pmesos"
+  make_binary_release "hadoop2.3" "$FLAGS" "3033" &
+  make_binary_release "hadoop2.4" "$FLAGS" "3034" &
--- End diff --

This is wrong now; "FLAGS" enables "-Phadoop-2.3" when here it should be "-Phadoop-2.4" (and matching versions in the lines below).
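A sketch of the fix vanzin is asking for: keep the shared profiles in one variable but pass the Hadoop profile per release, instead of baking `-Phadoop-2.3` into every binary package (`BASE_FLAGS` is a name introduced here, and `make_binary_release` is stubbed out for illustration — the real one lives in release-build.sh).

```shell
#!/usr/bin/env bash
# Shared profiles declared once; the Hadoop profile varies per binary release.
BASE_FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"

# Stub standing in for the real make_binary_release from release-build.sh:
# it just prints what would be built.
make_binary_release() {
  echo "building $1 with: $2 (zinc port $3)"
}

make_binary_release "hadoop2.3" "$BASE_FLAGS -Phadoop-2.3" "3033"
make_binary_release "hadoop2.4" "$BASE_FLAGS -Phadoop-2.4" "3034"
make_binary_release "hadoop2.6" "$BASE_FLAGS -Phadoop-2.6" "3035"
make_binary_release "hadoop2.7" "$BASE_FLAGS -Phadoop-2.7" "3036"
```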
[GitHub] spark pull request #13852: [SPARK-16200][SQL] Rename AggregateFunction#suppo...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13852#discussion_r76291119

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala ---
@@ -45,7 +45,7 @@ abstract class Collect extends ImperativeAggregate {

   override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)

-  override def supportsPartial: Boolean = false
+  override def forceSortAggregate: Boolean = true
--- End diff --

oh, after changing this name, it will no longer show that we do not do partial aggregation for this function.
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14753 Merged build finished. Test PASSed.
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14753 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64426/ Test PASSed.
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14753 **[Test build #64426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64426/consoleFull)** for PR 14753 at commit [`ca574e1`](https://github.com/apache/spark/commit/ca574e145543c6fc555220fa8080bf7fbe152ba5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14710 **[Test build #64429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64429/consoleFull)** for PR 14710 at commit [`380291b`](https://github.com/apache/spark/commit/380291b7122aaf1fab461a07d72f0c285696c967).
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/14710 retest this please
[GitHub] spark pull request #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8880#discussion_r76289131

--- Diff: yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnIOEncryptionSuite.scala ---
@@ -0,0 +1,335 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.yarn
+
+import java.io._
+import java.nio.ByteBuffer
+import java.security.PrivilegedExceptionAction
+import java.util.{ArrayList => JArrayList, LinkedList => JLinkedList, UUID}
+
+import scala.runtime.AbstractFunction1
+
+import com.google.common.collect.HashMultiset
+import com.google.common.io.ByteStreams
+import org.apache.hadoop.security.{Credentials, UserGroupInformation}
+import org.junit.Assert.assertEquals
+import org.mockito.Mock
+import org.mockito.MockitoAnnotations
+import org.mockito.invocation.InvocationOnMock
+import org.mockito.stubbing.Answer
+import org.mockito.Answers.RETURNS_SMART_NULLS
+import org.mockito.Matchers.{eq => meq, _}
+import org.mockito.Mockito._
+import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach, Matchers}
+
+import org.apache.spark._
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.executor.{ShuffleWriteMetrics, TaskMetrics}
+import org.apache.spark.internal.config._
+import org.apache.spark.io.CompressionCodec
+import org.apache.spark.memory.{TaskMemoryManager, TestMemoryManager}
+import org.apache.spark.network.buffer.NioManagedBuffer
+import org.apache.spark.network.util.LimitedInputStream
+import org.apache.spark.security.CryptoStreamUtils
+import org.apache.spark.serializer._
+import org.apache.spark.shuffle._
+import org.apache.spark.shuffle.sort.{SerializedShuffleHandle, UnsafeShuffleWriter}
+import org.apache.spark.storage._
+import org.apache.spark.util.Utils
+
+private[spark] class YarnIOEncryptionSuite extends SparkFunSuite with Matchers with
--- End diff --

> Do you mean we need not unset this ENV variable in the tear down block?

I mean that if you change the check in SparkContext to not throw an exception when "spark.testing" is set, you shouldn't need to set/unset "SPARK_YARN_MODE" in the test.
[GitHub] spark issue #14801: [SPARK-17234] [SQL] Table Existence Checking when Index ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14801 Sure, will revert it back and use the existing `AnalysisException`. Thanks!
[GitHub] spark issue #14809: [SPARK-17238][SQL] simplify the logic for converting dat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14809 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64425/ Test PASSed.
[GitHub] spark issue #14809: [SPARK-17238][SQL] simplify the logic for converting dat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14809 Merged build finished. Test PASSed.
[GitHub] spark issue #14809: [SPARK-17238][SQL] simplify the logic for converting dat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14809 **[Test build #64425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64425/consoleFull)** for PR 14809 at commit [`915d2b5`](https://github.com/apache/spark/commit/915d2b5a1dd8c26a37d0b99ba0503a0d95b6f3f3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14777: [SPARK-17205] Literal.sql should handle Infinity ...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/14777#discussion_r76284138

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala ---
@@ -251,8 +251,21 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression with
     case (v: Short, ShortType) => v + "S"
     case (v: Long, LongType) => v + "L"
     // Float type doesn't have a suffix
-    case (v: Float, FloatType) => s"CAST($v AS ${FloatType.sql})"
-    case (v: Double, DoubleType) => v + "D"
+    case (v: Float, FloatType) =>
+      val castedValue = v match {
+        case _ if v.isNaN => "'NaN'"
+        case Float.PositiveInfinity => "'Infinity'"
+        case Float.NegativeInfinity => "'-Infinity'"
+        case _ => v
+      }
+      s"CAST($castedValue AS ${FloatType.sql})"
+    case (v: Double, DoubleType) =>
+      v match {
+        case _ if v.isNaN => s"CAST('NaN' AS ${DoubleType.sql})"
+        case Double.PositiveInfinity => s"CAST('Infinity' AS ${DoubleType.sql})"
+        case Double.NegativeInfinity => s"CAST('-Infinity' AS ${DoubleType.sql})"
+        case _ => v + "D"
+      }
     case (v: Decimal, t: DecimalType) => s"CAST($v AS ${t.sql})"
--- End diff --

Hmmm, as discussed, that's going to look very ugly but might be more compatible with Postgres and won't be lossy for very precise decimals. I say that we defer to a follow-up for now.
[GitHub] spark pull request #14794: [SPARK-15083][WEB UI] History Server can OOM due ...
Github user ajbozarth closed the pull request at: https://github.com/apache/spark/pull/14794
[GitHub] spark issue #14794: [SPARK-15083][WEB UI] History Server can OOM due to unli...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/14794 Thanks @tgravescs
[GitHub] spark issue #14774: [SparkR][BUILD]:ignore cran-check.out under R folder
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14774 LGTM. Thanks @wangmiao1981
[GitHub] spark issue #14794: [SPARK-15083][WEB UI] History Server can OOM due to unli...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/14794 +1
[GitHub] spark issue #14798: [SPARK-17231][CORE] Avoid building debug or trace log me...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14798 I focused mainly on trace and debug logging. I didn't do much with errors or warnings, especially where exceptions are logged. I'm assuming these are less frequent, and the cost of building those log messages is insignificant compared to the circumstances which called for them in the first place.
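The general technique behind the PR — only build the message when the level is actually enabled — can be sketched outside Spark; `log_debug` below is a hypothetical Python illustration (Spark's `Logging` trait achieves the same effect with Scala by-name parameters):

```python
import logging

def log_debug(logger: logging.Logger, make_msg) -> None:
    """Build the message only if DEBUG is enabled. `make_msg` is a
    zero-argument callable, analogous to a Scala by-name parameter."""
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(make_msg())

calls = []
def expensive() -> str:
    calls.append(1)  # records whether the message string was ever built
    return "state: " + ",".join(str(i) for i in range(5))

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)   # DEBUG is disabled
log_debug(logger, expensive)    # expensive() is never invoked here
```

With the level at INFO, `expensive()` is never called, which is exactly the cost the PR avoids for trace/debug messages.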
[GitHub] spark issue #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in TestingU...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/14785 Please also add test cases for matrices. Thanks.
[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14452 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64422/ Test PASSed.
[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14452 Merged build finished. Test PASSed.
[GitHub] spark issue #14810: Branch 1.6
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14810 Looks like an error -- close this please
[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14452

**[Test build #64422 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64422/consoleFull)** for PR 14452 at commit [`6cb40f1`](https://github.com/apache/spark/commit/6cb40f12e074e0350aa01778c955b35631160858).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14750 Merged build finished. Test PASSed.
[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14750 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64421/ Test PASSed.
[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14750

**[Test build #64421 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64421/consoleFull)** for PR 14750 at commit [`5b41a39`](https://github.com/apache/spark/commit/5b41a3973abbe25bccbeaa2718bb6ef209303bee).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14617: [SPARK-17019][Core] Expose on-heap and off-heap memory u...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14617 @jerryshao The UI changes look great. I have not had a chance to scrutinize the source changes. Hopefully we can get someone else to help review.
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user f7753 commented on the issue: https://github.com/apache/spark/pull/14239 @tgravescs Thank you. Currently I'm not loading all the data into memory; I use the parameter `spark.shuffle.prepare.open` to switch this mechanism on/off and `spark.shuffle.prepare.count` to control the number of blocks to cache. This gives the user control over the memory used for pre-fetched blocks, based on their machine's capacity. The OS cache may not have much impact on this (if my understanding is wrong, please correct me, thanks), since a shuffle block produced by the map side will not be read more than once in a normal job. Once a shuffle block has been consumed by the reduce side it is of no further use, so it may still be sitting in the write buffer. If there is enough memory, this will not slow down reading; if not, we can use the limited memory to preload the data. When a transfer succeeds, the memory buffer is released to load the data the next `FetchRequest` contains, until all the data has been sent to the reduce side. I have implemented and tested this on branches 1.4 and 1.6 using Intel HiBench 4.0 TeraSort with a 1TB data size, and got about a 30% performance improvement on a 5-node cluster, where each node has 96GB memory, a Xeon E5 v3 CPU, and a 7200RPM disk. We could also consult prior work to refine this further, e.g. "HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment". Thanks for your feedback; it would be my pleasure to cooperate on any follow-up work. I love Spark so much.
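The bounded prefetching described above can be sketched as a generator that never holds more than a configured number of fetched-ahead blocks; `prefetch_blocks` below is a hypothetical Python illustration of the idea (keyed off the PR's proposed `spark.shuffle.prepare.count` knob), not the PR's actual implementation:

```python
from collections import deque

def prefetch_blocks(block_ids, fetch, prepare_count=2):
    """Yield shuffle blocks while keeping at most `prepare_count`
    fetched-ahead blocks buffered in memory at any time."""
    buffer = deque()
    ids = iter(block_ids)
    # Prime the buffer up to the configured limit.
    for _ in range(prepare_count):
        try:
            buffer.append(fetch(next(ids)))
        except StopIteration:
            break
    while buffer:
        yield buffer.popleft()                 # hand one block to the reducer...
        try:
            buffer.append(fetch(next(ids)))    # ...and refill behind it
        except StopIteration:
            pass
```

Each block is handed out exactly once, and memory use stays bounded by `prepare_count` regardless of how many blocks the map side produced.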
[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...
Github user davies commented on the issue: https://github.com/apache/spark/pull/14176 Can we make this `spark.sql.codegen.aggregate.map.twolevel.enable` internal? Otherwise we should have a better name.
[GitHub] spark issue #14810: Branch 1.6
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14810 Can one of the admins verify this patch?
[GitHub] spark pull request #14810: Branch 1.6
GitHub user sujan121 opened a pull request: https://github.com/apache/spark/pull/14810

Branch 1.6

## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)

## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-1.6

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14810.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14810

commit 7482c7b5aba5b649510bbb8886bbf2b44f86f543
Author: Shixiong Zhu
Date: 2016-01-18T23:38:03Z

    [SPARK-12814][DOCUMENT] Add deploy instructions for Python in flume integration doc

    This PR added instructions to get flume assembly jar for Python users in the flume integration page like Kafka doc.

    Author: Shixiong Zhu
    Closes #10746 from zsxwing/flume-doc.
    (cherry picked from commit a973f483f6b819ed4ecac27ff5c064ea13a8dd71)
    Signed-off-by: Tathagata Das

commit d43704d7fc6a5e9da4968b1dafa8d4b1c341ee8d
Author: Shixiong Zhu
Date: 2016-01-19T00:50:05Z

    [SPARK-12894][DOCUMENT] Add deploy instructions for Python in Kinesis integration doc

    This PR added instructions to get Kinesis assembly jar for Python users in the Kinesis integration page like Kafka doc.

    Author: Shixiong Zhu
    Closes #10822 from zsxwing/kinesis-doc.
    (cherry picked from commit 721845c1b64fd6e3b911bd77c94e01dc4e5fd102)
    Signed-off-by: Tathagata Das

commit 68265ac23e20305474daef14bbcf874308ca8f5a
Author: Wenchen Fan
Date: 2016-01-19T05:20:19Z

    [SPARK-12841][SQL][BRANCH-1.6] fix cast in filter

    In SPARK-10743 we wrap cast with `UnresolvedAlias` to give `Cast` a better alias if possible. However, for cases like filter, the `UnresolvedAlias` can't be resolved and actually we don't need a better alias for this case. This PR moves the cast wrapping logic to `Column.named` so that we will only do it when we need an alias name.

    backport https://github.com/apache/spark/pull/10781 to 1.6

    Author: Wenchen Fan
    Closes #10819 from cloud-fan/bug.

commit 30f55e5232d85fd070892444367d2bb386dfce13
Author: proflin
Date: 2016-01-19T08:15:43Z

    [SQL][MINOR] Fix one little mismatched comment according to the codes in interface.scala

    Author: proflin
    Closes #10824 from proflin/master.
    (cherry picked from commit c00744e60f77edb238aff1e30b450dca65451e91)
    Signed-off-by: Reynold Xin

commit 962e618ec159f8cd26543f42b2ce484fd5a5d8c5
Author: Wojciech Jurczyk
Date: 2016-01-19T09:36:45Z

    [MLLIB] Fix CholeskyDecomposition assertion's message

    Change assertion's message so it's consistent with the code. The old message says that the invoked method was lapack.dports, where in fact it was the lapack.dppsv method.

    Author: Wojciech Jurczyk
    Closes #10818 from wjur/wjur/rename_error_message.
    (cherry picked from commit ebd9ce0f1f55f7d2d3bd3b92c4b0a495c51ac6fd)
    Signed-off-by: Sean Owen

commit 40fa21856aded0e8b0852cdc2d8f8bc577891908
Author: Josh Rosen
Date: 2016-01-21T00:10:28Z

    [SPARK-12921] Use SparkHadoopUtil reflection in SpecificParquetRecordReaderBase

    It looks like there's one place left in the codebase, SpecificParquetRecordReaderBase, where we didn't use SparkHadoopUtil's reflective accesses of TaskAttemptContext methods, which could create problems when using a single Spark artifact with both Hadoop 1.x and 2.x.

    Author: Josh Rosen
    Closes #10843 from JoshRosen/SPARK-12921.

commit b5d7dbeb3110a11716f6642829f4ea14868ccc8a
Author: Liang-Chi Hsieh
Date: 2016-01-22T02:55:28Z

    [SPARK-12747][SQL] Use correct type name for Postgres JDBC's real array

    https://issues.apache.org/jira/browse/SPARK-12747
    Postgres JDBC driver uses "FLOAT4" or "FLOAT8" not "real".

    Author: Liang-Chi Hsieh
    Closes #10695 from viirya/fix-postgres-jdbc.
    (cherry picked from commit 55c7dd031b8a58976922e469626469aa4aff1391)
    Signed-off-by: Reynold Xin

commit
[GitHub] spark issue #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in TestingU...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14785 Merged build finished. Test PASSed.
[GitHub] spark issue #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in TestingU...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14785 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64424/ Test PASSed.
[GitHub] spark issue #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in TestingU...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14785

**[Test build #64424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64424/consoleFull)** for PR 14785 at commit [`1ec924c`](https://github.com/apache/spark/commit/1ec924cf8ca1bbe68fd5e700550dfa422e445b59).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12004 **[Test build #64428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64428/consoleFull)** for PR 12004 at commit [`b25d497`](https://github.com/apache/spark/commit/b25d49701b4015b49efc6c89734301525d803524).
[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14452 Merged build finished. Test PASSed.
[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14452 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64420/ Test PASSed.
[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14452

**[Test build #64420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64420/consoleFull)** for PR 14452 at commit [`6a8011b`](https://github.com/apache/spark/commit/6a8011bc9dfa3289e98a5efe65e92704b85bb4b5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14239 **[Test build #64427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64427/consoleFull)** for PR 14239 at commit [`190d7fa`](https://github.com/apache/spark/commit/190d7fa8e8e2b0795e12eebba568be7428647f68).
[GitHub] spark issue #14809: [SPARK-17238][SQL] simplify the logic for converting dat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14809 **[Test build #64425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64425/consoleFull)** for PR 14809 at commit [`915d2b5`](https://github.com/apache/spark/commit/915d2b5a1dd8c26a37d0b99ba0503a0d95b6f3f3).
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14753 @hvanhovell This is supposed to work with window functions.
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14753 **[Test build #64426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64426/consoleFull)** for PR 14753 at commit [`ca574e1`](https://github.com/apache/spark/commit/ca574e145543c6fc555220fa8080bf7fbe152ba5).
[GitHub] spark issue #14809: [SPARK-17238][SQL] simplify the logic for converting dat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14809 cc @yhuai @gatorsmile @liancheng @clockfly
[GitHub] spark pull request #14809: [SPARK-17238][SQL] simplify the logic for convert...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/14809

[SPARK-17238][SQL] simplify the logic for converting data source table into hive compatible format

## What changes were proposed in this pull request?

Previously we had 2 conditions to decide whether a data source table is hive-compatible:
1. the data source is file-based and has a corresponding Hive serde
2. there is a `path` entry in the data source options/storage properties

However, if condition 1 is true, condition 2 must be true too, as we will put the default table path into the data source options/storage properties for managed data source tables. There is also a potential issue: we will set the `locationUri` even for managed tables. This PR removes condition 2 and only sets the `locationUri` for external data source tables.

Note: this is also a first step to unify the `path` of data source tables and the `locationUri` of hive serde tables. For hive serde tables, `locationUri` is only set for external tables. For data source tables, `path` is always set. We can make them consistent after this PR.

## How was this patch tested?

existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark minor2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14809.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14809

commit 915d2b5a1dd8c26a37d0b99ba0503a0d95b6f3f3
Author: Wenchen Fan
Date: 2016-08-25T15:11:23Z

    simplify the logic for converting data source table into hive compatible format
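The simplified rule in the PR description can be sketched as a small decision function; `hive_storage` below is a hypothetical Python illustration of that description, not the PR's Scala code:

```python
def hive_storage(has_hive_serde: bool, is_external: bool, table_path: str):
    """Decide hive-compatible storage for a data source table, per the PR's
    simplified rule (illustrative sketch): serde availability alone decides
    compatibility, and locationUri is set only for external tables."""
    if not has_hive_serde:
        return None  # no corresponding Hive serde: not hive-compatible
    return {"locationUri": table_path if is_external else None}
```

Under this sketch, a managed table stays hive-compatible but leaves `locationUri` unset, matching the PR's stated fix.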
[GitHub] spark issue #14786: [SPARK-17212][SQL] TypeCoercion supports widening conver...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14786 Merged build finished. Test PASSed.
[GitHub] spark issue #14786: [SPARK-17212][SQL] TypeCoercion supports widening conver...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14786 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64419/ Test PASSed.
[GitHub] spark issue #14786: [SPARK-17212][SQL] TypeCoercion supports widening conver...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14786

**[Test build #64419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64419/consoleFull)** for PR 14786 at commit [`d035eb3`](https://github.com/apache/spark/commit/d035eb3ba725250f6238a5b8a189b6749065cf95).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14753: [SPARK-17187][SQL] Supports using arbitrary Java ...
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14753#discussion_r76264528

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/TypedImperativeAggregateSuite.scala ---
@@ -0,0 +1,300 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
+
+import org.apache.spark.sql.TypedImperativeAggregateSuite.TypedMax
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{BoundReference, Expression, GenericMutableRow, SpecificMutableRow}
+import org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate
+import org.apache.spark.sql.execution.aggregate.SortAggregateExec
+import org.apache.spark.sql.expressions.Window
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType, IntegerType, LongType}
+
+class TypedImperativeAggregateSuite extends QueryTest with SharedSQLContext {
+
+  import testImplicits._
+
+  private val random = new java.util.Random()
+
+  private val data = (0 until 1000).map { _ =>
+    (random.nextInt(10), random.nextInt(100))
+  }
+
+  test("aggregate with object aggregate buffer") {
+    val agg = new TypedMax(BoundReference(0, IntegerType, nullable = false))
+
+    val group1 = (0 until data.length / 2)
+    val group1Buffer = agg.createAggregationBuffer()
+    group1.foreach { index =>
+      val input = InternalRow(data(index)._1, data(index)._2)
+      agg.update(group1Buffer, input)
+    }
+
+    val group2 = (data.length / 2 until data.length)
+    val group2Buffer = agg.createAggregationBuffer()
+    group2.foreach { index =>
+      val input = InternalRow(data(index)._1, data(index)._2)
+      agg.update(group2Buffer, input)
+    }
+
+    val mergeBuffer = agg.createAggregationBuffer()
+    agg.merge(mergeBuffer, group1Buffer)
+    agg.merge(mergeBuffer, group2Buffer)
+
+    assert(mergeBuffer.value == data.map(_._1).max)
+    assert(agg.eval(mergeBuffer) == data.map(_._1).max)
+
+    // Tests low level eval(row: InternalRow) API.
+    val row = new GenericMutableRow(Array(mergeBuffer): Array[Any])
+
+    // Evaluates directly on row consist of aggregation buffer object.
+    assert(agg.eval(row) == data.map(_._1).max)
+  }
+
+  test("supports SpecificMutableRow as mutable row") {
+    val aggregationBufferSchema = Seq(IntegerType, LongType, BinaryType, IntegerType)
+    val aggBufferOffset = 2
+    val buffer = new SpecificMutableRow(aggregationBufferSchema)
+    val agg = new TypedMax(BoundReference(ordinal = 1, dataType = IntegerType, nullable = false))
+      .withNewMutableAggBufferOffset(aggBufferOffset)
+
+    agg.initialize(buffer)
+    data.foreach { kv =>
+      val input = InternalRow(kv._1, kv._2)
+      agg.update(buffer, input)
+    }
+    assert(agg.eval(buffer) == data.map(_._2).max)
+  }
+
+  test("dataframe aggregate with object aggregate buffer, should not use HashAggregate") {
+    val df = data.toDF("a", "b")
+    val max = new TypedMax($"a".expr)
+
+    // Always uses SortAggregateExec
+    val sparkPlan = df.select(Column(max.toAggregateExpression())).queryExecution.sparkPlan
+    assert(sparkPlan.isInstanceOf[SortAggregateExec])
+  }
+
+  test("dataframe aggregate with object aggregate buffer, no group by") {
+    val df = data.toDF("key", "value").coalesce(2)
+    val query = df.select(typedMax($"key"), count($"key"), typedMax($"value"), count($"value"))
+    val maxKey = data.map(_._1).max
+    val countKey = data.size
+    val maxValue = data.map(_._2).max
+    val countValue = data.size
+    val expected = Seq(Row(maxKey,
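The (truncated) suite above exercises the `createAggregationBuffer`/`update`/`merge`/`eval` contract by partially aggregating two halves of the data and merging the buffers. That contract can be sketched without any Spark dependencies; the `MaxAgg` object and its `Buffer` type here are hypothetical stand-ins for the suite's `TypedMax`, not Spark code:

```scala
// Hypothetical stand-in for the TypedMax aggregate in the quoted test:
// a mutable one-slot buffer tracking the running maximum.
object MaxAgg {
  type Buffer = Array[Long]

  def createAggregationBuffer(): Buffer = Array(Long.MinValue)

  // Fold one input value into the buffer in place.
  def update(buffer: Buffer, input: Int): Unit =
    if (input > buffer(0)) buffer(0) = input.toLong

  // Combine two partial buffers, as the test does with group1/group2.
  def merge(buffer: Buffer, other: Buffer): Unit =
    if (other(0) > buffer(0)) buffer(0) = other(0)

  def eval(buffer: Buffer): Long = buffer(0)
}

val data = (0 until 1000).map(_ => scala.util.Random.nextInt(10))

// Partial-aggregate each half, then merge -- mirroring the test's structure.
val (half1, half2) = data.splitAt(data.length / 2)
val buf1 = MaxAgg.createAggregationBuffer()
half1.foreach(MaxAgg.update(buf1, _))
val buf2 = MaxAgg.createAggregationBuffer()
half2.foreach(MaxAgg.update(buf2, _))

val mergeBuffer = MaxAgg.createAggregationBuffer()
MaxAgg.merge(mergeBuffer, buf1)
MaxAgg.merge(mergeBuffer, buf2)
assert(MaxAgg.eval(mergeBuffer) == data.max)
```

Merging the two partial buffers must give the same answer as aggregating the whole dataset in one pass, which is exactly the property the quoted test checks against `data.map(_._1).max`.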
[GitHub] spark pull request #14753: [SPARK-17187][SQL] Supports using arbitrary Java ...
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14753#discussion_r76263947

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala ---
@@ -389,3 +389,144 @@ abstract class DeclarativeAggregate
     def right: AttributeReference = inputAggBufferAttributes(aggBufferAttributes.indexOf(a))
   }
 }
+
+/**
+ * Aggregation function which allows **arbitrary** user-defined java object to be used as internal
+ * aggregation buffer object.
+ *
+ * {{{
+ *                aggregation buffer for normal aggregation function `avg`
+ *                    |
+ *                    v
+ *  +--------------+---------------+-----------------------------------+
+ *  |  sum1 (Long) | count1 (Long) | generic user-defined java objects |
+ *  +--------------+---------------+-----------------------------------+
+ *                                                   ^
+ *                                                   |
+ *    Aggregation buffer object for `TypedImperativeAggregate` aggregation function
+ * }}}
+ *
+ * Work flow (Partial mode aggregate at Mapper side, and Final mode aggregate at Reducer side):
+ *
+ * Stage 1: Partial aggregate at Mapper side:
+ *
+ * 1. The framework calls `createAggregationBuffer(): T` to create an empty internal aggregation
+ *    buffer object.
+ * 2. Upon each input row, the framework calls
+ *    `update(buffer: T, input: InternalRow): Unit` to update the aggregation buffer object T.
+ * 3. After processing all rows of current group (group by key), the framework will serialize
+ *    aggregation buffer object T to storage format (Array[Byte]) and persist the Array[Byte]
+ *    to disk if needed.
+ * 4. The framework moves on to next group, until all groups have been processed.
+ *
+ * Shuffling exchange data to Reducer tasks...
+ *
+ * Stage 2: Final mode aggregate at Reducer side:
+ *
+ * 1. The framework calls `createAggregationBuffer(): T` to create an empty internal aggregation
+ *    buffer object (type T) for merging.
+ * 2. For each aggregation output of Stage 1, The framework de-serializes the storage
+ *    format (Array[Byte]) and produces one input aggregation object (type T).
+ * 3. For each input aggregation object, the framework calls `merge(buffer: T, input: T): Unit`
+ *    to merge the input aggregation object into aggregation buffer object.
+ * 4. After processing all input aggregation objects of current group (group by key), the framework
+ *    calls method `eval(buffer: T)` to generate the final output for this group.
+ * 5. The framework moves on to next group, until all groups have been processed.
+ *
+ * NOTE: SQL with TypedImperativeAggregate functions is planned in sort based aggregation,
+ * instead of hash based aggregation, as TypedImperativeAggregate use BinaryType as aggregation
+ * buffer's storage format, which is not supported by hash based aggregation. Hash based
+ * aggregation only support aggregation buffer of mutable types (like LongType, IntType that have
+ * fixed length and can be mutated in place in UnsafeRow)
+ */
+abstract class TypedImperativeAggregate[T] extends ImperativeAggregate {

--- End diff --

`ImperativeAggregate` only defines the interface. It does not specify what are accepted buffer types, right?
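The two-stage flow described in the quoted doc comment (partial aggregate, serialize to `Array[Byte]` at the shuffle boundary, deserialize and merge on the reducer) can be sketched in plain Scala. The `MaxBuffer` class and the `serialize`/`deserialize` helpers below are hypothetical illustrations of that lifecycle, not Spark's actual implementation:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical user-defined aggregation buffer object (the `T` in the doc comment).
final case class MaxBuffer(var max: Long)

object MaxWorkflow {
  def createAggregationBuffer(): MaxBuffer = MaxBuffer(Long.MinValue)

  def update(buffer: MaxBuffer, input: Long): Unit =
    if (input > buffer.max) buffer.max = input

  def merge(buffer: MaxBuffer, other: MaxBuffer): Unit =
    if (other.max > buffer.max) buffer.max = other.max

  def eval(buffer: MaxBuffer): Long = buffer.max

  // Storage format at the shuffle boundary: Array[Byte], as the NOTE explains.
  def serialize(buffer: MaxBuffer): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    out.writeLong(buffer.max)
    out.flush()
    bos.toByteArray
  }

  def deserialize(bytes: Array[Byte]): MaxBuffer =
    MaxBuffer(new DataInputStream(new ByteArrayInputStream(bytes)).readLong())
}

// Stage 1 (mapper side): partial-aggregate each partition, then serialize.
val partitions = Seq(Seq(3L, 9L, 1L), Seq(7L, 2L, 8L))
val shuffled: Seq[Array[Byte]] = partitions.map { rows =>
  val buf = MaxWorkflow.createAggregationBuffer()
  rows.foreach(MaxWorkflow.update(buf, _))
  MaxWorkflow.serialize(buf)
}

// Stage 2 (reducer side): deserialize each partial result and merge.
val finalBuffer = MaxWorkflow.createAggregationBuffer()
shuffled.map(MaxWorkflow.deserialize).foreach(MaxWorkflow.merge(finalBuffer, _))
```

Because the buffer crosses the shuffle only in its serialized `Array[Byte]` form, the buffer object itself never needs to fit a fixed-width mutable column type, which is why the doc comment notes that such functions are planned as sort-based rather than hash-based aggregation.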
[GitHub] spark issue #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in TestingU...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14785 **[Test build #64424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64424/consoleFull)** for PR 14785 at commit [`1ec924c`](https://github.com/apache/spark/commit/1ec924cf8ca1bbe68fd5e700550dfa422e445b59).
[GitHub] spark issue #14808: [SPARK-17156][ML][EXAMPLE] Add multiclass logistic regre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14808 Merged build finished. Test PASSed.
[GitHub] spark issue #14808: [SPARK-17156][ML][EXAMPLE] Add multiclass logistic regre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14808 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64423/ Test PASSed.
[GitHub] spark issue #14808: [SPARK-17156][ML][EXAMPLE] Add multiclass logistic regre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14808 **[Test build #64423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64423/consoleFull)** for PR 14808 at commit [`ba5a4e2`](https://github.com/apache/spark/commit/ba5a4e2cc14e253ea3465d887c3cf5a2a9d82a80). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class Params(`