[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/9383#discussion_r43827568

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
```
@@ -762,15 +679,7 @@ class TungstenAggregationIterator(
   /**
    * Start processing input rows.
    */
-  testFallbackStartsAt match {
-    case None =>
-      processInputs()
-    case Some(fallbackStartsAt) =>
-      // This is the testing path. processInputsWithControlledFallback is same as processInputs
-      // except that it switches to sort-based aggregation after `fallbackStartsAt` input rows
-      // have been processed.
-      processInputsWithControlledFallback(fallbackStartsAt)
-  }
+  processInputs(testFallbackStartsAt.getOrElse(Int.MaxValue))
```
--- End diff --

Each record needs 30+ bytes, so a single task would need more than 60G of memory to trigger this spilling. I think that's fine.
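For context, the `getOrElse(Int.MaxValue)` above means the non-testing path now uses a fallback threshold of `Int.MaxValue` rows, and davies's 60G figure is just that row count times the per-record footprint. A back-of-the-envelope check:

```scala
// Rough arithmetic behind the "60G" claim above.
val records = Int.MaxValue.toLong         // 2147483647 rows before the fallback threshold
val bytes   = records * 30L               // 64424509410 bytes at 30 bytes/record
val gib     = bytes.toDouble / (1L << 30) // ≈ 60.0 GiB needed in a single task
```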
[GitHub] spark pull request: [WIP][SPARK-11217][ML] save/load for non-meta ...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/9454#discussion_r43827580

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
```
@@ -592,7 +592,7 @@ trait Params extends Identifiable with Serializable {
   /**
    * Sets a parameter in the embedded param map.
    */
-  protected final def set[T](param: Param[T], value: T): this.type = {
+  final def set[T](param: Param[T], value: T): this.type = {
```
--- End diff --

It is not feasible to set an arbitrary param outside the instance.
[GitHub] spark pull request: [WIP][SPARK-11217][ML] save/load for non-meta ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9454#issuecomment-153532152

Merged build triggered.
[GitHub] spark pull request: [WIP][SPARK-11217][ML] save/load for non-meta ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9454#issuecomment-153532178

Merged build started.
[GitHub] spark pull request: [WIP][SPARK-11217][ML] save/load for non-meta ...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/9454

[WIP][SPARK-11217][ML] save/load for non-meta estimators and transformers

This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:

* class name
* uid
* timestamp
* paramMap

The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.

~~~scala
instance.save.to("path")
instance.save.options("overwrite" -> "true").to("path")
instance.save.context(sqlContext).to("path")
Instance.load.from("path")
~~~

The param handling is different from the design doc: we didn't save default and user-set params separately, so when we load a model back, all params are treated as user-set. This does cause issues, but saving them separately would also cause issues if we later modify the default params.

TODOs:

* [ ] Java test
* [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers

cc @jkbradley

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-11217

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9454.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9454

commit cd1c7eae3246f93b6ee4e03adfe57fdf1386
Author: Xiangrui Meng
Date: 2015-11-03T18:56:22Z

    initial implementation

commit df81d61f73c6a854913df638770f0b0409f046a3
Author: Xiangrui Meng
Date: 2015-11-03T23:41:58Z

    update doc and test
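To make the metadata description above concrete, here is a hedged sketch of what such a record might look like when built with json4s (already a Spark dependency). The field names, the example `Binarizer` values, and the overall shape are illustrative assumptions, not the PR's actual on-disk format:

```scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Hypothetical metadata record: class name, uid, timestamp, and the paramMap
// with param values JSON-serialized, as the PR description says.
val metadata =
  ("class" -> "org.apache.spark.ml.feature.Binarizer") ~
  ("uid" -> "binarizer_abc123") ~
  ("timestamp" -> System.currentTimeMillis()) ~
  ("paramMap" -> ("threshold" -> 0.5))

println(compact(render(metadata)))
// e.g. {"class":"org.apache.spark.ml.feature.Binarizer","uid":"binarizer_abc123",
//       "timestamp":1446595200000,"paramMap":{"threshold":0.5}}
```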
[GitHub] spark pull request: [SPARK-11198][STREAMING][KINESIS] Support de-a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9403#issuecomment-153532104

**[Test build #44975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44975/consoleFull)** for PR 9403 at commit [`7c1edd7`](https://github.com/apache/spark/commit/7c1edd74a93a8336a714e0b3e143a15daadaafc0).
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-153531457

Merged build triggered.
[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/9344#discussion_r43827121

--- Diff: core/src/main/scala/org/apache/spark/memory/StorageMemoryPool.scala ---
```
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.memory
+
+import javax.annotation.concurrent.GuardedBy
+
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.{TaskContext, Logging}
+import org.apache.spark.storage.{MemoryStore, BlockStatus, BlockId}
+
+/**
+ * Performs bookkeeping for managing an adjustable-size pool of memory that is used for storage
+ * (caching).
+ *
+ * @param memoryManager a [[MemoryManager]] instance to synchronize on
+ */
+class StorageMemoryPool(memoryManager: Object) extends MemoryPool(memoryManager) with Logging {
+
+  @GuardedBy("memoryManager")
+  private[this] var _memoryUsed: Long = 0L
+
+  override def memoryUsed: Long = memoryManager.synchronized {
+    _memoryUsed
+  }
+
+  private var _memoryStore: MemoryStore = _
+  def memoryStore: MemoryStore = {
+    if (_memoryStore == null) {
+      throw new IllegalArgumentException("memory store not initialized yet")
```
--- End diff --

illegal state?
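What rxin is suggesting, sketched minimally: an uninitialized `_memoryStore` is an invalid *state* of the pool, not a bad argument from the caller, so `IllegalStateException` is the conventional choice:

```scala
// Minimal sketch of the suggested fix (same structure as the diff above).
private var _memoryStore: MemoryStore = _
def memoryStore: MemoryStore = {
  if (_memoryStore == null) {
    // The pool itself is in an invalid state; no caller-supplied argument is at fault.
    throw new IllegalStateException("memory store not initialized yet")
  }
  _memoryStore
}
```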
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8984#issuecomment-153531477

Merged build started.
[GitHub] spark pull request: [DOC] Missing link to R DataFrame API doc
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9394#issuecomment-153531026

**[Test build #44976 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44976/consoleFull)** for PR 9394 at commit [`12f3a74`](https://github.com/apache/spark/commit/12f3a742db8f9dde54b7f8e1dfa9e880da1894a4).
[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-153530818

**[Test build #44974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44974/consoleFull)** for PR 9451 at commit [`9a6d9dc`](https://github.com/apache/spark/commit/9a6d9dc1fa097dd015b4bf83d9002a3f3d19d8ec).

* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-153530822

Merged build finished. Test FAILed.
[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-153530824

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44974/ Test FAILed.
[GitHub] spark pull request: [SPARK-11198][STREAMING][KINESIS] Support de-a...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9403#issuecomment-153530716

Merged build triggered.
[GitHub] spark pull request: [DOC] Missing link to R DataFrame API doc
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9394#issuecomment-153530740

Merged build started.
[GitHub] spark pull request: [DOC] Missing link to R DataFrame API doc
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9394#issuecomment-153530721

Merged build triggered.
[GitHub] spark pull request: [SPARK-11198][STREAMING][KINESIS] Support de-a...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9403#issuecomment-153530739

Merged build started.
[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/9383#discussion_r43826631

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala ---
```
@@ -762,15 +679,7 @@ class TungstenAggregationIterator(
   /**
    * Start processing input rows.
    */
-  testFallbackStartsAt match {
-    case None =>
-      processInputs()
-    case Some(fallbackStartsAt) =>
-      // This is the testing path. processInputsWithControlledFallback is same as processInputs
-      // except that it switches to sort-based aggregation after `fallbackStartsAt` input rows
-      // have been processed.
-      processInputsWithControlledFallback(fallbackStartsAt)
-  }
+  processInputs(testFallbackStartsAt.getOrElse(Int.MaxValue))
```
--- End diff --

It's still possible that we basically have a lot of memory space to use, right?
[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-153530486

**[Test build #44974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44974/consoleFull)** for PR 9451 at commit [`9a6d9dc`](https://github.com/apache/spark/commit/9a6d9dc1fa097dd015b4bf83d9002a3f3d19d8ec).
[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/9344#discussion_r43826411

--- Diff: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java ---
```
@@ -127,23 +127,30 @@ public TaskMemoryManager(MemoryManager memoryManager, long taskAttemptId) {
    *
    * @return number of bytes successfully granted (<= N).
    */
-  public long acquireExecutionMemory(long required, MemoryConsumer consumer) {
+  public long acquireExecutionMemory(
+      long required,
+      MemoryMode mode,
+      MemoryConsumer consumer) {
     assert(required >= 0);
+    // If we are allocating Tungsten pages off-heap and receive a request to allocate on-heap
+    // memory here, then it may not make sense to spill since that would only end up freeing
+    // off-heap memory. This is subject to change, though, so it may be risky to make this
+    // optimization now in case we forget to undo it later when making changes.
     synchronized (this) {
-      long got = memoryManager.acquireExecutionMemory(required, taskAttemptId);
+      long got = memoryManager.acquireExecutionMemory(required, taskAttemptId, mode);

-      // try to release memory from other consumers first, then we can reduce the frequency of
+      // Try to release memory from other consumers first, then we can reduce the frequency of
       // spilling, avoid to have too many spilled files.
       if (got < required) {
         // Call spill() on other consumers to release memory
         for (MemoryConsumer c: consumers) {
-          if (c != null && c != consumer && c.getUsed() > 0) {
+          if (c != consumer && c.getMemoryUsed(mode) > 0) {
```
--- End diff --

Because we'd end up storing `null` into the consumers map when allocating memory that was not requested by a particular consumer.
[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/9344#discussion_r43826383

--- Diff: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java ---
```
@@ -161,10 +168,10 @@ public long acquireExecutionMemory(long required, MemoryConsumer consumer) {
       if (got < required && consumer != null) {
         try {
           long released = consumer.spill(required - got, consumer);
-          if (released > 0) {
+          if (released > 0 && mode == tungstenMemoryMode) {
             logger.info("Task {} released {} from itself ({})", taskAttemptId,
```
--- End diff --

debug level
[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/9344#discussion_r43826374

--- Diff: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java ---
```
@@ -127,23 +127,30 @@ public TaskMemoryManager(MemoryManager memoryManager, long taskAttemptId) {
    *
    * @return number of bytes successfully granted (<= N).
    */
-  public long acquireExecutionMemory(long required, MemoryConsumer consumer) {
+  public long acquireExecutionMemory(
+      long required,
+      MemoryMode mode,
+      MemoryConsumer consumer) {
     assert(required >= 0);
+    // If we are allocating Tungsten pages off-heap and receive a request to allocate on-heap
+    // memory here, then it may not make sense to spill since that would only end up freeing
+    // off-heap memory. This is subject to change, though, so it may be risky to make this
+    // optimization now in case we forget to undo it later when making changes.
     synchronized (this) {
-      long got = memoryManager.acquireExecutionMemory(required, taskAttemptId);
+      long got = memoryManager.acquireExecutionMemory(required, taskAttemptId, mode);

-      // try to release memory from other consumers first, then we can reduce the frequency of
+      // Try to release memory from other consumers first, then we can reduce the frequency of
       // spilling, avoid to have too many spilled files.
       if (got < required) {
         // Call spill() on other consumers to release memory
         for (MemoryConsumer c: consumers) {
-          if (c != null && c != consumer && c.getUsed() > 0) {
+          if (c != consumer && c.getMemoryUsed(mode) > 0) {
             try {
               long released = c.spill(required - got, consumer);
-              if (released > 0) {
+              if (released > 0 && mode == tungstenMemoryMode) {
                 logger.info("Task {} released {} from {} for {}", taskAttemptId,
```
--- End diff --

debug level?
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-153529973

@cloud-fan @dbtsai, Jenkins did not start the testing. Could you ask Jenkins to test it? Thank you!
[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/9344#discussion_r43826353

--- Diff: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java ---
```
@@ -127,23 +127,30 @@ public TaskMemoryManager(MemoryManager memoryManager, long taskAttemptId) {
    *
    * @return number of bytes successfully granted (<= N).
    */
-  public long acquireExecutionMemory(long required, MemoryConsumer consumer) {
+  public long acquireExecutionMemory(
+      long required,
+      MemoryMode mode,
+      MemoryConsumer consumer) {
     assert(required >= 0);
+    // If we are allocating Tungsten pages off-heap and receive a request to allocate on-heap
+    // memory here, then it may not make sense to spill since that would only end up freeing
+    // off-heap memory. This is subject to change, though, so it may be risky to make this
+    // optimization now in case we forget to undo it later when making changes.
     synchronized (this) {
-      long got = memoryManager.acquireExecutionMemory(required, taskAttemptId);
+      long got = memoryManager.acquireExecutionMemory(required, taskAttemptId, mode);

-      // try to release memory from other consumers first, then we can reduce the frequency of
+      // Try to release memory from other consumers first, then we can reduce the frequency of
       // spilling, avoid to have too many spilled files.
       if (got < required) {
         // Call spill() on other consumers to release memory
         for (MemoryConsumer c: consumers) {
-          if (c != null && c != consumer && c.getUsed() > 0) {
+          if (c != consumer && c.getMemoryUsed(mode) > 0) {
```
--- End diff --

how come we needed the null check before?
[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-153529807

Merged build triggered.
[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/9344#discussion_r43826213

--- Diff: core/src/main/java/org/apache/spark/memory/MemoryConsumer.java ---
```
@@ -74,26 +78,26 @@ public void spill() throws IOException {
   public abstract long spill(long size, MemoryConsumer trigger) throws IOException;

   /**
-   * Acquire `size` bytes memory.
+   * Acquire `size` bytes of on-heap execution memory.
    *
    * If there is not enough memory, throws OutOfMemoryError.
    */
   protected void acquireMemory(long size) {
-    long got = taskMemoryManager.acquireExecutionMemory(size, this);
+    long got = taskMemoryManager.acquireExecutionMemory(size, MemoryMode.ON_HEAP, this);
```
--- End diff --

it's a little confusing that this method only supports ON_HEAP. Maybe rename the method?
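A hypothetical Scala rendition of the rename rxin suggests, so the name rather than a hidden constant carries the ON_HEAP restriction. The method name, the `used` field, and the error message are assumptions for illustration, not the PR's actual code:

```scala
// Sketch only: same shape as the existing acquireMemory, but the name makes
// the on-heap restriction explicit at every call site.
protected def acquireOnHeapMemory(size: Long): Unit = {
  val got = taskMemoryManager.acquireExecutionMemory(size, MemoryMode.ON_HEAP, this)
  if (got < size) {
    // Mirrors the documented contract: not enough memory => OutOfMemoryError.
    throw new OutOfMemoryError(s"Could not acquire $size bytes of on-heap memory (got $got)")
  }
  used += size
}
```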
[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-153529828

Merged build started.
[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-153529631

Regarding the format of the title, we can do `[SPARK-x] [SQL] ...`
[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-153529582

How about we update the title to include the jira? Is https://issues.apache.org/jira/browse/SPARK-9372 the right one?
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9385#discussion_r43826123

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
```
@@ -1019,7 +1019,16 @@ class Analyzer(
  * scoping information for attributes and can be removed once analysis is complete.
  */
 object EliminateSubQueries extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
+    case Project(projectList, child: Subquery) => {
+      Project(
+        projectList.flatMap {
```
--- End diff --

Thank you! I did the change based on your suggestion. : )
[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-153529478

ok to test
[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9452#issuecomment-153529252

**[Test build #44973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44973/consoleFull)** for PR 9452 at commit [`8ce9696`](https://github.com/apache/spark/commit/8ce96964b90cea68eb9c6237a445284730bc2802).
[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9452#issuecomment-153529071

Merged build started.
[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9452#issuecomment-153529161

**[Test build #1974 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1974/consoleFull)** for PR 9452 at commit [`8ce9696`](https://github.com/apache/spark/commit/8ce96964b90cea68eb9c6237a445284730bc2802).
[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9452#issuecomment-153529047

Merged build triggered.
[GitHub] spark pull request: [SPARK-9858] [SQL] Add an ExchangeCoordinator ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9453#issuecomment-153528846

**[Test build #44970 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44970/consoleFull)** for PR 9453 at commit [`d77c921`](https://github.com/apache/spark/commit/d77c9216ee4e2dae1b07633341dbc9f90d32af8d).
[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153528665

**[Test build #44972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44972/consoleFull)** for PR 9383 at commit [`6f3bb15`](https://github.com/apache/spark/commit/6f3bb15b19cd326f677f15860cf215f57fd3671a).
[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9452#issuecomment-153528562

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44971/ Test FAILed.
[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9452#issuecomment-153528560

Build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9452#issuecomment-153528326

Build started.
[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153528306

Merged build triggered.
[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153528327

Merged build started.
[GitHub] spark pull request: [SPARK-9858] [SQL] Add an ExchangeCoordinator ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9453#issuecomment-153528320

Merged build started.
[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9452#issuecomment-153528303

Build triggered.
[GitHub] spark pull request: [SPARK-9858] [SQL] Add an ExchangeCoordinator ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9453#issuecomment-153528291

Merged build triggered.
[GitHub] spark pull request: [SPARK-10863][SPARKR] Method coltypes() to get...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/8984#discussion_r43825376

--- Diff: R/pkg/R/DataFrame.R ---
```
@@ -1914,3 +1914,34 @@ setMethod("attach",
           }
           attach(newEnv, pos = pos, name = name, warn.conflicts = warn.conflicts)
         })
+
+#' Returns the column types of a DataFrame.
+#'
+#' @name coltypes
+#' @title Get column types of a DataFrame
+#' @param x (DataFrame)
+#' @return value (character) A character vector with the column types of the given DataFrame
+#' @rdname coltypes
+setMethod("coltypes",
+          signature(x = "DataFrame"),
+          function(x) {
+            # Get the data types of the DataFrame by invoking dtypes() function
+            types <- lapply(dtypes(x), function(x) {x[[2]]})
+
+            # Map Spark data types into R's data types using DATA_TYPES environment
+            rTypes <- lapply(types, function(x) {
+              if (exists(x, envir=DATA_TYPES)) {
```
--- End diff --

nit: you could probably do
```
type <- DATA_TYPES[[x]]
if (is.null(type)) {
  stop("error")
}
```
[GitHub] spark pull request: [SPARK-9858] [SQL] Add an ExchangeCoordinator ...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/9453#issuecomment-153527809

cc @JoshRosen
[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153527858

test this please
[GitHub] spark pull request: [SPARK-9858] [SQL] Add an ExchangeCoordinator ...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/9453

[SPARK-9858] [SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins (follow-up)

https://issues.apache.org/jira/browse/SPARK-9858

This PR is the follow-up work of https://github.com/apache/spark/pull/9276. It addresses @JoshRosen's comments.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark numReducer-followUp

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9453.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9453

commit d77c9216ee4e2dae1b07633341dbc9f90d32af8d
Author: Yin Huai
Date: 2015-11-03T23:53:03Z

    Address Josh's comments.
[GitHub] spark pull request: [SPARK-11493] remove bitset from BytesToBytesM...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/9452

[SPARK-11493] remove bitset from BytesToBytesMap

Since we store a 4-byte record count at the beginning of each page, no record's address can be zero, so we do not need the bitset.

Performance-wise, the bitset could speed up a failed lookup when the slot is empty (the bitset is smaller than the longArray, so its cache hit rate is higher). In practice, though, the map is 35% - 70% full (say 50% on average), so only half of the failed lookups can benefit from it; all the others pay the cost of loading the bitset and still need to access the longArray anyway.

For aggregation, we always need to access the longArray (to insert a new key after a failed lookup), which a benchmark also confirmed. For broadcast hash join there could be a regression, but a simple benchmark (where most lookups fail) showed there may not be one:

```
sqlContext.range(1<<20).write.parquet("small")
df = sqlContext.read.parquet('small')
for i in range(3):
    t = time.time()
    df2 = sqlContext.range(1<<26)
    df2.join(df, df.id == df2.id).count()
    print time.time() - t
```

Having bitset (used time in seconds):
```
15.9311430454
11.19282794
9.98717904091
```

After removing bitset (used time in seconds):
```
14.6036689281
9.53816604614
8.95434498787
```

cc @rxin @nongli

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark remove_bitset

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9452.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9452

commit 6ea56eaf35762df5c04440a6005ccf3bf88dabde
Author: Davies Liu
Date: 2015-11-03T22:49:38Z

    remove bitset
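The invariant the first paragraph relies on can be sketched as follows; the names here are simplified assumptions, not the PR's exact code. `BytesToBytesMap` stores `(address, hash)` pairs in its `longArray`, and because every page begins with a 4-byte record count, no valid record encodes to address zero, so a zero slot unambiguously means "empty":

```scala
// Hypothetical sketch: with a 4-byte record count at offset 0 of every page,
// no record can live at page offset 0, so an encoded address of 0L can only
// mean "this slot was never written" -- a separate bitset becomes redundant.
def isSlotEmpty(longArray: Array[Long], pos: Int): Boolean =
  longArray(2 * pos) == 0L  // slot pos holds (address at 2*pos, hash at 2*pos + 1)
```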
[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153527297

Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153527299

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44959/ Test FAILed.
[GitHub] spark pull request: [SPARK-11425] [SPARK-11486] Improve Hybrid agg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9383#issuecomment-153527227

**[Test build #44959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44959/consoleFull)** for PR 9383 at commit [`6f3bb15`](https://github.com/apache/spark/commit/6f3bb15b19cd326f677f15860cf215f57fd3671a).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11410] [SQL] Add APIs to provide functi...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/9364#discussion_r43823876

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
```
@@ -997,4 +1001,116 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  /**
+   * Verifies that there is no Exchange between the Aggregations for `df`
+   */
+  private def verifyNonExchangingAgg(df: DataFrame) = {
+    var atFirstAgg: Boolean = false
+    df.queryExecution.executedPlan.foreach {
+      case agg: TungstenAggregate => {
+        atFirstAgg = !atFirstAgg
+      }
+      case _ => {
+        if (atFirstAgg) {
+          fail("Should not have operators between the two aggregations")
+        }
+      }
+    }
+  }
+
+  /**
+   * Verifies that there is an Exchange between the Aggregations for `df`
+   */
+  private def verifyExchangingAgg(df: DataFrame) = {
+    var atFirstAgg: Boolean = false
+    df.queryExecution.executedPlan.foreach {
+      case agg: TungstenAggregate => {
+        if (atFirstAgg) {
+          fail("Should not have back to back Aggregates")
+        }
+        atFirstAgg = true
+      }
+      case e: Exchange => atFirstAgg = false
+      case _ =>
+    }
+  }
+
+  test("distributeBy and localSort") {
+    val original = testData.repartition(1)
+    assert(original.rdd.partitions.length == 1)
+    val df = original.distributeBy(Column("key") :: Nil, 5)
+    assert(df.rdd.partitions.length == 5)
+    checkAnswer(original.select(), df.select())
+
+    val df2 = original.distributeBy(Column("key") :: Nil, 10)
+    assert(df2.rdd.partitions.length == 10)
+    checkAnswer(original.select(), df2.select())
+
+    // Group by the column we are distributed by. This should generate a plan with no exchange
+    // between the aggregates
+    val df3 = testData.distributeBy(Column("key") :: Nil).groupBy("key").count()
+    verifyNonExchangingAgg(df3)
+    verifyNonExchangingAgg(testData.distributeBy(Column("key") :: Column("value") :: Nil)
+      .groupBy("key", "value").count())
+
+    // Grouping by just the first distributeBy expr, need to exchange.
+    verifyExchangingAgg(testData.distributeBy(Column("key") :: Column("value") :: Nil)
+      .groupBy("key").count())
+
+    val data = sqlContext.sparkContext.parallelize(
+      (1 to 100).map(i => TestData2(i % 10, i))).toDF()
+
+    // Distribute and order by.
+    val df4 = data.distributeBy(Column("a") :: Nil).localSort($"b".desc)
+    // Walk each partition and verify that it is sorted descending and does not contain all
+    // the values.
+    df4.rdd.foreachPartition(p => {
```
--- End diff --

for future reference, we usually do

```scala
df4.rdd.foreachPartition { p =>
  ...
}
```

rather than `(p => {`
[GitHub] spark pull request: [SPARK-8467][MLlib][PySpark] Add LDAModel.desc...
Github user yu-iskw commented on a diff in the pull request: https://github.com/apache/spark/pull/8643#discussion_r43823836

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/LDAModelWrapper.scala ---
```
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.mllib.api.python
+
+import scala.collection.JavaConverters
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.clustering.LDAModel
+import org.apache.spark.mllib.linalg.Matrix
+
+/**
+ * Wrapper around LDAModel to provide helper methods in Python
+ */
+private[python] class LDAModelWrapper(model: LDAModel) {
+
+  def topicsMatrix(): Matrix = model.topicsMatrix
+
+  def vocabSize(): Int = model.vocabSize
+
+  def describeTopics(): java.util.List[Array[Any]] = describeTopics(this.model.vocabSize)
+
+  def describeTopics(maxTermsPerTopic: Int): java.util.List[Array[Any]] = {
+
+    val seq = model.describeTopics(maxTermsPerTopic).map { case (terms, termWeights) =>
+      Array.empty[Any] ++ terms ++ termWeights
+    }.toSeq
+    JavaConverters.seqAsJavaListConverter(seq).asJava
```
--- End diff --

@davies could you give me a little bit of help? I tried to serialize the entire list after converting it into `Array[Any](...)`. Then, when deserializing it in Python, something went wrong with pickle: `TypeError: must be a unicode character, not bytes`.

* My patch: https://github.com/yu-iskw/spark/compare/SPARK-8467-2...yu-iskw:SPARK-8467-2.trial
* Unit testing errors: https://gist.github.com/yu-iskw/60e0db67b1e222fc7fd4
[GitHub] spark pull request: Fix typo in WebUI
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/9444
[GitHub] spark pull request: Fix typo in WebUI
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/9444#issuecomment-153522488 @jaceklaskowski I've merged this -- but it might be better in the future to batch a bunch of typo fixes together, since you are so great at finding them.
[GitHub] spark pull request: [SPARK-7316] [MLlib] RDD sliding window with s...
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5855#discussion_r43822533

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala ---
@@ -66,36 +69,54 @@ class SlidingRDD[T: ClassTag](@transient val parent: RDD[T], val windowSize: Int
     if (n == 0) {
       Array.empty
     } else if (n == 1) {
-      Array(new SlidingRDDPartition[T](0, parentPartitions(0), Seq.empty))
+      Array(new SlidingRDDPartition[T](0, parentPartitions(0), Seq.empty, 0))
     } else {
       val n1 = n - 1
-      val w1 = windowSize - 1
-      // Get the first w1 items of each partition, starting from the second partition.
-      val nextHeads =
-        parent.context.runJob(parent, (iter: Iterator[T]) => iter.take(w1).toArray, 1 until n)
+      // Get partitions sizes
+      val sizes =
+        parent.context.runJob(parent, (iter: Iterator[T]) => iter.length, 0 until n)
--- End diff --

If `step > 1`, in a single Spark job, we can collect the following:

1. the first `w1` elements from each partition (except the first one)
2. the size of each partition

Then we can compute the offset for each partition.
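A one-pass sketch of that combined job, assuming the surrounding `SlidingRDD` context (`parent`, `w1`, and `n` as in the diff above, with `T: ClassTag` in scope):

```scala
// Single job: for every partition, return both its first w1 elements and
// its total size, in one pass and without materializing the whole partition.
val headsAndSizes: Array[(Array[T], Int)] =
  parent.context.runJob(parent, (iter: Iterator[T]) => {
    val head = scala.collection.mutable.ArrayBuffer.empty[T]
    var count = 0
    while (iter.hasNext) {
      val x = iter.next()
      if (count < w1) head += x
      count += 1
    }
    (head.toArray, count)
  }, 0 until n)

val nextHeads = headsAndSizes.drop(1).map(_._1) // heads are only needed from partition 1 onward
val sizes = headsAndSizes.map(_._2)             // sizes give the per-partition offsets
```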
[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9449#issuecomment-153521166 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9449#issuecomment-153521119

**[Test build #44968 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44968/consoleFull)** for PR 9449 at commit [`1f64c07`](https://github.com/apache/spark/commit/1f64c07b7303389cd21b47c51fecc7bb2e31f7e9).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9449#issuecomment-153521169 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44968/ Test FAILed.
[GitHub] spark pull request: [SPARK-7316] [MLlib] RDD sliding window with s...
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5855#discussion_r43822024

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala ---
@@ -66,36 +69,54 @@ class SlidingRDD[T: ClassTag](@transient val parent: RDD[T], val windowSize: Int
     if (n == 0) {
       Array.empty
     } else if (n == 1) {
-      Array(new SlidingRDDPartition[T](0, parentPartitions(0), Seq.empty))
+      Array(new SlidingRDDPartition[T](0, parentPartitions(0), Seq.empty, 0))
    } else {
       val n1 = n - 1
-      val w1 = windowSize - 1
-      // Get the first w1 items of each partition, starting from the second partition.
-      val nextHeads =
-        parent.context.runJob(parent, (iter: Iterator[T]) => iter.take(w1).toArray, 1 until n)
+      // Get partitions sizes
--- End diff --

Keep the original implementation if `step == 1` to save one job.
[GitHub] spark pull request: [SPARK-11477][SQL] support create Dataset from...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/9434
[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9451#issuecomment-153519356 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-11477][SQL] support create Dataset from...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/9434#issuecomment-153518772 Thanks, merging to master.
[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...
GitHub user vidma opened a pull request:

https://github.com/apache/spark/pull/9451

WIP: Optimize Inner joins with skewed null values

Draft of a first step in optimizing skew in joins. Skewed data is quite common, and lots of `nulls` on either side of a join is quite common (for us), especially with left joins, say when joining `dimension` to `fact` tables. Feel free to propose a better approach / add commits.

Any ideas for an easy way to check if the rule was already applied? After adding an `isNotNull` filter, `someAttribute.nullable` still returns `true`. I couldn't come up with anything better than simply running a separate batch of one iteration.

@marmbrus (as discussed at Spark Summit EU)

More seriously, a draft for fighting skew in left joins is [rather simple with DataFrames](https://gist.github.com/vidma/98332db0f82e7e5b09e5): it solves the null skew and doesn't seem to add much overhead (though it was tried only on a subset of all our joins, which used another abstraction of ours). However, this so far seems harder to express in optimizer rules:

- we need to add "fake" columns, and I have no idea yet how to do this in a way that lets us refer to the added column in join conditions:

```scala
val leftNullsSprayValue = CaseWhen(Seq(
  nullableJoinKeys(left).map(IsNull).reduceLeft(Or), // if any join keys are null
  Cast(Multiply(new Rand(), Literal(10)), IntegerType),
  Literal(0) // otherwise
))

// but how to add this column to the left & right relations?
// e.g. this fails, saying it's not `resolved`
Alias(leftNullsSprayValue, "leftNullsSprayKey")()
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vidma/spark feature/fight-skew-in-inner-join

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9451.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9451

commit 214deeae2d4c634536df0d9bd6c2ffcfc573ce7b
Author: vidmantas zemleris
Date: 2015-11-03T22:38:08Z

    Optimize Inner joins with skewed null values
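The linked gist is not reproduced here, but the inner-join case this patch targets can be sketched at the DataFrame level (helper name and `key` column are hypothetical): rows whose join key is null can never match in an inner join, so filtering them out up front, which is what the rule's added `isNotNull` filters amount to, removes the skewed null partition entirely.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: drop never-matching null-keyed rows before an inner join.
def innerJoinSkippingNullKeys(left: DataFrame, right: DataFrame, key: String): DataFrame =
  left.where(col(key).isNotNull).join(right.where(col(key).isNotNull), key)
```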
[GitHub] spark pull request: [SPARK-11491] Update build to use Scala 2.10.5
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9450#issuecomment-153518606 **[Test build #44969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44969/consoleFull)** for PR 9450 at commit [`6ffe0b0`](https://github.com/apache/spark/commit/6ffe0b0540007a96a75d48a75303aad0b45fc9b0).
[GitHub] spark pull request: [SPARK-11491] Update build to use Scala 2.10.5
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9450#issuecomment-153518453 Merged build started.
[GitHub] spark pull request: [SPARK-11491] Update build to use Scala 2.10.5
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9450#issuecomment-153518428 Merged build triggered.
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43821203

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala ---
@@ -37,3 +35,39 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct and Array to contain a collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def forBoolean: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = true)
--- End diff --

@rxin @mateiz thoughts on naming here? `forInt`, `int`, or `INT`? `forTuple`, `tuple`, or `tuple2`?
[GitHub] spark pull request: [SPARK-11491] Update build to use Scala 2.10.5
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/9450#issuecomment-153518050 /cc @srowen @pwendell
[GitHub] spark pull request: [SPARK-11491] Update build to use Scala 2.10.5
GitHub user JoshRosen opened a pull request:

https://github.com/apache/spark/pull/9450

[SPARK-11491] Update build to use Scala 2.10.5

Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark upgrade-to-scala-2.10.5

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9450.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9450

commit 6ffe0b0540007a96a75d48a75303aad0b45fc9b0
Author: Josh Rosen
Date: 2015-11-03T23:08:57Z

    Update build to use Scala 2.10.5.
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/9358#issuecomment-153517822 It would be really great to also try to create a test suite that uses Java 8 lambdas (though we may need to pull this into a separate PR, as I'm not sure how many build changes we will need).
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820879

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala ---
@@ -37,3 +35,39 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct and Array to contain a collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def forBoolean: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = true)
+  def forByte: Encoder[java.lang.Byte] = ExpressionEncoder(flat = true)
+  def forShort: Encoder[java.lang.Short] = ExpressionEncoder(flat = true)
+  def forInt: Encoder[java.lang.Integer] = ExpressionEncoder(flat = true)
+  def forLong: Encoder[java.lang.Long] = ExpressionEncoder(flat = true)
+  def forFloat: Encoder[java.lang.Float] = ExpressionEncoder(flat = true)
+  def forDouble: Encoder[java.lang.Double] = ExpressionEncoder(flat = true)
+  def forString: Encoder[java.lang.String] = ExpressionEncoder(flat = true)
+
+  def typeTagOfTuple2[T1 : TypeTag, T2 : TypeTag]: TypeTag[(T1, T2)] = typeTag[(T1, T2)]
--- End diff --

`private`
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820849

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala ---
@@ -37,3 +35,39 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct and Array to contain a collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def forBoolean: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = true)
+  def forByte: Encoder[java.lang.Byte] = ExpressionEncoder(flat = true)
+  def forShort: Encoder[java.lang.Short] = ExpressionEncoder(flat = true)
+  def forInt: Encoder[java.lang.Integer] = ExpressionEncoder(flat = true)
+  def forLong: Encoder[java.lang.Long] = ExpressionEncoder(flat = true)
+  def forFloat: Encoder[java.lang.Float] = ExpressionEncoder(flat = true)
+  def forDouble: Encoder[java.lang.Double] = ExpressionEncoder(flat = true)
+  def forString: Encoder[java.lang.String] = ExpressionEncoder(flat = true)
+
+  def typeTagOfTuple2[T1 : TypeTag, T2 : TypeTag]: TypeTag[(T1, T2)] = typeTag[(T1, T2)]
+
+  private def getTypeTag[T](c: Class[T]): TypeTag[T] = {
+    import scala.reflect.api
+
+    // val mirror = runtimeMirror(c.getClassLoader)
+    val mirror = rootMirror
+    val sym = mirror.staticClass(c.getName)
+    val tpe = sym.selfType
+    TypeTag(mirror, new api.TypeCreator {
+      def apply[U <: api.Universe with Singleton](m: api.Mirror[U]) =
+        if (m eq mirror) tpe.asInstanceOf[U # Type]
+        else throw new IllegalArgumentException(
+          s"Type tag defined in $mirror cannot be migrated to other mirrors.")
+    })
+  }
+
+  def forTuple2[T1, T2](c1: Class[T1], c2: Class[T2]): Encoder[(T1, T2)] = {
--- End diff --

How about just `forTuple`? The type of tuple returned is obvious from the number of arguments. We should also add overloads at least up to Tuple5.
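A hedged sketch of the suggested shape, as members of `object Encoder`; the bodies are elided with `???` since only the overload-by-arity naming is the point:

```scala
def forTuple[T1, T2](c1: Class[T1], c2: Class[T2]): Encoder[(T1, T2)] = ???
def forTuple[T1, T2, T3](c1: Class[T1], c2: Class[T2], c3: Class[T3]): Encoder[(T1, T2, T3)] = ???
def forTuple[T1, T2, T3, T4](
    c1: Class[T1], c2: Class[T2], c3: Class[T3], c4: Class[T4]): Encoder[(T1, T2, T3, T4)] = ???
def forTuple[T1, T2, T3, T4, T5](
    c1: Class[T1], c2: Class[T2], c3: Class[T3], c4: Class[T4],
    c5: Class[T5]): Encoder[(T1, T2, T3, T4, T5)] = ???
```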
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820524

--- Diff: sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java ---
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package test.org.apache.spark.sql;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.sql.catalyst.encoders.Encoder;
+import scala.Tuple2;
+
+import org.apache.spark.api.java.function.Function;
+import org.junit.*;
+
+import org.apache.spark.SparkContext;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.catalyst.encoders.Encoder$;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.test.TestSQLContext;
+
+public class JavaDatasetSuite implements Serializable {
+  private transient JavaSparkContext jsc;
+  private transient TestSQLContext context;
+
+  @Before
+  public void setUp() {
+    // Trigger static initializer of TestData
+    SparkContext sc = new SparkContext("local[*]", "testing");
+    jsc = new JavaSparkContext(sc);
+    context = new TestSQLContext(sc);
+    context.loadTestData();
+  }
+
+  @After
+  public void tearDown() {
+    context.sparkContext().stop();
+    context = null;
+    jsc = null;
+  }
+
+  @Test
+  public void testCommonOperation() {
+    List<String> data = Arrays.asList("hello", "world");
+    Dataset<String> ds = context.createDataset(data, Encoder$.MODULE$.forString());
--- End diff --

Yeah, Scala should create static methods so that you can just call `Encoder.forString()`.
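A minimal illustration of why that works (standalone example, not Spark code): a `def` on a top-level Scala `object` compiles to a static forwarder on the class of the same name, so Java callers get a plain static method in addition to the `MODULE$` form.

```scala
object Util {
  def greet(): String = "hello" // Java sees both Util.greet() and Util$.MODULE$.greet()
}
```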
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820397

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -441,6 +537,17 @@ class Dataset[T] private(
   /** Collects the elements to an Array. */
   def collect(): Array[T] = rdd.collect()
 
+  /**
+   * (Java-specific)
+   * Collects the elements to a Java list.
+   *
+   * Due to the incompatibility problem between Scala and Java, the return type of [[collect()]] at
--- End diff --

This just means that the RDD has the wrong ClassTag. We need to find a way to pass the ClassTag from the encoder before calling collect.
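A hedged sketch of that direction; `boundEncoder` and `fromRow` are assumed names for the Dataset's decoding machinery, while `clsTag` is the member the `Encoder` trait really declares:

```scala
import scala.reflect.ClassTag

// Inside Dataset[T]: thread the encoder's ClassTag into the element
// conversion so the resulting RDD[T] (and thus collect()) allocates a
// properly typed Array[T] instead of an Array[Object].
implicit val tTag: ClassTag[T] = boundEncoder.clsTag
val typedRdd = queryExecution.toRdd.map(row => boundEncoder.fromRow(row)) // fromRow: assumed helper
```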
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820335

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -499,6 +499,10 @@ class SQLContext private[sql](
     new Dataset[T](this, plan)
   }
 
+  def createDataset[T : Encoder](data: java.util.List[T]): Dataset[T] = {
--- End diff --

Oh, you already did it :)
[GitHub] spark pull request: [SPARK-11477][SQL] support create Dataset from...
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9434#discussion_r43820137

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -499,6 +499,15 @@ class SQLContext private[sql](
     new Dataset[T](this, plan)
   }
 
+  def createDataset[T : Encoder](data: RDD[T]): Dataset[T] = {
+    val enc = encoderFor[T]
+    val attributes = enc.schema.toAttributes
+    val encoded = data.map(d => enc.toRow(d))
--- End diff --

No, it's the responsibility of downstream operators to copy as needed.
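A minimal illustration of that convention (hypothetical buffering step, not this PR's code): any operator that holds on to incoming rows across iterator calls must copy them first, because the upstream operator may keep reusing a single mutable row object.

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Materializing an iterator of (possibly reused) InternalRows safely.
def bufferRows(iter: Iterator[InternalRow]): Array[InternalRow] =
  iter.map(_.copy()).toArray
```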
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/9358#discussion_r43820035

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -499,6 +499,10 @@ class SQLContext private[sql](
     new Dataset[T](this, plan)
   }
 
+  def createDataset[T : Encoder](data: java.util.List[T]): Dataset[T] = {
--- End diff --

Yeah, let's do that in another PR please.
[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9449#issuecomment-153515606 **[Test build #44968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44968/consoleFull)** for PR 9449 at commit [`1f64c07`](https://github.com/apache/spark/commit/1f64c07b7303389cd21b47c51fecc7bb2e31f7e9).
[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9449#issuecomment-153515293 Merged build triggered.
[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9449#issuecomment-153515313 Merged build started.
[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/9449#issuecomment-153514954 Note that this also includes the changes from #9446. Look at the last commit for the diff.
[GitHub] spark pull request: [SPARK-11490][SQL] variance should alias var_s...
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/9449

[SPARK-11490][SQL] variance should alias var_samp instead of var_pop.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-11490

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9449.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9449

commit 856942f9d0677d381dfe18d8777fa6f0a2e858c8
Author: Reynold Xin
Date: 2015-11-03T21:53:18Z

    [SPARK-11489][SQL] Only include common first order statistics in GroupedData

commit 132ea9a1769bbaf1d0e0c662d26a947ef34dd73f
Author: Reynold Xin
Date: 2015-11-03T22:05:06Z

    Fix test

commit 1f64c07b7303389cd21b47c51fecc7bb2e31f7e9
Author: Reynold Xin
Date: 2015-11-03T22:54:23Z

    [SPARK-11490][SQL] variance should alias var_samp instead of var_pop.
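A quick numeric reminder of the difference being fixed (values chosen only for illustration): `var_pop` divides the summed squared deviations by `n`, while `var_samp`, the conventional meaning of `variance`, divides by `n - 1`.

```scala
val xs = Seq(1.0, 2.0, 3.0, 4.0)
val mean = xs.sum / xs.length                     // 2.5
val ssd = xs.map(x => math.pow(x - mean, 2)).sum  // 5.0
val varPop = ssd / xs.length                      // 1.25
val varSamp = ssd / (xs.length - 1)               // ~1.667 -- what variance should return
```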
[GitHub] spark pull request: SPARK-4921. TaskSetManager.dequeueTask returns...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3816#issuecomment-153514540 **[Test build #44967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44967/consoleFull)** for PR 3816 at commit [`247ce55`](https://github.com/apache/spark/commit/247ce5587b8a06fe586a2730f1ad2df4ab7f79dc).
[GitHub] spark pull request: SPARK-4921. TaskSetManager.dequeueTask returns...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3816#issuecomment-153513337 Merged build started.
[GitHub] spark pull request: SPARK-4921. TaskSetManager.dequeueTask returns...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3816#issuecomment-153513322 Merged build triggered.
[GitHub] spark pull request: [SPARK-10429] [SQL] make mutableProjection ato...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/9422#issuecomment-153513069 Ah, sorry, I just saw this message. Since the changed behavior is easier to reason about, let's keep it.
[GitHub] spark pull request: SPARK-4921. TaskSetManager.dequeueTask returns...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/3816#issuecomment-153513026 I think this is the right fix. However, more pairs of eyes on this change would be better.
[GitHub] spark pull request: [Spark-11478] [ML] ML StringIndexer return inc...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/9440#issuecomment-153512622 cc @yanboliang
[GitHub] spark pull request: SPARK-4921. TaskSetManager.dequeueTask returns...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/3816#issuecomment-153512508 retest this please
[GitHub] spark pull request: [SPARK-11378][STREAMING] make StreamingContext...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/9336#issuecomment-153512190 LGTM
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/9385#discussion_r43817697

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1019,7 +1019,16 @@ class Analyzer(
  * scoping information for attributes and can be removed once analysis is complete.
  */
 object EliminateSubQueries extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
--- End diff --

Thank you for your comments! If we do `transformUp`, the subquery will be removed first, and then the `Project(projectList, child: Subquery)` pattern will no longer apply in this case.
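A sketch of the ordering concern (the rewrite body is only illustrative): under `transformDown` the rule visits the `Project` while its `Subquery` child is still present, whereas under `transformUp` the `Subquery` has already been replaced by its child before the `Project` is visited, so a pattern like the one below would never match.

```scala
plan transformDown {
  case Project(projectList, Subquery(_, child)) =>
    Project(projectList, child) // hypothetical simplification for illustration
}
```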
[GitHub] spark pull request: [SPARK-11359][STREAMING][KINESIS] Checkpoint t...
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/9421#discussion_r43817494

--- Diff: extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisCheckpointState.scala ---
@@ -16,39 +16,77 @@
  */
 package org.apache.spark.streaming.kinesis
 
+import java.util.concurrent._
+
+import scala.util.control.NonFatal
+
+import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer
+import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownReason
+
 import org.apache.spark.Logging
 import org.apache.spark.streaming.Duration
-import org.apache.spark.util.{Clock, ManualClock, SystemClock}
 
 /**
- * This is a helper class for managing checkpoint clocks.
+ * This is a helper class for managing Kinesis checkpointing.
  *
- * @param checkpointInterval
- * @param currentClock. Default to current SystemClock if none is passed in (mocking purposes)
+ * @param receiver The receiver that keeps track of which sequence numbers we can checkpoint
+ * @param checkpointInterval How frequently we will checkpoint to DynamoDB
+ * @param workerId Worker Id of KCL worker for logging purposes
+ * @param shardId The shard this worker was consuming data from
  */
-private[kinesis] class KinesisCheckpointState(
+private[kinesis] class KinesisCheckpointState[T](
+    receiver: KinesisReceiver[T],
     checkpointInterval: Duration,
-    currentClock: Clock = new SystemClock())
-  extends Logging {
+    workerId: String,
+    shardId: String) extends Logging {
 
-  /* Initialize the checkpoint clock using the given currentClock + checkpointInterval millis */
-  val checkpointClock = new ManualClock()
-  checkpointClock.setTime(currentClock.getTimeMillis() + checkpointInterval.milliseconds)
+  private var _checkpointer: Option[IRecordProcessorCheckpointer] = None
--- End diff --

This should be `volatile`.
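The suggested one-line fix, as the field would appear in `KinesisCheckpointState`: `volatile` publication makes the write from the KCL record-processor thread visible to the separate checkpointing thread.

```scala
@volatile private var _checkpointer: Option[IRecordProcessorCheckpointer] = None
```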
[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9216#issuecomment-153511090 @JoshRosen @cloud-fan I submitted a pull request for JIRA SPARK-11275: https://github.com/apache/spark/pull/9419. Hopefully, once that problem is fixed, this one can pass all the tests. Thanks!
[GitHub] spark pull request: [SPARK-11359][STREAMING][KINESIS] Checkpoint t...
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/9421#discussion_r43817473

--- Diff: extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisCheckpointState.scala ---
@@ -16,39 +16,77 @@
  */
 package org.apache.spark.streaming.kinesis
 
+import java.util.concurrent._
+
+import scala.util.control.NonFatal
+
+import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer
+import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownReason
+
 import org.apache.spark.Logging
 import org.apache.spark.streaming.Duration
-import org.apache.spark.util.{Clock, ManualClock, SystemClock}
 
 /**
- * This is a helper class for managing checkpoint clocks.
+ * This is a helper class for managing Kinesis checkpointing.
  *
- * @param checkpointInterval
- * @param currentClock. Default to current SystemClock if none is passed in (mocking purposes)
+ * @param receiver The receiver that keeps track of which sequence numbers we can checkpoint
+ * @param checkpointInterval How frequently we will checkpoint to DynamoDB
+ * @param workerId Worker Id of KCL worker for logging purposes
+ * @param shardId The shard this worker was consuming data from
  */
-private[kinesis] class KinesisCheckpointState(
+private[kinesis] class KinesisCheckpointState[T](
+    receiver: KinesisReceiver[T],
     checkpointInterval: Duration,
-    currentClock: Clock = new SystemClock())
-  extends Logging {
+    workerId: String,
+    shardId: String) extends Logging {
 
-  /* Initialize the checkpoint clock using the given currentClock + checkpointInterval millis */
-  val checkpointClock = new ManualClock()
-  checkpointClock.setTime(currentClock.getTimeMillis() + checkpointInterval.milliseconds)
+  private var _checkpointer: Option[IRecordProcessorCheckpointer] = None
 
-  /**
-   * Check if it's time to checkpoint based on the current time and the derived time
-   * for the next checkpoint
-   *
-   * @return true if it's time to checkpoint
-   */
-  def shouldCheckpoint(): Boolean = {
-    new SystemClock().getTimeMillis() > checkpointClock.getTimeMillis()
+  private val checkpointerThread = startCheckpointerThread()
+
+  /** Update the checkpointer instance to the most recent one. */
+  def setCheckpointer(checkpointer: IRecordProcessorCheckpointer): Unit = {
+    _checkpointer = Option(checkpointer)
+  }
+
+  /** Perform the checkpoint */
+  private def checkpoint(checkpointer: Option[IRecordProcessorCheckpointer]): Unit = {
+    // if this method throws an exception, then the scheduled task will not run again
+    try {
+      checkpointer.foreach { cp =>
--- End diff --

What will happen if we use an old `IRecordProcessorCheckpointer` to checkpoint?
[GitHub] spark pull request: [SPARK-11389][CORE] Add support for off-heap m...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9344#issuecomment-153509965 **[Test build #44965 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44965/consoleFull)** for PR 9344 at commit [`96705b8`](https://github.com/apache/spark/commit/96705b881caa82b87874d5145c16ce232c4a64a4).
[GitHub] spark pull request: [SPARK-11198][STREAMING][KINESIS] Support de-a...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9403#issuecomment-153509530 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44956/ Test FAILed.
[GitHub] spark pull request: [SPARK-11198][STREAMING][KINESIS] Support de-a...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9403#issuecomment-153509526 Merged build finished. Test FAILed.