[GitHub] spark issue #15481: [SPARK-17929] [CORE] Fix deadlock when CoarseGrainedSche...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15481 LGTM now.
[GitHub] spark issue #15481: [SPARK-17929] [CORE] Fix deadlock when CoarseGrainedSche...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15481 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67322/
[GitHub] spark issue #15481: [SPARK-17929] [CORE] Fix deadlock when CoarseGrainedSche...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15481 Merged build finished. Test PASSed.
[GitHub] spark issue #15481: [SPARK-17929] [CORE] Fix deadlock when CoarseGrainedSche...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15481 **[Test build #67322 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67322/consoleFull)** for PR 15481 at commit [`7bf3bf8`](https://github.com/apache/spark/commit/7bf3bf8606f261e2eb1bebd3c6e3c3ff8600e140).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15526: [SPARK-17986] [ML] SQLTransformer should remove temporar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15526 **[Test build #67327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67327/consoleFull)** for PR 15526 at commit [`d5c3b41`](https://github.com/apache/spark/commit/d5c3b419942f1d3b9af265b540a9404d3e8295df).
[GitHub] spark issue #15526: [SPARK-17986] [ML] SQLTransformer should remove temporar...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/15526 Jenkins add to whitelist
[GitHub] spark issue #13891: [SPARK-6685][MLLIB]Use DSYRK to compute AtA in ALS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13891 Merged build finished. Test PASSed.
[GitHub] spark issue #13891: [SPARK-6685][MLLIB]Use DSYRK to compute AtA in ALS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13891 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67323/
[GitHub] spark issue #13891: [SPARK-6685][MLLIB]Use DSYRK to compute AtA in ALS
Github user hqzizania commented on the issue: https://github.com/apache/spark/pull/13891 @mengxr I see. I will add a param for it. :)
[GitHub] spark issue #13891: [SPARK-6685][MLLIB]Use DSYRK to compute AtA in ALS
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13891 **[Test build #67323 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67323/consoleFull)** for PR 13891 at commit [`dc4f4ba`](https://github.com/apache/spark/commit/dc4f4badba26635aa95c6ade5a589d4bd50ae886).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13891: [SPARK-6685][MLLIB]Use DSYRK to compute AtA in ALS
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/13891 @hqzizania Thanks for the performance tests! This matches my guess. I'm not sure how often people use a rank greater than 1000 or even 250. But I think it is good to use BLAS level-3 routines. We can make the threshold a param and set a small threshold and test both code paths.
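For readers following along, a minimal sketch of the threshold idea under discussion, using the netlib-java bindings Spark already imports in ALS; `rankThreshold` is the hypothetical param proposed above, and the packed-vs-dense buffer handling is simplified relative to the real `NormalEquation`:

```scala
import com.github.fommil.netlib.BLAS.{getInstance => blas}

// Sketch only: accumulate A^T * A either one row at a time with the BLAS
// level-2 routine DSPR, or over a block of stacked rows with the level-3
// routine DSYRK once the rank crosses a configurable threshold.
class AtASketch(rank: Int, rankThreshold: Int = 1024) {
  // Upper triangle of A^T * A in packed storage, updated by DSPR.
  val packed = new Array[Double](rank * (rank + 1) / 2)
  // Full rank x rank buffer for DSYRK, which writes dense output.
  val dense = new Array[Double](rank * rank)

  def useSyrk: Boolean = rank > rankThreshold

  /** Level-2 path: ata += row * row^T, one rank-1 update per row. */
  def addRow(row: Array[Double]): Unit =
    blas.dspr("U", rank, 1.0, row, 1, packed)

  /** Level-3 path: ata += stacked^T * stacked, where `stacked` is a
   *  numRows x rank matrix stored column-major. */
  def addBlock(stacked: Array[Double], numRows: Int): Unit =
    blas.dsyrk("U", "T", rank, numRows, 1.0, stacked, numRows, 1.0, dense, rank)
}
```

Making `rankThreshold` small, as suggested, lets a single test exercise both `addRow` and `addBlock` code paths without needing a rank above 1024.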
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test PASSed.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67320/
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67320/consoleFull)** for PR 15575 at commit [`cfcd2a7`](https://github.com/apache/spark/commit/cfcd2a79231ed16c2a3e82551e87a6de888c2af6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15539: [SPARK-17994] [SQL] Add back a file status cache ...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15539#discussion_r84424330
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala ---
@@ -294,7 +308,7 @@ object PartitioningAwareFileCatalog extends Logging {
   private def listLeafFilesInParallel(
       paths: Seq[Path],
       hadoopConf: Configuration,
-      sparkSession: SparkSession): Seq[FileStatus] = {
+      sparkSession: SparkSession): Map[Path, Seq[FileStatus]] = {
--- End diff --
Updated
[GitHub] spark pull request #15539: [SPARK-17994] [SQL] Add back a file status cache ...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15539#discussion_r84424519
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala ---
@@ -64,11 +66,18 @@ class ListingFileCatalog(
   }

   override def refresh(): Unit = {
+    refresh0(true)
--- End diff --
Good idea
[GitHub] spark pull request #15539: [SPARK-17994] [SQL] Add back a file status cache ...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15539#discussion_r84424418
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/TableFileCatalog.scala ---
@@ -42,24 +43,21 @@ class TableFileCatalog(
   protected val hadoopConf = sparkSession.sessionState.newHadoopConf
+  private val fileStatusCache = FileStatusCache.getOrInitializeShared(new Object(), sparkSession)
--- End diff --
The way we currently refresh tables is by dropping the reference to TableFileCatalog and letting it get GC'ed. Given this strategy, making it private is the simplest way to ensure refresh actually works correctly -- otherwise you have to carefully test that refresh also invalidates shared cache entries.
[GitHub] spark pull request #15539: [SPARK-17994] [SQL] Add back a file status cache ...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/15539#discussion_r84424682
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala ---
+package org.apache.spark.sql.execution.datasources
+
+import java.util.concurrent.ConcurrentHashMap
+import java.util.concurrent.atomic.AtomicBoolean
+
+import scala.collection.JavaConverters._
+
+import com.google.common.cache._
+import org.apache.hadoop.fs.{FileStatus, Path}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.util.{SerializableConfiguration, SizeEstimator}
+
+/**
+ * A cache of the leaf files of partition directories. We cache these files in order to speed
+ * up iterated queries over the same set of partitions. Otherwise, each query would have to
+ * hit remote storage in order to gather file statistics for physical planning.
+ *
+ * Each resolved catalog table has its own FileStatusCache. When the backing relation for the
+ * table is refreshed via refreshTable() or refreshByPath(), this cache will be invalidated.
+ */
+abstract class FileStatusCache {
+  /**
+   * @return the leaf files for the specified path from this cache, or None if not cached.
+   */
+  def getLeafFiles(path: Path): Option[Array[FileStatus]] = None
+
+  /**
+   * Saves the given set of leaf files for a path in this cache.
+   */
+  def putLeafFiles(path: Path, leafFiles: Array[FileStatus]): Unit
+
+  /**
+   * Invalidates all data held by this cache.
+   */
+  def invalidateAll(): Unit
+}
+
+object FileStatusCache {
+  // Opaque object that uniquely identifies a shared cache user
+  type ClientId = Object
--- End diff --
I think it's better to err on the side of isolation here. Otherwise, it is harder to reason about what is actually invalidated when a table is refreshed.
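As an illustration of the contract quoted above, here is a deliberately minimal, non-evicting implementation; this is a sketch only, not the PR's code, whose shared cache (judging from the imports in the diff) is Guava-backed with SizeEstimator-based memory accounting:

```scala
import java.util.concurrent.ConcurrentHashMap
import org.apache.hadoop.fs.{FileStatus, Path}

// Toy FileStatusCache: stores leaf-file listings per path, with no eviction.
class SimpleFileStatusCache extends FileStatusCache {
  private val cache = new ConcurrentHashMap[Path, Array[FileStatus]]()

  // Return the cached listing for a path, or None on a cache miss.
  override def getLeafFiles(path: Path): Option[Array[FileStatus]] =
    Option(cache.get(path))

  // Record the listing obtained from remote storage.
  override def putLeafFiles(path: Path, leafFiles: Array[FileStatus]): Unit =
    cache.put(path, leafFiles)

  // Drop everything, e.g. when the backing table is refreshed.
  override def invalidateAll(): Unit = cache.clear()
}
```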
[GitHub] spark issue #15539: [SPARK-17994] [SQL] Add back a file status cache for cat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15539 **[Test build #67326 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67326/consoleFull)** for PR 15539 at commit [`262f6ee`](https://github.com/apache/spark/commit/262f6eed6de2a290c7b2ce6b0efefb05d0754629).
[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15541#discussion_r84424097
--- Diff: docs/configuration.md ---
@@ -1342,6 +1342,20 @@ Apart from these, the following properties are also available, and may be useful
   Should be greater than or equal to 1. Number of allowed retries = this value - 1.
+
+  spark.scheduler.taskAssigner
+  roundrobin
+
+  The strategy of how to allocate tasks among workers with free cores. There are three task
+  assigners (roundrobin, packed, and balanced) are supported currently. By default, roundrobin
+  with randomness is used, which tries to allocate task to workers with available cores in
+  roundrobin manner.The packed task assigner tries to allocate tasks to workers with the least
--- End diff --
missed space between . and `The`
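For context, selecting one of the three documented assigners would presumably look like the following; this is a hypothetical usage sketch of the setting proposed in this PR (not part of released Spark), with the key and values taken from the quoted docs:

```scala
import org.apache.spark.SparkConf

// Pick the proposed "packed" assigner; per the documentation above the
// other accepted values would be "roundrobin" (the default) and "balanced".
val conf = new SparkConf()
  .setAppName("packed-scheduling-demo")
  .set("spark.scheduler.taskAssigner", "packed")
```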
[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/15541#discussion_r84424034
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala ---
+package org.apache.spark.scheduler
+
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable.PriorityQueue
+import scala.util.Random
+
+import org.apache.spark.internal.{config, Logging}
+import org.apache.spark.SparkConf
+import org.apache.spark.util.Utils
+
+/** Tracks the current state of the workers with available cores and assigned task list. */
+class OfferState(val workOffer: WorkerOffer) {
+  /** The current remaining cores that can be allocated to tasks. */
+  var coresAvailable: Int = workOffer.cores
+  /** The list of tasks that are assigned to this WorkerOffer. */
+  val tasks = new ArrayBuffer[TaskDescription](coresAvailable)
+}
+
+/**
+ * TaskAssigner is the base class for all task assigner implementations, and can be
+ * extended to implement different task scheduling algorithms.
+ * Together with [[org.apache.spark.scheduler.TaskScheduler TaskScheduler]], TaskAssigner
+ * is used to assign tasks to workers with available cores. Internally, when TaskScheduler
+ * perform task assignment given available workers, it first sorts the candidate tasksets,
+ * and then for each taskset, it takes a number of rounds to request TaskAssigner for task
+ * assignment with different locality restrictions until there is either no qualified
+ * workers or no valid tasks to be assigned.
+ *
+ * TaskAssigner is responsible to maintain the worker availability state and task assignment
+ * information. The contract between [[org.apache.spark.scheduler.TaskScheduler TaskScheduler]]
+ * and TaskAssigner is as follows.
+ *
+ * First, TaskScheduler invokes construct() of TaskAssigner to initialize the its internal
+ * worker states at the beginning of resource offering.
+ *
+ * Second, before each round of task assignment for a taskset, TaskScheduler invoke the init()
+ * of TaskAssigner to initialize the data structure for the round.
+ *
+ * Third, when performing real task assignment, hasNext()/getNext() is used by TaskScheduler
+ * to check the worker availability and retrieve current offering from TaskAssigner.
+ *
+ * Fourth, then offerAccepted is used by TaskScheduler to notify the TaskAssigner so that
+ * TaskAssigner can decide whether the current offer is valid or not for the next request.
+ *
+ * Fifth, After task assignment is done, TaskScheduler invokes the tasks() to
+ * retrieve all the task assignment information.
+ */
+private[scheduler] abstract class TaskAssigner {
+  protected var offer: Seq[OfferState] = _
+  protected var cpuPerTask = 1
+
+  protected def withCpuPerTask(cpuPerTask: Int): Unit = {
+    this.cpuPerTask = cpuPerTask
+  }
+
+  /** The final assigned offer returned to TaskScheduler. */
+  final def tasks: Seq[ArrayBuffer[TaskDescription]] = offer.map(_.tasks)
+
+  /** Invoked at the beginning of resource offering to construct the offer with the workoffers. */
+  def construct(workOffer: Seq[WorkerOffer]): Unit = {
+    offer = Random.shuffle(workOffer.map(o => new OfferState(o)))
+  }
+
+  /** Invoked at each round of Taskset assignment to initialize the internal structure. */
+  def init(): Unit
+
+  /**
+   * Tests Whether there is offer available to be used inside of one round of Taskset assignment.
+   * @return `true` if a subsequent call to `next` will yield an element,
+   *         `false` otherwise.
+   */
+  def hasNext: Boolean
+
+  /**
+   * Produces next worker offer based on the task assignment strategy.
+   * @return the next available offer, if `hasNext` is `true`,
+   *         undefined behavior otherwise.
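To make the five-step contract in the scaladoc above concrete, here is a self-contained toy assigner that follows the same protocol. It is an illustration only, not the PR's code; in particular the `getNext()`/`offerAccepted()` signatures are assumptions, since the quoted diff is truncated before they appear:

```scala
import scala.collection.mutable.ArrayBuffer

// Toy mirror of the documented protocol: construct() once per resource
// offering, init() once per assignment round, hasNext/getNext() to walk the
// offers, offerAccepted() to report the outcome, and tasks to collect results.
case class Offer(host: String, var cores: Int) {
  val tasks = new ArrayBuffer[String]()  // task names launched on this worker
}

class RoundRobinAssignerSketch {
  private var offers: IndexedSeq[Offer] = Vector.empty
  private var idx = 0

  def construct(workOffers: Seq[Offer]): Unit = offers = workOffers.toIndexedSeq
  def init(): Unit = idx = 0

  // True while some worker still has a free core this round.
  def hasNext: Boolean = offers.exists(_.cores > 0)

  /** Next worker with a free core, in round-robin order; only valid if hasNext. */
  def getNext(): Offer = {
    while (offers(idx % offers.size).cores == 0) idx += 1
    val o = offers(idx % offers.size)
    idx += 1
    o
  }

  /** The scheduler reports whether it launched a task on the offer just returned. */
  def offerAccepted(offer: Offer, launched: Boolean): Unit =
    if (launched) offer.cores -= 1

  def tasks: Seq[Seq[String]] = offers.map(_.tasks.toSeq)
}
```

A "packed" variant would differ only in `getNext()`, preferring the worker with the fewest free cores instead of rotating.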
[GitHub] spark issue #15484: [SPARK-17868][SQL] Do not use bitmasks during parsing an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15484 **[Test build #67325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67325/consoleFull)** for PR 15484 at commit [`925a5ca`](https://github.com/apache/spark/commit/925a5cab679f2069309ea50ba3080e2b44b7ea3a).
[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15541#discussion_r84423237
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala --- (same hunk as quoted above)
[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15541#discussion_r84422908
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala --- (same hunk as quoted above)
[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15541#discussion_r84422647
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala ---
+  /**
+   * Tests Whether there is offer available to be used inside of one round of Taskset assignment.
+   * @return `true` if a subsequent call to `next` will yield an element,
--- End diff --
`@return` is not aligned with the line above.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test PASSed.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67319/
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67319/consoleFull)** for PR 15575 at commit [`2eee2ee`](https://github.com/apache/spark/commit/2eee2eea81d665706a76a5614d26b4401a07f127).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15541#discussion_r84422357
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala ---
+ * Second, before each round of task assignment for a taskset, TaskScheduler invoke the init()
--- End diff --
`invoke` -> `invokes`
[GitHub] spark issue #15568: [SPARK-18028][SQL] simplify TableFileCatalog
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15568 @mallman, item 4 is a potential problem in the future. The current workflow is: we get the `MetastoreRelation` via `HiveMetastoreCatalog.lookupRelation`, which always lowercases the database and table names. Then we construct `TableFileCatalog` and call `ExternalCatalog.listPartitionsByFilter` with those names, so we won't hit a case sensitivity problem here. However, we may later have a workflow that calls `listPartitionsByFilter` directly with user-supplied database and table names, e.g. #15302. Then we need to care about case sensitivity.
[GitHub] spark issue #14847: [SPARK-17254][SQL] Add StopAfter physical plan for the f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14847 **[Test build #67324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67324/consoleFull)** for PR 14847 at commit [`3c7c1b5`](https://github.com/apache/spark/commit/3c7c1b506b135a1b980d19d98aaa9e41c4fa9018).
[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15541#discussion_r84422079
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala ---
+ * is used to assign tasks to workers with available cores. Internally, when TaskScheduler
+ * perform task assignment given available workers, it first sorts the candidate tasksets,
--- End diff --
`perform` -> `performs`
[GitHub] spark issue #15541: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user zhzhan commented on the issue: https://github.com/apache/spark/pull/15541 @gatorsmile I didn't see your new comments
[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15541#discussion_r84421949
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala ---
+/** Tracks the current state of the workers with available cores and assigned task list. */
+class OfferState(val workOffer: WorkerOffer) {
--- End diff --
Is this class private to `scheduler`?
[GitHub] spark issue #13891: [SPARK-6685][MLLIB]Use DSYRK to compute AtA in ALS
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13891 **[Test build #67323 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67323/consoleFull)** for PR 13891 at commit [`dc4f4ba`](https://github.com/apache/spark/commit/dc4f4badba26635aa95c6ade5a589d4bd50ae886).
[GitHub] spark pull request #15568: [SPARK-18028][SQL] simplify TableFileCatalog
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15568#discussion_r84421537
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/TableFileCatalog.scala ---
@@ -102,6 +95,13 @@ class TableFileCatalog(
   }

   override def inputFiles: Array[String] = allPartitions.inputFiles
+
+  override def equals(o: Any): Boolean = o match {
--- End diff --
Under a Hive context, we cache the `LogicalRelation` for every data source table (including those converted from Hive), which means every table always gets the same `TableFileCatalog` instance. However, that is not true in sql/core: there we re-construct the `TableFileCatalog` and `LogicalRelation` every time we look up a table. Thus we may hit a cache miss even if the table is cached, because two different `TableFileCatalog` instances never compare equal. Although it's not a real problem now, I think it's reasonable to follow `ListingFileCatalog` and add `equals` and `hashCode`.
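A sketch of the `equals`/`hashCode` pair being discussed, keyed on table identity so that two catalog instances for the same table compare equal; the class name and the exact fields compared are assumptions for illustration, not the PR's final code:

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Two instances are interchangeable as cache keys iff they describe the
// same catalog table, regardless of when each instance was constructed.
class TableFileCatalogSketch(val table: CatalogTable) {
  override def equals(o: Any): Boolean = o match {
    case other: TableFileCatalogSketch => table.identifier == other.table.identifier
    case _ => false
  }
  override def hashCode(): Int = table.identifier.hashCode()
}
```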
[GitHub] spark issue #15319: [SPARK-17733][SQL] InferFiltersFromConstraints rule neve...
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/15319 Thanks @jiangxb1987, this equivalence class approach looks pretty solid. I'll take a closer look tomorrow!
[GitHub] spark issue #13891: [SPARK-6685][MLLIB]Use DSYRK to compute AtA in ALS
Github user hqzizania commented on the issue: https://github.com/apache/spark/pull/13891
@yanboliang So sorry for my late response. Some regression performance test results:
Datasets: generated with [genExplicitTestData](https://github.com/apache/spark/pull/13891/files#diff-2c1dd810bcfabf444a0ecdcd613e954aR393), with numUsers = 2, numItems = 2000
Single-node cluster: 16 physical cores, 100GB memory
ALS: numUserBlocks = 30, numItemBlocks = 30, so [computeFactors](https://github.com/apache/spark/pull/13891/files#diff-be65dd1d6adc53138156641b610fcadaR1291) runs with 30 partitions in parallel.
ALS rank = 1024, computing time and used memory for computeFactors: ![image](https://cloud.githubusercontent.com/assets/9315372/19587668/cb0065a8-9792-11e6-8226-b5a448a8dc9a.png)
ALS rank = 129, computing time for computeFactors: ![image](https://cloud.githubusercontent.com/assets/9315372/19587672/d0197d54-9792-11e6-86df-b260d165c8e6.png)
ALS rank = 512, computing time for computeFactors: ![image](https://cloud.githubusercontent.com/assets/9315372/19587680/d542783a-9792-11e6-95a2-c2001eb59853.png)
The results show that this patch is much faster when the rank is large, but we should reset the two threshold values of "doStack". A remaining problem is that the unit test for this patch would take a long time, since the rank must be larger than 1024. Should I just remove the unit test?
[GitHub] spark issue #15576: [SPARK-17674][SPARKR] check for warning in test output
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15576 Thanks for fixing this! I encountered this issue before.
[GitHub] spark issue #15541: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15541 Accidentally, I deleted all my comments. You might need to check your emails to find them. :)
[GitHub] spark issue #15579: Added support for extra command in front of spark.
Github user sheepduke commented on the issue: https://github.com/apache/spark/pull/15579 This is rather useful sometimes, because you want to add extra tuning arguments like 'numactl'. Otherwise it is not even possible to achieve that. Yes, it only works with YARN for now because we have requirements on it. In the future, more features may be added for other facilities. Any idea or potential improvement?
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test PASSed.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67318/
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575

**[Test build #67318 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67318/consoleFull)** for PR 15575 at commit [`c5b6626`](https://github.com/apache/spark/commit/c5b6626c144a97054f74adc3158ceea6de3cea58).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15579: Added support for extra command in front of spark.
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15579 Please also review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Thanks a lot!
[GitHub] spark issue #15579: Added support for extra command in front of spark.
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15579 This seems really weird to do (and also only works in YARN).
[GitHub] spark issue #15576: [SPARK-17674][SPARKR] check for warning in test output
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15576 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67321/
[GitHub] spark issue #15576: [SPARK-17674][SPARKR] check for warning in test output
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15576 Merged build finished. Test PASSed.
[GitHub] spark issue #15576: [SPARK-17674][SPARKR] check for warning in test output
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15576

**[Test build #67321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67321/consoleFull)** for PR 15576 at commit [`1c9bd2e`](https://github.com/apache/spark/commit/1c9bd2ef353802132e33558a61725efdd8bd).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15513 Oh, no. I will try to test each when writing the documentation. Please ignore minor incorrectness here.
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15513 Do we have binary literals anyway?
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15513

Then I will do as below:

**Literal only**

```
a string literal.
a numeric literal that defines ...
a binary literal that represents ...
For example, ...
```

**Column**

```
a timestamp expression.
a binary expression that represents ...
a date expression that defines ...
For example, ...
```
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15513 Sure, sounds great.
[GitHub] spark issue #15579: Added support for extra command in front of spark.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15579 Can one of the admins verify this patch?
[GitHub] spark pull request #15579: Added support for extra command in front of spark...
GitHub user sheepduke opened a pull request: https://github.com/apache/spark/pull/15579

Added support for extra command in front of spark.

## What changes were proposed in this pull request?

A minor functional change is added to the YARN facility to make it possible for users to add a command prefix, such as "numactl --cpunodebind=1 ...", in front of Spark. This is useful in some cases for optimization work.

## How was this patch tested?

Since it is only a minor functional patch, it was re-compiled and tested on our cluster with BigBench to make sure that it works properly.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sheepduke/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15579.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15579

commit a24aff91d59dd848cdd5802602b340db832e5507
Author: U-CCR\daianyue
Date: 2016-10-21T03:15:46Z

    Added support for extra command in front of spark. Such as numactl etc.
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15513 How about "a string literal" vs "a string expression"?
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15513 Also, I will try to consolidate multiple usages and take out `_FUNC_` in the extended part too.
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15513

@rxin, how about this?

**Literal only**

```
a string type value.
a numeric type value that defines ...
a binary type value that defines ...
For example, ...
```

**Column**

```
a timestamp type column/value.
a binary type column/value that represents ...
a date type column/value that represents ...
For example, ...
```

(BTW, I am sure you meant not to mention implicit casting even knowing this, but I am a little bit worried. This wording was actually modeled after Oracle's [1]. I will try to take it out anyway, but wanted to note it just in case.)

[1] https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions011.htm#i82074

> any numeric datatype or any nonnumeric datatype that can be implicitly converted to a numeric datatype.
[GitHub] spark issue #15551: [SPARK-18012][SQL] Simplify WriterContainer
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15551 @rxin : Thanks for notifying me.
[GitHub] spark issue #15569: [SPARK-18029][SQL] PruneFileSourcePartitions should not ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15569 thanks for the review, merging to master!
[GitHub] spark pull request #15569: [SPARK-18029][SQL] PruneFileSourcePartitions shou...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15569
[GitHub] spark issue #15481: [SPARK-17929] [CORE] Fix deadlock when CoarseGrainedSche...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15481 **[Test build #67322 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67322/consoleFull)** for PR 15481 at commit [`7bf3bf8`](https://github.com/apache/spark/commit/7bf3bf8606f261e2eb1bebd3c6e3c3ff8600e140).
[GitHub] spark issue #15481: [SPARK-17929] [CORE] Fix deadlock when CoarseGrainedSche...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15481 retest this please
[GitHub] spark issue #15576: [SPARK-17674][SPARKR] check for warning in test output
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15576 **[Test build #67321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67321/consoleFull)** for PR 15576 at commit [`1c9bd2e`](https://github.com/apache/spark/commit/1c9bd2ef353802132e33558a61725efdd8bd).
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @yhuai : please see the table in this PR's description. I have added a `comment` (last column) for each entry to point out those cases.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15575

> I felt that there are numerous places where child's output ordering could be used but the operators don't set it

Can you list them here?
[GitHub] spark issue #15576: [SPARK-17674][SPARKR] check for warning in test output
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15576 Rebased. This should pass now.
[GitHub] spark issue #15576: [WIP][SPARK-17674][SPARKR] check for warning in test out...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15576

Test failure is intentional; it's picking up the following warnings:

```
Warnings ---
1. createDataFrame uses files for large objects (@test_sparkSQL.R#215) - Use Sepal_Length instead of Sepal.Length as column name
2. createDataFrame uses files for large objects (@test_sparkSQL.R#215) - Use Sepal_Width instead of Sepal.Width as column name
3. createDataFrame uses files for large objects (@test_sparkSQL.R#215) - Use Petal_Length instead of Petal.Length as column name
4. createDataFrame uses files for large objects (@test_sparkSQL.R#215) - Use Petal_Width instead of Petal.Width as column name
```

This is fixed in PR #15560.
[GitHub] spark pull request #15560: [SPARKR] fix warnings
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15560
[GitHub] spark pull request #15578: Branch 2.0
Github user wankunde closed the pull request at: https://github.com/apache/spark/pull/15578
[GitHub] spark issue #15560: [SPARKR] fix warnings
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15560 merged to master and branch-2.0
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67320/consoleFull)** for PR 15575 at commit [`cfcd2a7`](https://github.com/apache/spark/commit/cfcd2a79231ed16c2a3e82551e87a6de888c2af6).
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575

Agree with the planner behavior described in the last few comments (relevant code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L165).

I have updated the description of the PR with a table which shows the output partitioning and ordering for each implementation of `UnaryExecNode`. It will help with the review in case we are missing something / setting something wrong. `BroadcastExchangeExec`, `CoalesceExec`, `CollectLimitExec`, `ExpandExec`, `TakeOrderedAndProjectExec` and `ShuffleExchange` do not use the child's output partitioning.

I felt that there are numerous places where the child's output ordering could be used but the operators don't set it (thus using the default `Nil`). I won't make those modifications in this PR, as I want to do only the refactoring in this diff.
[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14079 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67317/
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67319/consoleFull)** for PR 15575 at commit [`2eee2ee`](https://github.com/apache/spark/commit/2eee2eea81d665706a76a5614d26b4401a07f127).
[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14079 Merged build finished. Test PASSed.
[GitHub] spark pull request #15577: [SPARK-18030][Tests] Adds more checks to collect ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15577
[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14079

**[Test build #67317 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67317/consoleFull)** for PR 14079 at commit [`6b3babc`](https://github.com/apache/spark/commit/6b3babc703668e32f3ec8d5026baa4f2b161a383).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class YarnSchedulerBackendSuite extends SparkFunSuite with MockitoSugar with LocalSparkContext`
[GitHub] spark issue #15577: [SPARK-18030][Tests] Adds more checks to collect more in...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15577 Thanks! Merging to master and 2.0.
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15513 Merged build finished. Test PASSed.
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15513 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67314/
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15513

**[Test build #67314 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67314/consoleFull)** for PR 15513 at commit [`7841860`](https://github.com/apache/spark/commit/7841860bf50fba8b31da774574ad9818ee678d85).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15513 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67315/
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15513 Merged build finished. Test PASSed.
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15513

**[Test build #67315 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67315/consoleFull)** for PR 15513 at commit [`01eecfe`](https://github.com/apache/spark/commit/01eecfe44c5edc3db0d25f43a7ae8f80ca07ac61).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15541: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user zhzhan commented on the issue: https://github.com/apache/spark/pull/15541 @rxin Can you please take a look, and let me know if you have any concern?
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15513 Merged build finished. Test PASSed.
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15513 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67313/
[GitHub] spark issue #15513: [SPARK-17963][SQL][Documentation] Add examples (extend) ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15513

**[Test build #67313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67313/consoleFull)** for PR 15513 at commit [`91d2ab5`](https://github.com/apache/spark/commit/91d2ab5174273623819a6b18f0d2d557c54603f7).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15578: Branch 2.0
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15578 Can one of the admins verify this patch?
[GitHub] spark pull request #15578: Branch 2.0
GitHub user wankunde opened a pull request: https://github.com/apache/spark/pull/15578

Branch 2.0

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wankunde/spark branch-2.0

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15578.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15578

commit 72d9fba26c19aae73116fd0d00b566967934c6fc
Author: WeichenXu
Date: 2016-09-22T11:35:54Z

    [SPARK-17281][ML][MLLIB] Add treeAggregateDepth parameter for AFTSurvivalRegression

    ## What changes were proposed in this pull request?
    Add treeAggregateDepth parameter for AFTSurvivalRegression to keep consistent with LiR/LoR.

    ## How was this patch tested?
    Existing tests.

    Author: WeichenXu
    Closes #14851 from WeichenXu123/add_treeAggregate_param_for_survival_regression.

commit 8a02410a92429bff50d6ce082f873cea9e9fa91e
Author: Wenchen Fan
Date: 2016-09-22T15:25:32Z

    [SQL][MINOR] correct the comment of SortBasedAggregationIterator.safeProj

    ## What changes were proposed in this pull request?
    This comment went stale long time ago, this PR fixes it according to my understanding.

    ## How was this patch tested?
    N/A

    Author: Wenchen Fan
    Closes #15095 from cloud-fan/update-comment.

commit 17b72d31e0c59711eddeb525becb8085930eadcc
Author: Dhruve Ashar
Date: 2016-09-22T17:10:37Z

    [SPARK-17365][CORE] Remove/Kill multiple executors together to reduce RPC call time.

    ## What changes were proposed in this pull request?
    We are killing multiple executors together instead of iterating over expensive RPC calls to kill single executor.

    ## How was this patch tested?
    Executed sample spark job to observe executors being killed/removed with dynamic allocation enabled.

    Author: Dhruve Ashar
    Closes #15152 from dhruve/impr/SPARK-17365.

commit 9f24a17c59b1130d97efa7d313c06577f7344338
Author: Shivaram Venkataraman
Date: 2016-09-22T18:52:42Z

    Skip building R vignettes if Spark is not built

    ## What changes were proposed in this pull request?
    When we build the docs separately we don't have the JAR files from the Spark build in the same tree. As the SparkR vignettes need to launch a SparkContext to be built, we skip building them if JAR files don't exist.

    ## How was this patch tested?
    To test this we can run the following:
    ```
    build/mvn -DskipTests -Psparkr clean
    ./R/create-docs.sh
    ```
    You should see a line `Skipping R vignettes as Spark JARs not found` at the end.

    Author: Shivaram Venkataraman
    Closes #15200 from shivaram/sparkr-vignette-skip.

commit 85d609cf25c1da2df3cd4f5d5aeaf3cbcf0d674c
Author: Burak Yavuz
Date: 2016-09-22T20:05:41Z

    [SPARK-17613] S3A base paths with no '/' at the end return empty DataFrames

    ## What changes were proposed in this pull request?
    Consider you have a bucket as `s3a://some-bucket` and under it you have files:
    ```
    s3a://some-bucket/file1.parquet
    s3a://some-bucket/file2.parquet
    ```
    Getting the parent path of `s3a://some-bucket/file1.parquet` yields `s3a://some-bucket/` and the ListingFileCatalog uses this as the key in the hash map. When catalog.allFiles is called, we use `s3a://some-bucket` (no slash at the end) to get the list of files, and we're left with an empty list! This PR fixes this by adding a `/` at the end of the `URI` iff the given `Path` doesn't have a parent, i.e. is the root. This is a no-op if the path already had a `/` at the end, and is handled through the Hadoop Path, path merging semantics.

    ## How was this patch tested?
    Unit test in `FileCatalogSuite`.

    Author: Burak Yavuz
    Closes #15169 from brkyvz/SPARK-17613.

commit 3cdae0ff2f45643df7bc198cb48623526c7eb1a6
Author: Shixiong Zhu
Date: 2016-09-22T21:26:45Z

    [SPARK-17638][STREAMING] Stop JVM StreamingContext when the Python process is dead

    ## What changes were proposed in this pull request?
    When the Python process is dead, the JVM StreamingContext is still running. Hence we will see a lot of Py4jException before the JVM process exits. It's better to stop the JVM StreamingContext to avoid those annoying logs.

    ## How was this patch tested?
    Jenkins

    Author: Shixiong Zhu
    Closes #15201 from zsxwing/stop-jvm-ssc.

commit 0d634875026ccf1eaf984996e9460d7673561f80
Author: Herman van Hovell
Date: 2016-09-22T21:29:27Z

    [SPARK-17616][SQL] Support a single distinct aggregate combined with a non-partial aggregate

    ## What changes were proposed in this pull request?
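The S3A fix in commit 85d609cf above hinges on Hadoop `Path` semantics: the parent of `s3a://bucket/file1.parquet` is the root with a trailing slash, while a root written without the slash is a different string key. A minimal sketch of the normalization idea (not the exact patch code, just the described behavior under that assumption):

```scala
import java.net.URI
import org.apache.hadoop.fs.Path

object RootPathSketch {
  // Ensure a root path always carries a trailing '/', so lookups keyed by
  // a file's parent URI and listings of the root agree on the same key.
  // Path.getParent returns null only for a filesystem root.
  def normalizedUri(path: Path): URI = {
    if (path.getParent == null && !path.toString.endsWith("/")) {
      new URI(path.toString + "/") // root without a slash: append one
    } else {
      path.toUri // non-root paths and already-slashed roots are left alone
    }
  }
}
```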
[GitHub] spark issue #14957: [SPARK-4502][SQL]Support parquet nested struct pruning a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14957 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67316/
[GitHub] spark issue #14957: [SPARK-4502][SQL]Support parquet nested struct pruning a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14957 Merged build finished. Test PASSed.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 Yeah. So I don't see any exception: a unary node, unless it is `ShuffleExchange`, shouldn't have an `outputPartitioning` other than `child.outputPartitioning`.
[GitHub] spark issue #14957: [SPARK-4502][SQL]Support parquet nested struct pruning a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14957

**[Test build #67316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67316/consoleFull)** for PR 14957 at commit [`23465ba`](https://github.com/apache/spark/commit/23465babd0f60db8a79e70f0589af2ec5bf360eb).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15575 Our planner decides whether to add a `ShuffleExchange` by considering `outputPartitioning` and `requiredDistribution` together. If the `outputPartitioning` of the child does not satisfy the `requiredDistribution` of the parent, we will add a `ShuffleExchange` operator. When we say `child`, it is literally the child of a node; it does not matter whether the child is a `ShuffleExchange` or not. For a node whose `outputPartitioning` is `child.outputPartitioning`, that node does not shuffle data.
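As a rough illustration of the rule described here, a simplified sketch of the decision (not the actual `EnsureRequirements` code; `defaultPartitioning` is a hypothetical helper standing in for how the planner picks a partitioning that satisfies the requirement):

```scala
// Simplified sketch: insert a shuffle for each child whose output
// partitioning does not satisfy the parent's required distribution.
def ensureDistribution(operator: SparkPlan): SparkPlan = {
  val children = operator.children.zip(operator.requiredChildDistribution).map {
    case (child, required) if child.outputPartitioning.satisfies(required) =>
      child // data is already laid out as the parent needs: no shuffle
    case (child, required) =>
      ShuffleExchange(defaultPartitioning(required), child) // repartition first
  }
  operator.withNewChildren(children)
}
```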
[GitHub] spark issue #15566: [SPARK-18026][SQL] should not always lowercase partition...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15566 Sorry, I was wrong. The root cause of SPARK-17990 is that, when we write data for a Hive table, we ignore the fact that Hive tables are not case-preserving, and create the partition directory with the case-preserving partition columns. I think it's not related to this PR and we should fix it in a new PR.
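To make the mismatch concrete, a hypothetical illustration (the paths and column name below are made up, not taken from the JIRA):

```scala
// Spark writes a partition directory preserving the column's case, while
// Hive, which lower-cases identifiers, looks up the lower-cased path, so
// the written partition becomes invisible when queried through Hive.
val sparkWrites = "/warehouse/t/partCol=1" // case-preserving directory
val hiveExpects = "/warehouse/t/partcol=1" // lower-cased lookup key
assert(sparkWrites != hiveExpects)
```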
[GitHub] spark pull request #15556: [SPARK-18010][Core] Reduce work performed for bui...
Github user vijoshi commented on a diff in the pull request: https://github.com/apache/spark/pull/15556#discussion_r84413566

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -43,38 +43,56 @@ private[spark] class ReplayListenerBus extends SparkListenerBus with Logging {
    * @param sourceName Filename (or other source identifier) from whence @logData is being read
    * @param maybeTruncated Indicate whether log file might be truncated (some abnormal situations
    *        encountered, log file might not finished writing) or not
+   * @param eventsFilter Filter function to select JSON event strings in the log data stream that
+   *        should be parsed and replayed. When not specified, all event strings in the log data
+   *        are parsed and replayed.
    */
   def replay(
       logData: InputStream,
       sourceName: String,
-      maybeTruncated: Boolean = false): Unit = {
-    var currentLine: String = null
-    var lineNumber: Int = 1
+      maybeTruncated: Boolean = false,
+      eventsFilter: (String) => Boolean = ReplayListenerBus.SELECT_ALL_FILTER): Unit = {
     try {
-      val lines = Source.fromInputStream(logData).getLines()
-      while (lines.hasNext) {
-        currentLine = lines.next()
+      val lineEntries = Source.fromInputStream(logData)
+        .getLines()
+        .zipWithIndex
+        .filter(entry => eventsFilter(entry._1))
+
+      var entry: (String, Int) = ("", 0)
+
+      while (lineEntries.hasNext) {
         try {
-          postToAll(JsonProtocol.sparkEventFromJson(parse(currentLine)))
+          entry = lineEntries.next()
+          postToAll(JsonProtocol.sparkEventFromJson(parse(entry._1)))
         } catch {
           case jpe: JsonParseException =>
             // We can only ignore exception from last line of the file that might be truncated
-            if (!maybeTruncated || lines.hasNext) {
+            // the last entry may not be the very last line in the event log, but we treat it
+            // as such in a best effort to replay the given input
+            if (!maybeTruncated || lineEntries.hasNext) {
               throw jpe
             } else {
               logWarning(s"Got JsonParseException from log file $sourceName" +
-                s" at line $lineNumber, the file might not have finished writing cleanly.")
+                s" at line number ${entry._2}, the file might not have finished writing cleanly.")
             }
+
+          case e: Exception =>
--- End diff --

Needed to have the line number at which the failure happened, since we had that in the earlier implementation. The line and line-number values are not in scope in the outer catch, so they can't be logged there. Open to a better way to do this, if any.
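For illustration, a caller could use the new `eventsFilter` parameter to skip parsing uninteresting lines entirely; this is a hypothetical usage sketch, not code from the PR (the event-name strings match what `JsonProtocol` emits):

```scala
// Hypothetical usage of the eventsFilter hook added in this diff: replay
// only job start/end events, skipping JSON parsing for everything else.
val jobEventsOnly: String => Boolean = { line =>
  line.contains("\"Event\":\"SparkListenerJobStart\"") ||
  line.contains("\"Event\":\"SparkListenerJobEnd\"")
}
// bus.replay(inputStream, "eventLog", maybeTruncated = false, eventsFilter = jobEventsOnly)
```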
[GitHub] spark pull request #13507: [SPARK-15765][SQL][Streaming] Make continuous Par...
Github user lw-lin closed the pull request at: https://github.com/apache/spark/pull/13507
[GitHub] spark issue #13507: [SPARK-15765][SQL][Streaming] Make continuous Parquet wr...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13507 I'm closing this in favor of SPARK-17924, thanks!