[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7434#issuecomment-121797911

[Test build #37435 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37435/console) for PR 7434 at commit [`2225331`](https://github.com/apache/spark/commit/2225331ea36e4a39f097e53afead6930e3cb0ed5).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute`
  * `abstract class Star extends LeafExpression with NamedExpression`
  * `case class UnresolvedAlias(child: Expression) extends UnaryExpression with NamedExpression`
  * `abstract class LeafExpression extends Expression`
  * `abstract class UnaryExpression extends Expression`
  * `abstract class BinaryExpression extends Expression`
  * `case class SortOrder(child: Expression, direction: SortDirection) extends UnaryExpression`
  * `trait AggregateExpression extends Expression`
  * `trait PartialAggregate extends AggregateExpression`
  * `case class Min(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class Max(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class Count(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class Average(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class Sum(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class SumDistinct(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class First(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class Last(child: Expression) extends UnaryExpression with PartialAggregate`
  * `trait Generator extends Expression`
  * `case class Explode(child: Expression) extends UnaryExpression with Generator`
  * `trait NamedExpression extends Expression`
  * `abstract class Attribute extends LeafExpression with NamedExpression`
  * `case class PrettyAttribute(name: String) extends Attribute`
  * `abstract class LeafNode extends LogicalPlan`
  * `abstract class UnaryNode extends LogicalPlan`
  * `abstract class BinaryNode extends LogicalPlan`

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7434#issuecomment-121797926 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8882][SPARK-5681][Streaming]Add a new R...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/7276#discussion_r34752101

    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala ---
    @@ -182,4 +182,5 @@ private[streaming] class ReceiverSupervisorImpl(
         logDebug(s"Cleaning up blocks older then $cleanupThreshTime")
         receivedBlockHandler.cleanupOldBlocks(cleanupThreshTime.milliseconds)
       }
    +
    --- End diff --

nit: extra line
[GitHub] spark pull request: [SPARK-8600] [ML] Naive Bayes API for spark.ml...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7284#issuecomment-121816137 [Test build #37450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37450/consoleFull) for PR 7284 at commit [`c3de687`](https://github.com/apache/spark/commit/c3de6874b6b7a73e652cb129d0bb18327594f32f).
[GitHub] spark pull request: [SPARK-9018][MLLIB] add stopwatches
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7415#issuecomment-121816166 [Test build #37449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37449/consoleFull) for PR 7415 at commit [`40b4347`](https://github.com/apache/spark/commit/40b43476dafcd42a562027740f4efe7089d0efd4).
[GitHub] spark pull request: [SPARK-8882][SPARK-5681][Streaming]Add a new R...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/7276#discussion_r34752882

    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala ---
    @@ -71,13 +82,57 @@ class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false
       )
       private val listenerBus = ssc.scheduler.listenerBus
    +  /** Enumeration to identify current state of the ReceiverTracker */
    +  object TrackerState extends Enumeration {
    +    type CheckpointState = Value
    +    val Initialized, Started, Stopping, Stopped = Value
    +  }
    +  import TrackerState._
    +
    +  /** State of the tracker. Protected by trackerStateLock */
    +  private var trackerState = Initialized
    +
    +  /** trackerStateLock is used to protect reading/writing trackerState */
    +  private val trackerStateLock = new AnyRef
    +
       // endpoint is created when generator starts.
       // This not being null means the tracker has been started and not stopped
       private var endpoint: RpcEndpointRef = null
    +  private val schedulingPolicy: ReceiverSchedulingPolicy =
    +    new LoadBalanceReceiverSchedulingPolicyImpl()
    +
    +  /**
    +   * Track receivers' status for scheduling
    +   */
    +  private val receiverTrackingInfos = new HashMap[Int, ReceiverTrackingInfo]
    +
    +  /**
    +   * Store all preferred locations for all receivers. We need this information to schedule receivers
    +   */
    +  private val receiverPreferredLocations = new HashMap[Int, Option[String]]
    +
    +  /** Use a separate lock to avoid dead-lock */
    +  private val receiverTrackingInfosLock = new AnyRef
    +
    +  /** Check if tracker has been marked for starting */
    +  private def isTrackerStarted(): Boolean = trackerStateLock.synchronized {
    --- End diff --

nit: Please move these helper methods lower in the class after `hasUnallocatedBlocks`
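The diff in this message guards an `Enumeration` state with a dedicated lock object. As a rough, standalone illustration of that pattern (not the `ReceiverTracker` code itself), the sketch below uses hypothetical names `TrackerStateSketch`, `Tracker`, `stateLock`, `start()` and `stop()`, and the transitions it allows are assumptions made for the example.

```scala
// Hypothetical standalone sketch; this is not the ReceiverTracker implementation.
object TrackerStateSketch {
  /** Lifecycle states, reusing the state names that appear in the diff. */
  object TrackerState extends Enumeration {
    val Initialized, Started, Stopping, Stopped = Value
  }
  import TrackerState._

  class Tracker {
    // Dedicated lock object: every read or write of `state` goes through it.
    private val stateLock = new AnyRef
    private var state: TrackerState.Value = Initialized

    /** True only while the tracker is in the Started state. */
    def isStarted: Boolean = stateLock.synchronized { state == Started }

    /** Moves Initialized -> Started; returns false from any other state. */
    def start(): Boolean = stateLock.synchronized {
      if (state == Initialized) { state = Started; true } else false
    }

    /** Moves Started -> Stopped; returns false from any other state. */
    def stop(): Boolean = stateLock.synchronized {
      if (state == Started) { state = Stopped; true } else false
    }
  }
}
```

Doing the check and the update inside one `synchronized` block is what keeps two concurrent `start()` calls from both succeeding.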
[GitHub] spark pull request: [SPARK-9030][STREAMING][WIP] Add Kinesis.creat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7413#issuecomment-121821459 [Test build #37457 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37457/consoleFull) for PR 7413 at commit [`18c2208`](https://github.com/apache/spark/commit/18c2208f57b9f99c42b26e9fae849da52c2a05df).
[GitHub] spark pull request: [SPARK-9030][STREAMING][WIP] Add Kinesis.creat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7413#issuecomment-121821781

[Test build #37457 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37457/console) for PR 7413 at commit [`18c2208`](https://github.com/apache/spark/commit/18c2208f57b9f99c42b26e9fae849da52c2a05df).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9018][MLLIB] add stopwatches
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7415#issuecomment-121821799 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9030][STREAMING][WIP] Add Kinesis.creat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7413#issuecomment-121821788 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8245][SQL] FormatNumber/Length Support ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/7034#issuecomment-121821688 LGTM
[GitHub] spark pull request: [SPARK-8925] [MLlib] Add @since tags to mllib....
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7436#issuecomment-121821634 @sthota2014 You don't need to tag private or package-private methods. We only need `@since` on public methods.
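To make the rule in this comment concrete, here is a small sketch of a class where only the public surface carries `@since`. The class `Scaler`, its members, and the version number `1.5.0` are placeholders invented for illustration; they are not taken from MLlib or from the pull request.

```scala
// Hypothetical class for illustration; `Scaler` is not part of MLlib,
// and the version numbers in the tags are placeholders.
object SinceTagSketch {
  /**
   * Scales values by a constant factor.
   * @since 1.5.0
   */
  class Scaler(val factor: Double) {
    /**
     * Public method: part of the API, so it carries a version tag.
     * @since 1.5.0
     */
    def transform(xs: Seq[Double]): Seq[Double] = xs.map(scale)

    // Private helper: not part of the public API, so no `@since` tag.
    private def scale(x: Double): Double = x * factor

    // Package-private (here: object-private) helper: likewise untagged.
    private[SinceTagSketch] def describe: String = s"Scaler($factor)"
  }
}
```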
[GitHub] spark pull request: [SPARK-9018][MLLIB] add stopwatches
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7415#issuecomment-121821637

[Test build #37449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37449/console) for PR 7415 at commit [`40b4347`](https://github.com/apache/spark/commit/40b43476dafcd42a562027740f4efe7089d0efd4).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9022] [SQL] Generated projections for U...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/7437#issuecomment-121825079 @davies doesn't need to be part of this pr, but can you think about how we can do codegen testing with this new Unsafe project?
[GitHub] spark pull request: [SPARK-7119][SQL]Give script a default serde w...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6638#issuecomment-121827751 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9058][SQL] Split projectionCode if it i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7418#issuecomment-121827685 [Test build #37461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37461/consoleFull) for PR 7418 at commit [`12d3794`](https://github.com/apache/spark/commit/12d3794b009a90d21de9a1d52d4f3ea9503f2b58).
[GitHub] spark pull request: [SPARK-8245][SQL] FormatNumber/Length Support ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7034
[GitHub] spark pull request: [SPARK-9058][SQL] Split projectionCode if it i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7418#issuecomment-121827606 Merged build triggered.
[GitHub] spark pull request: [SPARK-9058][SQL] Split projectionCode if it i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7418#issuecomment-121827618 Merged build started.
[GitHub] spark pull request: [SPARK-7119][SQL]Give script a default serde w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6638#issuecomment-121827741

[Test build #25 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/25/console) for PR 6638 at commit [`2ee0488`](https://github.com/apache/spark/commit/2ee048825ad79a6a533ead969752b435af92166a).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4352][YARN][WIP] Incorporate locality p...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/6394#discussion_r34756031

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -872,6 +872,25 @@ class DAGScheduler(
         // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
         // event.
         stage.latestInfo = StageInfo.fromStage(stage, Some(partitionsToCompute.size))
    +    val taskIdToLocations = try {
    +      stage match {
    +        case s: ShuffleMapStage =>
    +          partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
    +        case s: ResultStage =>
    +          val job = s.resultOfJob.get
    +          partitionsToCompute.map { id =>
    +            val p = job.partitions(id)
    +            (id, getPreferredLocs(stage.rdd, p))
    +          }.toMap
    +      }
    +    } catch {
    +      case NonFatal(e) =>
    +        abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}")
    +        runningStages -= stage
    +        return
    +    }
    +    stage.latestInfo.taskLocalityPreferences = Some(taskIdToLocations.values.toSeq)
    --- End diff --

Thanks @kayousterhout , that's a good idea, I will change the code accordingly.
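The diff in this message builds a map from task id to preferred locations and aborts the stage if the lookup throws a non-fatal exception. A self-contained sketch of that shape follows. The types `Stage`, `ShuffleMapStage` and `ResultStage` here are simplified stand-ins, and `taskLocations` and its `preferredLocs` parameter are hypothetical names; returning an `Either` instead of calling `abortStage` is an assumption made so the example runs without Spark.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-ins; none of these are DAGScheduler's actual types.
object LocalitySketch {
  sealed trait Stage
  final case class ShuffleMapStage(rddId: Int) extends Stage
  // `partitions` maps a task index to the RDD partition that task computes.
  final case class ResultStage(rddId: Int, partitions: IndexedSeq[Int]) extends Stage

  /** Returns Right(task id -> locations), or Left(message) if the lookup throws. */
  def taskLocations(
      stage: Stage,
      partitionsToCompute: Seq[Int],
      preferredLocs: (Int, Int) => Seq[String]): Either[String, Map[Int, Seq[String]]] =
    try {
      Right(stage match {
        case ShuffleMapStage(rdd) =>
          partitionsToCompute.map(id => (id, preferredLocs(rdd, id))).toMap
        case ResultStage(rdd, parts) =>
          // Translate the task index to its partition before asking for locations.
          partitionsToCompute.map(id => (id, preferredLocs(rdd, parts(id)))).toMap
      })
    } catch {
      // NonFatal lets errors such as OutOfMemoryError propagate untouched.
      case NonFatal(e) => Left(s"Task creation failed: $e")
    }
}
```

A caller could pattern match on the result and abort the stage in the `Left` case, which mirrors the `catch` branch in the diff.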
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-121830731 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-121830665

[Test build #37454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37454/console) for PR 7353 at commit [`ca1ae69`](https://github.com/apache/spark/commit/ca1ae69c1baa7d4d14946bdd2638aec47e05be86).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8280][SPARK-8281][SQL]Handle NaN, null ...
Github user yijieshen commented on the pull request: https://github.com/apache/spark/pull/6835#issuecomment-121798697 ok, will do soon
[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7412#discussion_r34750033

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
    @@ -82,20 +84,70 @@ class PrefixSpan private (
           logWarning("Input data is not cached.")
         }
         val minCount = getMinCount(sequences)
    -    val lengthOnePatternsAndCounts =
    -      getFreqItemAndCounts(minCount, sequences).collect()
    -    val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
    -      lengthOnePatternsAndCounts.map(_._1), sequences)
    -    val groupedProjectedDatabase = prefixAndProjectedDatabase
    -      .map(x => (x._1.toSeq, x._2))
    -      .groupByKey()
    -      .map(x => (x._1.toArray, x._2.toArray))
    -    val nextPatterns = getPatternsInLocal(minCount, groupedProjectedDatabase)
    -    val lengthOnePatternsAndCountsRdd =
    -      sequences.sparkContext.parallelize(
    -        lengthOnePatternsAndCounts.map(x => (Array(x._1), x._2)))
    -    val allPatterns = lengthOnePatternsAndCountsRdd ++ nextPatterns
    -    allPatterns
    +    val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, sequences)
    +    val prefixSuffixPairs = getPrefixSuffixPairs(
    +      lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
    +    var patternsCount: Long = lengthOnePatternsAndCounts.count()
    +    var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => (Array(x._1), x._2))
    +    var currentPrefixSuffixPairs = prefixSuffixPairs
    +    while (patternsCount <= minPatternsBeforeShuffle && currentPrefixSuffixPairs.count() != 0) {
    +      val (nextPatternAndCounts, nextPrefixSuffixPairs) =
    +        getPatternCountsAndPrefixSuffixPairs(minCount, currentPrefixSuffixPairs)
    +      patternsCount = nextPatternAndCounts.count().toInt
    +      currentPrefixSuffixPairs = nextPrefixSuffixPairs
    +      allPatternAndCounts = allPatternAndCounts ++ nextPatternAndCounts
    +    }
    +    if (patternsCount > 0) {
    +      val projectedDatabase = currentPrefixSuffixPairs
    +        .map(x => (x._1.toSeq, x._2))
    +        .groupByKey()
    +        .map(x => (x._1.toArray, x._2.toArray))
    +      val nextPatternAndCounts = getPatternsInLocal(minCount, projectedDatabase)
    +      allPatternAndCounts = allPatternAndCounts ++ nextPatternAndCounts
    +    }
    +    allPatternAndCounts
    +  }
    +
    +  /**
    +   * Get the pattern and counts, and prefix suffix pairs
    +   * @param minCount minimum count
    +   * @param prefixSuffixPairs prefix and suffix pairs,
    +   * @return pattern and counts, and prefix suffix pairs
    +   *         (Array[pattern, count], RDD[prefix, suffix ])
    +   */
    +  private def getPatternCountsAndPrefixSuffixPairs(
    +      minCount: Long,
    +      prefixSuffixPairs: RDD[(Array[Int], Array[Int])]):
    +    (RDD[(Array[Int], Long)], RDD[(Array[Int], Array[Int])]) = {
    +    val prefixAndFreqentItemAndCounts = prefixSuffixPairs
    +      .flatMap { case (prefix, suffix) =>
    +      suffix.distinct.map(y => ((prefix.toSeq, y), 1L))
    +    }.reduceByKey(_ + _)
    +      .filter(_._2 >= minCount)
    +    val patternAndCounts = prefixAndFreqentItemAndCounts
    +      .map{ case ((prefix, item), count) => (prefix.toArray :+ item, count) }
    +    val prefixlength = prefixSuffixPairs.first()._1.length
    +    if (prefixlength + 1 >= maxPatternLength) {
    --- End diff --

OK
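The counting step in `getPatternCountsAndPrefixSuffixPairs` can be tried without a cluster. The sketch below reproduces only that step on plain Scala collections, with `groupBy` standing in for `reduceByKey`. The object and method names (`PrefixExtensionSketch`, `frequentExtensions`) are hypothetical, and using `Seq[Int]` in place of `Array[Int]` is a simplification so that results compare by value.

```scala
// Plain-collections sketch of one counting step; the diff does this on RDDs
// with flatMap/reduceByKey. All names here are hypothetical.
object PrefixExtensionSketch {
  /**
   * For each (prefix, item), counts how many suffixes contain the item and
   * keeps the extended patterns whose count is at least `minCount`.
   */
  def frequentExtensions(
      minCount: Long,
      prefixSuffixPairs: Seq[(Seq[Int], Seq[Int])]): Map[Seq[Int], Long] =
    prefixSuffixPairs
      .flatMap { case (prefix, suffix) =>
        // `distinct` so an item counts once per sequence, however often it repeats.
        suffix.distinct.map(item => (prefix, item))
      }
      .groupBy(identity)
      .map { case ((prefix, item), hits) => (prefix :+ item, hits.size.toLong) }
      .filter { case (_, count) => count >= minCount }
}
```

With the pairs `(Seq(1), Seq(2, 3, 2))`, `(Seq(1), Seq(2))` and `(Seq(4), Seq(3))` and a `minCount` of 2, only the pattern `Seq(1, 2)` survives, because item 2 follows prefix 1 in two sequences.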
[GitHub] spark pull request: [SPARK-8103][core] DAGScheduler should not sub...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/6750#issuecomment-121805382 @kayousterhout I don't think that will work, but maybe I'm not seeing it. I think the problem is, you still need some way to get a handle on the zombie TaskSetManager to be able to call allTasksInTaskSetFinished(). Right now, taskSetFinished is ultimately getting a handle on that TaskSetManager by looking it up in activeTaskSets [in `statusUpdate()`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L328). So if those zombie task sets aren't in activeTaskSets, it seems like you'd still need to keep track of them *somewhere* in TaskSchedulerImpl. I feel like part of the problem is that "active task sets" is somewhat vague. You might not expect it to contain task sets that have already failed (from a fetch failure), but still happen to have tasks running. I guess "zombie" is vague too, but in a way that is better since you aren't tricked into thinking you know what it means.
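The bookkeeping concern raised in this comment can be modeled in a few lines. The sketch below is a toy, not `TaskSchedulerImpl`: `TaskSet`, `Registry`, `markZombie` and `taskFinished` are hypothetical names, and the rule that a set is dropped once it is a zombie with no running tasks is an assumption chosen to illustrate why zombie sets must stay reachable until their last task reports back.

```scala
import scala.collection.mutable

// Hypothetical model of the bookkeeping under discussion; not TaskSchedulerImpl.
object ZombieTaskSetSketch {
  final class TaskSet(val id: String) {
    var isZombie = false   // when true, no new tasks will be launched
    var runningTasks = 0   // tasks that are still executing
  }

  class Registry {
    // Holds every task set that may still receive status updates, zombies included.
    private val tracked = mutable.HashMap.empty[String, TaskSet]

    def submit(ts: TaskSet): Unit = { tracked(ts.id) = ts }

    def markZombie(id: String): Unit =
      tracked.get(id).foreach { ts => ts.isZombie = true }

    /** Handles one task completion; drops the set once it is a zombie with nothing running. */
    def taskFinished(id: String): Unit = tracked.get(id).foreach { ts =>
      ts.runningTasks -= 1
      if (ts.isZombie && ts.runningTasks == 0) tracked.remove(id)
    }

    def isTracked(id: String): Boolean = tracked.contains(id)
  }
}
```

If `markZombie` removed the entry immediately, the later `taskFinished` calls would find nothing to update, which is the lookup problem the comment describes.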
[GitHub] spark pull request: [SPARK-6284][MESOS] Add mesos role, principal ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4960#issuecomment-121811703

[Test build #37431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37431/console) for PR 4960 at commit [`0f9f03e`](https://github.com/apache/spark/commit/0f9f03e2ccd822aaa8939b8f7d5828e72ba88f11).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9030][STREAMING][WIP] Add Kinesis.creat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7413#issuecomment-121811695 Merged build triggered.
[GitHub] spark pull request: [SPARK-8682][SQL][WIP] Range Join
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7379#issuecomment-121811701 Merged build triggered.
[GitHub] spark pull request: [SPARK-9030][STREAMING][WIP] Add Kinesis.creat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7413#issuecomment-121811827 [Test build #37447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37447/consoleFull) for PR 7413 at commit [`dbb33a5`](https://github.com/apache/spark/commit/dbb33a5abd87828573b569e7002c18f4313e4c5d).
[GitHub] spark pull request: [SPARK-6284][MESOS] Add mesos role, principal ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4960#issuecomment-121811758 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8882][SPARK-5681][Streaming]Add a new R...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/7276#discussion_r34752114
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverSchedulingPolicy.scala ---
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.scheduler
+
+import scala.collection.mutable
+import scala.util.Random
+
+import org.apache.spark.streaming.scheduler.ReceiverState._
+
+private[streaming] case class ReceiverTrackingInfo(
--- End diff --
Please provide Scaladoc for this class.
[GitHub] spark pull request: [SPARK-9030][STREAMING][WIP] Add Kinesis.creat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7413#issuecomment-121811707 Merged build started.
[GitHub] spark pull request: [SPARK-8682][SQL][WIP] Range Join
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7379#issuecomment-121811713 Merged build started.
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/7417#discussion_r34749244
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/CartesianProduct.scala ---
@@ -34,7 +34,15 @@ case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends BinaryNod
     val leftResults = left.execute().map(_.copy())
     val rightResults = right.execute().map(_.copy())
-    leftResults.cartesian(rightResults).mapPartitions { iter =>
+    val cartesianRdd = if (leftResults.partitions.size > rightResults.partitions.size) {
+      rightResults.cartesian(leftResults).mapPartitions { iter =>
+        iter.map(tuple => (tuple._2, tuple._1))
+      }
+    } else {
+      leftResults.cartesian(rightResults)
+    }
+
+    cartesianRdd.mapPartitions { iter =>
       val joinedRow = new JoinedRow
--- End diff --
Yes, using the number of partitions here is not accurate. Consider an RDD with 100 partitions of one record each, and an RDD with 10 partitions of 100 million records each: the method above will cause more scans from HDFS.
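The objection above can be made concrete with a small cost model. The following is a plain-Scala sketch, not Spark code. It assumes a nested-loop cartesian product in which every outer record re-opens each inner partition, and it assumes the comparison in the diff is `>`. The names `RddShape`, `innerScans`, and `chooseByPartitionCount` are hypothetical and exist only for this illustration.

```scala
object CartesianCostSketch {
  // Hypothetical summary of an RDD: how many partitions, how many records in each.
  final case class RddShape(numPartitions: Long, recordsPerPartition: Long) {
    def totalRecords: Long = numPartitions * recordsPerPartition
  }

  // Assumed nested-loop model: each outer record triggers one scan of every inner partition.
  def innerScans(outer: RddShape, inner: RddShape): Long =
    outer.totalRecords * inner.numPartitions

  // Heuristic as read from the diff: the side with fewer partitions becomes the outer side.
  def chooseByPartitionCount(left: RddShape, right: RddShape): (RddShape, RddShape) =
    if (left.numPartitions > right.numPartitions) (right, left) else (left, right)

  def main(args: Array[String]): Unit = {
    val manySmall = RddShape(numPartitions = 100L, recordsPerPartition = 1L)
    val fewLarge = RddShape(numPartitions = 10L, recordsPerPartition = 100000000L)
    val (outer, inner) = chooseByPartitionCount(manySmall, fewLarge)
    println(s"partition-count heuristic: ${innerScans(outer, inner)} inner partition scans")
    println(s"opposite order:            ${innerScans(inner, outer)} inner partition scans")
  }
}
```

Under these assumptions the partition-count heuristic picks the order with far more inner scans for this input, which is the case the comment describes. A record-count estimate would choose differently, but whether such an estimate is cheaply available is outside what this sketch shows.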
[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7434#issuecomment-121799959 Merged build triggered.
[GitHub] spark pull request: [SPARK-9026] Refactor SimpleFutureAction.onCom...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/7385#discussion_r34749845
--- Diff: core/src/test/scala/org/apache/spark/FutureActionSuite.scala ---
@@ -49,4 +50,20 @@ class FutureActionSuite
     job.jobIds.size should be (2)
   }
+  test("simple async action callbacks should not tie up execution context threads (SPARK-9026)") {
+    val rdd = sc.parallelize(1 to 10, 2).map(_ => Thread.sleep(1000 * 1000))
+    val pool = ThreadUtils.newDaemonCachedThreadPool("SimpleFutureActionTest")
+    val executionContext = ExecutionContext.fromExecutorService(pool)
+    val job = rdd.countAsync()
+    try {
+      for (_ <- 1 to 10) {
+        job.onComplete(_ => ())(executionContext)
+        assert(pool.getLargestPoolSize < 10)
--- End diff --
This looks flaky. Even if the callbacks are non-blocking, there is NO guarantee that any of the 10 scheduled functions `_ => ()` will finish by the end of this loop. So it may happen that in the 10th iteration the previous 9 scheduled functions are still not finished, the 10th one gets scheduled, and therefore the pool size = 10.
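The worst case described above can be reproduced without Spark. The sketch below is an illustration under stated assumptions, not the test from the PR: it uses a plain `java.util.concurrent` cached thread pool and forces every submitted task to stay unfinished by blocking it on a latch, which stands in for callbacks that simply have not completed yet. `largestPoolSizeWhileTasksPending` is a hypothetical helper name.

```scala
import java.util.concurrent.{CountDownLatch, Executors, ThreadPoolExecutor, TimeUnit}

object LargestPoolSizeSketch {
  // Submits `n` tasks that cannot finish until released, then reports the
  // largest number of threads the pool has ever held at the same time.
  def largestPoolSizeWhileTasksPending(n: Int): Int = {
    val pool = Executors.newCachedThreadPool().asInstanceOf[ThreadPoolExecutor]
    val release = new CountDownLatch(1)
    try {
      for (_ <- 1 to n) {
        pool.execute(new Runnable {
          override def run(): Unit = release.await()
        })
      }
      // No earlier task has finished, so the cached pool created one thread per task.
      pool.getLargestPoolSize
    } finally {
      release.countDown()
      pool.shutdown()
      pool.awaitTermination(10, TimeUnit.SECONDS)
    }
  }
}
```

When none of the earlier tasks has finished, a cached pool has no idle thread to reuse and starts a new one per submission, so an assertion of the form `getLargestPoolSize < n` depends on timing rather than on whether the callbacks block.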
[GitHub] spark pull request: [SPARK-8245][SQL] FormatNumber/Length Support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7034#issuecomment-121809587 Merged build triggered.
[GitHub] spark pull request: [SPARK-8245][SQL] FormatNumber/Length Support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7034#issuecomment-121809618 Merged build started.
[GitHub] spark pull request: [SPARK-8882][SPARK-5681][Streaming]Add a new R...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/7276#discussion_r34752381
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverSchedulingPolicy.scala ---
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.scheduler
+
+import scala.collection.mutable
+import scala.util.Random
+
+import org.apache.spark.streaming.scheduler.ReceiverState._
+
+private[streaming] case class ReceiverTrackingInfo(
+    receiverId: Int,
+    state: ReceiverState,
+    scheduledLocations: Option[Seq[String]],
+    runningLocation: Option[String])
+
+private[streaming] trait ReceiverSchedulingPolicy {
+
+  /**
+   * Return a list of candidate executors to run the receiver. If the list is empty, the caller can
+   * run this receiver in arbitrary executor.
+   */
+  def scheduleReceiver(
+      receiverId: Int,
+      preferredLocation: Option[String],
+      receiverTrackingInfoMap: Map[Int, ReceiverTrackingInfo],
+      executors: Seq[String]): Seq[String]
+}
+
+/**
+ * A ReceiverScheduler trying to balance executors' load. Here is the approach to schedule executors
+ * for a receiver.
+ * <ol>
+ *   <li>
+ *     If preferredLocation is set, preferredLocation should be one of the candidate executors.
+ *   </li>
+ *   <li>
+ *     Every executor will be assigned to a weight according to the receivers running or scheduling
+ *     on it.
+ *     <ul>
+ *       <li>
+ *         If a receiver is running on an executor, it contributes 1.0 to the executor's weight.
+ *       </li>
+ *       <li>
+ *         If a receiver is scheduled to an executor but has not yet run, it contributes
+ *         `1.0 / #candidate_executors_of_this_receiver` to the executor's weight.</li>
+ *     </ul>
+ *     At last, if there are more than 3 idle executors (weight = 0), returns all idle executors.
+ *     Otherwise, we only return 3 best options according to the weights.
+ *   </li>
+ * </ol>
+ *
+ */
+private[streaming] class LoadBalanceReceiverSchedulingPolicyImpl extends ReceiverSchedulingPolicy {
+
+  def scheduleReceiver(
+      receiverId: Int,
+      preferredLocation: Option[String],
+      receiverTrackingInfoMap: Map[Int, ReceiverTrackingInfo],
+      executors: Seq[String]): Seq[String] = {
+    if (executors.isEmpty) {
+      return Seq.empty
+    }
+
+    // Always try to schedule to the preferred locations
+    val locations = mutable.Set[String]()
+    locations ++= preferredLocation
+
+    val executorWeights = receiverTrackingInfoMap.filter { case (id, _) =>
+      // Ignore the receiver to be scheduled. It may be still running.
+      id != receiverId
+    }.values.flatMap { receiverTrackingInfo =>
+      receiverTrackingInfo.state match {
+        case ReceiverState.INACTIVE => Nil
+        case ReceiverState.SCHEDULED =>
+          val scheduledLocations = receiverTrackingInfo.scheduledLocations.get
+          // The probability that a scheduled receiver will run in an executor is
+          // 1.0 / scheduledLocations.size
+          scheduledLocations.map(location => location -> 1.0 / scheduledLocations.size)
--- End diff --
Put `1.0 / scheduledLocations.size` in parentheses; it becomes easier to read.
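The weighting rule in the quoted Scaladoc (1.0 for a running receiver, `1.0 / #candidates` for a scheduled one) can be sketched in isolation. This is a simplified stand-alone model, not the code under review: the `State` hierarchy, `Info`, and `executorWeights` are hypothetical stand-ins for `ReceiverState`, `ReceiverTrackingInfo`, and the weight computation in `scheduleReceiver`.

```scala
object ReceiverWeightSketch {
  // Hypothetical stand-ins for the receiver states used in the diff.
  sealed trait State
  case object Inactive extends State
  case object Scheduled extends State
  case object Active extends State

  final case class Info(
      state: State,
      scheduledLocations: Seq[String],
      runningLocation: Option[String])

  // Sums, per executor, the contribution of every tracked receiver.
  def executorWeights(infos: Seq[Info]): Map[String, Double] = {
    val contributions: Seq[(String, Double)] = infos.flatMap { info =>
      info.state match {
        case Inactive => Nil
        case Scheduled =>
          // A scheduled receiver spreads a total weight of 1.0 over its candidates.
          info.scheduledLocations.map(loc => loc -> (1.0 / info.scheduledLocations.size))
        case Active =>
          // A running receiver counts fully against the executor it runs on.
          info.runningLocation.map(loc => loc -> 1.0).toSeq
      }
    }
    contributions.groupBy(_._1).map { case (loc, ws) => loc -> ws.map(_._2).sum }
  }
}
```

In Scala, `/` binds more tightly than `->`, so `location -> 1.0 / n` already parses as `location -> (1.0 / n)`. The parentheses requested in the review therefore change readability only, not behavior.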
[GitHub] spark pull request: [SPARK-8882][SPARK-5681][Streaming]Add a new R...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/7276#discussion_r34752341
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverSchedulingPolicy.scala ---
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.scheduler
+
+import scala.collection.mutable
+import scala.util.Random
+
+import org.apache.spark.streaming.scheduler.ReceiverState._
+
+private[streaming] case class ReceiverTrackingInfo(
+    receiverId: Int,
+    state: ReceiverState,
+    scheduledLocations: Option[Seq[String]],
+    runningLocation: Option[String])
+
+private[streaming] trait ReceiverSchedulingPolicy {
+
+  /**
+   * Return a list of candidate executors to run the receiver. If the list is empty, the caller can
+   * run this receiver in arbitrary executor.
+   */
+  def scheduleReceiver(
+      receiverId: Int,
+      preferredLocation: Option[String],
+      receiverTrackingInfoMap: Map[Int, ReceiverTrackingInfo],
+      executors: Seq[String]): Seq[String]
+}
+
+/**
+ * A ReceiverScheduler trying to balance executors' load. Here is the approach to schedule executors
+ * for a receiver.
+ * <ol>
+ *   <li>
+ *     If preferredLocation is set, preferredLocation should be one of the candidate executors.
+ *   </li>
+ *   <li>
+ *     Every executor will be assigned to a weight according to the receivers running or scheduling
+ *     on it.
+ *     <ul>
+ *       <li>
+ *         If a receiver is running on an executor, it contributes 1.0 to the executor's weight.
+ *       </li>
+ *       <li>
+ *         If a receiver is scheduled to an executor but has not yet run, it contributes
+ *         `1.0 / #candidate_executors_of_this_receiver` to the executor's weight.</li>
+ *     </ul>
+ *     At last, if there are more than 3 idle executors (weight = 0), returns all idle executors.
+ *     Otherwise, we only return 3 best options according to the weights.
+ *   </li>
+ * </ol>
+ *
+ */
+private[streaming] class LoadBalanceReceiverSchedulingPolicyImpl extends ReceiverSchedulingPolicy {
+
+  def scheduleReceiver(
+      receiverId: Int,
+      preferredLocation: Option[String],
+      receiverTrackingInfoMap: Map[Int, ReceiverTrackingInfo],
+      executors: Seq[String]): Seq[String] = {
+    if (executors.isEmpty) {
+      return Seq.empty
+    }
+
+    // Always try to schedule to the preferred locations
+    val locations = mutable.Set[String]()
+    locations ++= preferredLocation
+
+    val executorWeights = receiverTrackingInfoMap.filter { case (id, _) =>
+      // Ignore the receiver to be scheduled. It may be still running.
--- End diff --
What does "It may be still running" mean? Can you elaborate on that case: when and how can it happen?
[GitHub] spark pull request: [SPARK-9058][SQL] Split projectionCode if it i...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/7418#discussion_r34752387
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala ---
@@ -45,10 +45,32 @@ object GenerateMutableProjection extends CodeGenerator[Seq[Expression], () => Mu
         else
           ${ctx.setColumn("mutableRow", e.dataType, i, evaluationCode.primitive)};
       """
-    }.mkString("\n")
+    }
+
+    val projectionCodeSegments = projectionCodes.grouped(50).toSeq.map(_.mkString("\n"))
+
+    val (projectionCode, projectionFuncs) = if (projectionCodeSegments.length == 1) {
+      (projectionCodeSegments(0), "")
+    } else {
+      val pCode = (0 until projectionCodeSegments.length).map { i =>
+        s"projectSeg$i(_i);"
+      }.mkString("\n")
+
+      val pFuncs = (0 until projectionCodeSegments.length).map { i =>
+        s"""
+          public void projectSeg$i(Object _i) {
--- End diff --
Since codegen aims to inline the execution, we had probably better not add overhead for type casting. Moreover, should we put every 50 (say) expressions into a single codegen function?
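The splitting idea in the diff, combined with the suggestion to avoid casting from `Object`, could look roughly like the sketch below. It is plain string generation under stated assumptions, not the PR's code: `split`, the segment size parameter, and the typed `InternalRow row` parameter in the emitted Java text are all hypothetical choices made for illustration.

```scala
object ProjectionSplitSketch {
  // Groups generated statements into segments and returns
  // (code for the main method body, helper method definitions).
  def split(snippets: Seq[String], segmentSize: Int): (String, String) = {
    val segments = snippets.grouped(segmentSize).toSeq.map(_.mkString("\n"))
    if (segments.length <= 1) {
      // Small projections stay inline; no helper methods are needed.
      (segments.headOption.getOrElse(""), "")
    } else {
      // One call per segment in the main body...
      val calls = segments.indices.map(i => s"projectSeg$i(row);").mkString("\n")
      // ...and one helper per segment, taking a typed row so no cast is required.
      val funcs = segments.zipWithIndex.map { case (code, i) =>
        s"private void projectSeg$i(InternalRow row) {\n$code\n}"
      }.mkString("\n")
      (calls, funcs)
    }
  }
}
```

For example, 120 statements with a segment size of 50 yield three helper methods (50, 50, and 20 statements) and three calls in the main body, while 50 or fewer statements are emitted inline unchanged.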
[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7434#issuecomment-121818364 [Test build #37440 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37440/console) for PR 7434 at commit [`3135a8b`](https://github.com/apache/spark/commit/3135a8b9edaf52aba17c9028f1672334e793456d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute`
  * `abstract class Star extends LeafExpression with NamedExpression`
  * `case class UnresolvedAlias(child: Expression) extends UnaryExpression with NamedExpression`
  * `case class SortOrder(child: Expression, direction: SortDirection) extends UnaryExpression`
  * `trait AggregateExpression extends Expression`
  * `trait PartialAggregate extends AggregateExpression`
  * `case class Min(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class Max(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class Count(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class Average(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class Sum(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class SumDistinct(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class First(child: Expression) extends UnaryExpression with PartialAggregate`
  * `case class Last(child: Expression) extends UnaryExpression with PartialAggregate`
  * `trait Generator extends Expression`
  * `case class Explode(child: Expression) extends UnaryExpression with Generator`
  * `trait NamedExpression extends Expression`
  * `abstract class Attribute extends LeafExpression with NamedExpression`
  * `case class PrettyAttribute(name: String) extends Attribute`
  * `abstract class LeafNode extends LogicalPlan`
  * `abstract class UnaryNode extends LogicalPlan`
[GitHub] spark pull request: [SPARK-8600] [ML] Naive Bayes API for spark.ml...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7284#issuecomment-121822432 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9018][MLLIB] add stopwatches
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7415#issuecomment-121822827 Merged into master.
[GitHub] spark pull request: [SPARK-8600] [ML] Naive Bayes API for spark.ml...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7284#issuecomment-121822255 [Test build #37450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37450/console) for PR 7284 at commit [`c3de687`](https://github.com/apache/spark/commit/c3de6874b6b7a73e652cb129d0bb18327594f32f).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class NaiveBayes(override val uid: String)`
[GitHub] spark pull request: [SPARK-9060] [SQL] Revert SPARK-8359, SPARK-88...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7426
[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7434#issuecomment-121823870 [Test build #37458 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37458/consoleFull) for PR 7434 at commit [`9e8a4de`](https://github.com/apache/spark/commit/9e8a4def6f02e03899fa2fafdd2841c513d280af).
[GitHub] spark pull request: [SPARK-8245][SQL] FormatNumber/Length Support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7034#issuecomment-121825914 [Test build #37443 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37443/console) for PR 7034 at commit [`e534b87`](https://github.com/apache/spark/commit/e534b87a125d264123216025d16d61da327f837d).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Length(child: Expression) extends UnaryExpression with ExpectsInputTypes`
  * `case class FormatNumber(x: Expression, d: Expression)`
[GitHub] spark pull request: [SPARK-8245][SQL] FormatNumber/Length Support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7034#issuecomment-121825948 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9026] Refactor SimpleFutureAction.onCom...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7385#issuecomment-121838283 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9026] Refactor SimpleFutureAction.onCom...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7385#issuecomment-121838243 **[Test build #37442 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37442/console)** for PR 7385 at commit [`c6fdc21`](https://github.com/apache/spark/commit/c6fdc2169f5bb8802b7b2d0019433de8bb0cae66) after a configured wait of `175m`.
[GitHub] spark pull request: [SPARK-8964] [SQL] [WIP] Use Exchange to perfo...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/7334#issuecomment-121838166 @JoshRosen, it seems `execution.CollectLimit` will eventually invoke code like the following (in `SparkPlan.executeTake`):
```scala
sc.runJob(childRDD, (it: Iterator[InternalRow]) => it.take(left).toArray, p, allowLocal = false)
```
I am wondering whether `execution.CollectLimit(limit, planLater(child))` and `execution.Limit(global = true, limit, execution.Limit(global = false, limit, child))` are actually equivalent in terms of data shuffling / copying. If so, we can probably simplify the code by removing `CollectLimit` and `ReturnAnswer`. Sorry if I missed something.
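The two-level limit mentioned above can be modelled on plain collections. The sketch below does not answer the shuffling question and is not Spark code. It only illustrates, with partitions modelled as in-memory sequences, that a per-partition limit followed by a global limit returns the same rows as a single global limit while bounding how many rows leave each partition. `localLimit`, `globalLimit`, and `rowsMoved` are hypothetical names.

```scala
object TwoPhaseLimitSketch {
  // Local limit: every partition keeps at most `limit` rows.
  def localLimit[A](partitions: Seq[Seq[A]], limit: Int): Seq[Seq[A]] =
    partitions.map(_.take(limit))

  // Global limit: gather the partitions in order and keep the first `limit` rows.
  def globalLimit[A](partitions: Seq[Seq[A]], limit: Int): Seq[A] =
    partitions.flatten.take(limit)

  // Number of rows that would have to be gathered from all partitions.
  def rowsMoved[A](partitions: Seq[Seq[A]]): Int =
    partitions.map(_.size).sum
}
```

With three partitions of 100 rows and a limit of 5, the local step reduces the rows gathered from 300 to 15, and the final result is the same five rows either way. An incremental collect such as the `runJob` call quoted above may scan fewer partitions still, which this model does not capture.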
[GitHub] spark pull request: [SPARK-1855] Local checkpointing
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-121840805 [Test build #1081 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1081/consoleFull) for PR 7279 at commit [`a92657d`](https://github.com/apache/spark/commit/a92657d815e7837a64d69546acc954a792ae1d1a).
[GitHub] spark pull request: [SPARK-1855] Local checkpointing
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-121840800 [Test build #1080 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1080/consoleFull) for PR 7279 at commit [`a92657d`](https://github.com/apache/spark/commit/a92657d815e7837a64d69546acc954a792ae1d1a).
[GitHub] spark pull request: [SPARK-1855] Local checkpointing
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-121840775 [Test build #1079 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1079/consoleFull) for PR 7279 at commit [`a92657d`](https://github.com/apache/spark/commit/a92657d815e7837a64d69546acc954a792ae1d1a).
[GitHub] spark pull request: [SPARK-4352][YARN][WIP] Incorporate locality p...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/6394#discussion_r34757572
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationClient.scala ---
@@ -24,11 +24,15 @@ package org.apache.spark
 private[spark] trait ExecutorAllocationClient {
   /**
-   * Express a preference to the cluster manager for a given total number of executors.
+   * Express a preference to the cluster manager for a given total number of executors,
+   * number of locality aware pending tasks and related locality preferences.
    * This can result in canceling pending requests or filing additional requests.
    * @return whether the request is acknowledged by the cluster manager.
    */
-  private[spark] def requestTotalExecutors(numExecutors: Int): Boolean
+  private[spark] def requestTotalExecutors(
+      numExecutors: Int,
+      localityAwarePendingTasks: Int,
+      preferredLocalityToCount: Map[String, Int]): Boolean
--- End diff --
Actually, I think the key string is the hostname, not the executor :).
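To illustrate what a hostname-keyed `preferredLocalityToCount` map might contain, here is a small stand-alone sketch. It is an assumption about how such arguments could be derived, not the PR's implementation: each pending task is modelled as the list of hostnames it prefers, and `summarize` is a hypothetical helper.

```scala
object LocalityPreferenceSketch {
  // Each pending task is represented by its preferred hostnames (empty = no preference).
  // Returns (number of locality-aware pending tasks, hostname -> number of tasks preferring it).
  def summarize(pendingTaskHosts: Seq[Seq[String]]): (Int, Map[String, Int]) = {
    val localityAwarePendingTasks = pendingTaskHosts.count(_.nonEmpty)
    val hostToCount = pendingTaskHosts.flatten
      .groupBy(identity)
      .map { case (host, hits) => host -> hits.size }
    (localityAwarePendingTasks, hostToCount)
  }
}
```

For three pending tasks preferring `host1` and `host2`, `host1` only, and nothing, this gives 2 locality-aware tasks and the map `host1 -> 2, host2 -> 1`, with keys that are hostnames rather than executor IDs.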
[GitHub] spark pull request: [SPARK-8160][SQL]Support using external sortin...
Github user lianhuiwang commented on the pull request: https://github.com/apache/spark/pull/6875#issuecomment-121798597 ok,@JoshRosen, thanks, i close this PR.
[GitHub] spark pull request: [SPARK-8280][SPARK-8281][SQL]Handle NaN, null ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/6835#issuecomment-121798590 Not at all. Feel free to close this one and submit a new one.
[GitHub] spark pull request: [SPARK-8366] When tasks failed and append new ...
Github user XuTingjun commented on the pull request: https://github.com/apache/spark/pull/6817#issuecomment-121798632 @andrewor14 , Sorry to bother you again. I think it's really a bug, wish you have a look again, thanks!
[GitHub] spark pull request: [SPARK-9030][STREAMING][WIP] Add Kinesis.creat...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/7413#issuecomment-121810908 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-9058][SQL] Split projectionCode if it i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7418#issuecomment-121810983 [Test build #37444 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37444/consoleFull) for PR 7418 at commit [`7435454`](https://github.com/apache/spark/commit/7435454ae5aef0819a3c7498e0b4f191a43cb752).
[GitHub] spark pull request: [SPARK-9030][STREAMING][WIP] Add Kinesis.creat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7413#issuecomment-121811072 Merged build triggered.
[GitHub] spark pull request: [SPARK-9030][STREAMING][WIP] Add Kinesis.creat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7413#issuecomment-121811087 Merged build started.
[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7412#discussion_r34752412
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
@@ -82,20 +84,70 @@ class PrefixSpan private (
       logWarning("Input data is not cached.")
     }
     val minCount = getMinCount(sequences)
-    val lengthOnePatternsAndCounts =
-      getFreqItemAndCounts(minCount, sequences).collect()
-    val prefixAndProjectedDatabase = getPrefixAndProjectedDatabase(
-      lengthOnePatternsAndCounts.map(_._1), sequences)
-    val groupedProjectedDatabase = prefixAndProjectedDatabase
-      .map(x => (x._1.toSeq, x._2))
-      .groupByKey()
-      .map(x => (x._1.toArray, x._2.toArray))
-    val nextPatterns = getPatternsInLocal(minCount, groupedProjectedDatabase)
-    val lengthOnePatternsAndCountsRdd =
-      sequences.sparkContext.parallelize(
-        lengthOnePatternsAndCounts.map(x => (Array(x._1), x._2)))
-    val allPatterns = lengthOnePatternsAndCountsRdd ++ nextPatterns
-    allPatterns
+    val lengthOnePatternsAndCounts = getFreqItemAndCounts(minCount, sequences)
+    val prefixSuffixPairs = getPrefixSuffixPairs(
+      lengthOnePatternsAndCounts.map(_._1).collect(), sequences)
+    var patternsCount: Long = lengthOnePatternsAndCounts.count()
+    var allPatternAndCounts = lengthOnePatternsAndCounts.map(x => (Array(x._1), x._2))
+    var currentPrefixSuffixPairs = prefixSuffixPairs
+    while (patternsCount <= minPatternsBeforeShuffle && currentPrefixSuffixPairs.count() != 0) {
+      val (nextPatternAndCounts, nextPrefixSuffixPairs) =
+        getPatternCountsAndPrefixSuffixPairs(minCount, currentPrefixSuffixPairs)
+      patternsCount = nextPatternAndCounts.count().toInt
+      currentPrefixSuffixPairs = nextPrefixSuffixPairs
+      allPatternAndCounts = allPatternAndCounts ++ nextPatternAndCounts
+    }
+    if (patternsCount > 0) {
+      val projectedDatabase = currentPrefixSuffixPairs
+        .map(x => (x._1.toSeq, x._2))
+        .groupByKey()
+        .map(x => (x._1.toArray, x._2.toArray))
+      val nextPatternAndCounts = getPatternsInLocal(minCount, projectedDatabase)
+      allPatternAndCounts = allPatternAndCounts ++ nextPatternAndCounts
+    }
+    allPatternAndCounts
+  }
+
+  /**
+   * Get the pattern and counts, and prefix suffix pairs
+   * @param minCount minimum count
+   * @param prefixSuffixPairs prefix and suffix pairs,
+   * @return pattern and counts, and prefix suffix pairs
+   * (Array[pattern, count], RDD[prefix, suffix ])
+   */
+  private def getPatternCountsAndPrefixSuffixPairs(
+      minCount: Long,
+      prefixSuffixPairs: RDD[(Array[Int], Array[Int])]):
+    (RDD[(Array[Int], Long)], RDD[(Array[Int], Array[Int])]) = {
+    val prefixAndFreqentItemAndCounts = prefixSuffixPairs
+      .flatMap { case (prefix, suffix) =>
+      suffix.distinct.map(y => ((prefix.toSeq, y), 1L))
+    }.reduceByKey(_ + _)
+      .filter(_._2 >= minCount)
+    val patternAndCounts = prefixAndFreqentItemAndCounts
+      .map{ case ((prefix, item), count) => (prefix.toArray :+ item, count) }
--- End diff --
OK
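The counting step in `getPatternCountsAndPrefixSuffixPairs` can be sketched on local collections to show what the `flatMap` / `reduceByKey` / `filter` chain computes. This is a minimal sketch, assuming plain `Seq`s in place of RDDs; `PrefixCountSketch` and `extendPatterns` are hypothetical names, not code from the pull request.

```scala
object PrefixCountSketch {
  // For every (prefix, suffix) pair, count each distinct suffix item once per pair,
  // then keep only the extended patterns whose count reaches minCount.
  def extendPatterns(
      minCount: Long,
      prefixSuffixPairs: Seq[(Seq[Int], Seq[Int])]): Map[Seq[Int], Long] = {
    prefixSuffixPairs
      .flatMap { case (prefix, suffix) => suffix.distinct.map(item => (prefix, item)) }
      .groupBy(identity) // local stand-in for reduceByKey(_ + _)
      .map { case ((prefix, item), hits) => (prefix :+ item, hits.size.toLong) }
      .filter { case (_, count) => count >= minCount }
  }
}
```

For example, with pairs `(Seq(1), Seq(2, 3, 2))` and `(Seq(1), Seq(2))` and a minimum count of 2, only the pattern `Seq(1, 2)` survives, because item 2 is counted once per pair thanks to `distinct`.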
[GitHub] spark pull request: [SPARK-7119][SQL]Give script a default serde w...
Github user zhichao-li commented on the pull request: https://github.com/apache/spark/pull/6638#issuecomment-121814588 retest this please.
[GitHub] spark pull request: [SPARK-8682][SQL][WIP] Range Join
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7379#issuecomment-121817147 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-121817328 [Test build #37454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37454/consoleFull) for PR 7353 at commit [`ca1ae69`](https://github.com/apache/spark/commit/ca1ae69c1baa7d4d14946bdd2638aec47e05be86).
[GitHub] spark pull request: [SPARK-8998][MLlib] Collect enough frequent pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7412#issuecomment-121816926 Build started.
[GitHub] spark pull request: [SPARK-8995][SQL] cast date strings like '2015...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7353#issuecomment-121816935 Merged build started.
[GitHub] spark pull request: [SPARK-9022] [SQL] Generated projections for U...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/7437
[SPARK-9022] [SQL] Generated projections for UnsafeRow
Added two projections: GenerateUnsafeProjection and FromUnsafeProjection, which could be used to convert UnsafeRow from/to GenericInternalRow. They will re-use the buffer during projection, similar to MutableProjection (without all the interface MutableProjection has). cc @rxin @JoshRosen
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/davies/spark unsafe_proj2
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/7437.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #7437
commit 5a2637347a2f96d55a17b4c866bccfc40b654ffc
Author: Davies Liu dav...@databricks.com
Date: 2015-07-16T03:30:19Z
    unsafe projections
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7381
[GitHub] spark pull request: [SPARK-8682][SQL][WIP] Range Join
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7379#issuecomment-121819633 Merged build triggered.
[GitHub] spark pull request: [SPARK-8600] [ML] Naive Bayes API for spark.ml...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7284#discussion_r34753646
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.mllib.classification.NaiveBayesSuite._
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.Row
+
+class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  def validatePrediction(predictionAndLabels: DataFrame): Unit = {
+    val numOfErrorPredictions = predictionAndLabels.collect().count {
+      case Row(prediction: Double, label: Double) =>
+        prediction != label
+    }
+    // At least 80% of the predictions should be on.
+    assert(numOfErrorPredictions < predictionAndLabels.count() / 5)
+  }
+
+  def validateModelFit(
+      piData: Vector,
+      thetaData: Matrix,
+      model: NaiveBayesModel): Unit = {
+    assert(Vectors.dense(model.pi.toArray.map(math.exp)) ~==
+      Vectors.dense(piData.toArray.map(math.exp)) absTol 0.05, "pi mismatch")
--- End diff --
waiting for #7357 to be merged and we can directly compare two mapped vector like this
```
assert(model.pi.map(math.exp) ~== piData.map(math.exp) absTol 0.05, "pi mismatch")
```
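The assertion under review compares exponentiated log-priors within an absolute tolerance. A minimal sketch of that kind of comparison on plain arrays follows; `ToleranceSketch` and `approxEqual` are hypothetical helpers written for illustration and are not the `~==` / `absTol` operators from Spark's `TestingUtils`, whose exact boundary behaviour may differ.

```scala
object ToleranceSketch {
  // True when both arrays have the same length and every pair of entries
  // differs by at most absTol.
  def approxEqual(a: Array[Double], b: Array[Double], absTol: Double): Boolean =
    a.length == b.length &&
      a.zip(b).forall { case (x, y) => math.abs(x - y) <= absTol }
}
```

A call such as `ToleranceSketch.approxEqual(logPi.map(math.exp), expectedPi, 0.05)` would mirror the shape of the test above, with `logPi` and `expectedPi` being caller-supplied arrays.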
[GitHub] spark pull request: [SPARK-8682][SQL][WIP] Range Join
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7379#issuecomment-121819664 Merged build started.
[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7434#issuecomment-121823525 Merged build started.
[GitHub] spark pull request: [SPARK-8968] [SQL] shuffled by the partition c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7336#issuecomment-121823524 Merged build started.
[GitHub] spark pull request: [SPARK-8968] [SQL] shuffled by the partition c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7336#issuecomment-121823620 [Test build #37459 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37459/consoleFull) for PR 7336 at commit [`b5ada0a`](https://github.com/apache/spark/commit/b5ada0ab4944661c8ab6bf030006d111657d13e6).
[GitHub] spark pull request: [SPARK-9022] [SQL] Generated projections for U...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/7437#discussion_r34754330
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java ---
@@ -19,11 +19,11 @@
 import java.io.IOException;

+import com.google.common.annotations.VisibleForTesting;
--- End diff --
import order
[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7434#issuecomment-121823508 Merged build triggered.
[GitHub] spark pull request: [SPARK-8968] [SQL] shuffled by the partition c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7336#issuecomment-121823512 Merged build triggered.
[GitHub] spark pull request: [SPARK-8792] [ML] Add Python API for PCA trans...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7190#issuecomment-121825596 [Test build #37460 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37460/console) for PR 7190 at commit [`8f4ac31`](https://github.com/apache/spark/commit/8f4ac31f8a772ea3016a9614e986cbc3c0bb4468).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class PCA(JavaEstimator, HasInputCol, HasOutputCol):`
   * `class PCAModel(JavaModel):`
[GitHub] spark pull request: [SPARK-9018][MLLIB] add stopwatches
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7415#discussion_r34755425
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/StopwatchSuite.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.util
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+
+class StopwatchSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  private def testStopwatchOnDriver(sw: Stopwatch): Unit = {
+    assert(sw.name === "sw")
+    assert(sw.elapsed() === 0L)
+    assert(!sw.isRunning)
+    intercept[AssertionError] {
+      sw.stop()
+    }
+    sw.start()
+    Thread.sleep(50)
+    val duration = sw.stop()
+    assert(duration >= 50 && duration < 100) // using a loose upper bound
+    val elapsed = sw.elapsed()
+    assert(elapsed === duration)
+    sw.start()
+    Thread.sleep(50)
+    val duration2 = sw.stop()
+    assert(duration2 >= 50 && duration2 < 100)
+    val elapsed2 = sw.elapsed()
+    assert(elapsed2 == duration + duration2)
--- End diff --
Should we no longer bother with this? Or is it just for Longs (in which case enforcing consistency may be easiest)?
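To make the behaviour exercised by the test above concrete, here is a minimal sketch of a stopwatch with the same `start` / `stop` / `elapsed` / `isRunning` surface. It is an assumption-laden illustration, not the `Stopwatch` class from the pull request: `SimpleStopwatch` is a hypothetical name, the injectable `now` clock is added here so the sketch can be checked without sleeping, and `elapsed()` only reports completed runs.

```scala
// Hypothetical single-threaded stopwatch; the clock is injectable for testing.
final class SimpleStopwatch(
    val name: String,
    now: () => Long = () => System.currentTimeMillis()) {

  private var running = false
  private var startTime = 0L
  private var total = 0L

  def isRunning: Boolean = running

  // Accumulated time over all completed start/stop cycles.
  def elapsed(): Long = total

  def start(): Unit = {
    assert(!running, "start() called on a running stopwatch")
    running = true
    startTime = now()
  }

  // Stops the watch and returns the duration of the run that just finished.
  def stop(): Long = {
    assert(running, "stop() called on a stopped stopwatch")
    val duration = now() - startTime
    total += duration
    running = false
    duration
  }
}
```

Scala's `assert` throws `java.lang.AssertionError`, which is consistent with the `intercept[AssertionError]` check in the suite when `stop()` is called before `start()`.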
[GitHub] spark pull request: [SPARK-8125] [SQL] Accelerates Parquet schema ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7396#issuecomment-121827555 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8245][SQL] FormatNumber/Length Support ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/7034#issuecomment-121827562 Thanks - merging this.
[GitHub] spark pull request: [SPARK-8125] [SQL] Accelerates Parquet schema ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7396#issuecomment-121827505 [Test build #37441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37441/console) for PR 7396 at commit [`f122f10`](https://github.com/apache/spark/commit/f122f1070fb08cd737a42f683f7f8d1bb7f4a4ad).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * ` case class FakeFileStatus(`
[GitHub] spark pull request: [SPARK-7119][SQL]Give script a default serde w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6638#issuecomment-121839582 [Test build #37462 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37462/consoleFull) for PR 6638 at commit [`4ab11b7`](https://github.com/apache/spark/commit/4ab11b7e5df106993682aef7d4bc7759827734b6).
[GitHub] spark pull request: [SPARK-8791][SQL] Improve the InternalRow.hash...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/7189#issuecomment-121839736 Thank you all for reviewing the code for me, but I think there would be a more general way to solve this, so I am closing it for now.
[GitHub] spark pull request: [SPARK-8791][SQL] Improve the InternalRow.hash...
Github user chenghao-intel closed the pull request at: https://github.com/apache/spark/pull/7189
[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/7434#discussion_r34757450
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
@@ -277,15 +276,21 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
 /**
  * A logical plan node with no children.
  */
-abstract class LeafNode extends LogicalPlan with trees.LeafNode[LogicalPlan] {
+abstract class LeafNode extends LogicalPlan {
   self: Product =>
+
+  override def children: Seq[LogicalPlan] = Nil
 }

 /**
  * A logical plan node with single child.
  */
-abstract class UnaryNode extends LogicalPlan with trees.UnaryNode[LogicalPlan] {
+abstract class UnaryNode extends LogicalPlan {
   self: Product =>
--- End diff --
remove `self: Product =>`?
[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/7434#discussion_r34757444
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
@@ -277,15 +276,21 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
 /**
  * A logical plan node with no children.
  */
-abstract class LeafNode extends LogicalPlan with trees.LeafNode[LogicalPlan] {
+abstract class LeafNode extends LogicalPlan {
   self: Product =>
--- End diff --
remove `self: Product =>`?
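For readers unfamiliar with the annotation under discussion, the following is a minimal sketch of what a Scala self-type such as `self: Product =>` does. The names `Plan`, `Leaf`, `Table`, and `fieldCount` are hypothetical stand-ins for illustration and are not Spark's actual classes; the sketch says nothing about whether the annotation is still needed in the pull request.

```scala
// Hypothetical stand-ins for illustration only; not Spark's LogicalPlan hierarchy.
abstract class Plan {
  def children: Seq[Plan]
}

// The self-type requires every concrete subclass to also be a Product.
// Case classes satisfy this automatically.
abstract class Leaf extends Plan { self: Product =>
  override def children: Seq[Plan] = Nil

  // Members of Product are usable here because of the self-type.
  def fieldCount: Int = self.productArity
}

case class Table(name: String) extends Leaf

object SelfTypeDemo {
  def main(args: Array[String]): Unit = {
    val t = Table("people")
    assert(t.children.isEmpty)
    assert(t.fieldCount == 1)
  }
}
```

A plain (non-case) class extending `Leaf` without mixing in `Product` would be rejected by the compiler, which is the constraint the annotation expresses.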
[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7434#issuecomment-121800272 [Test build #37439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37439/consoleFull) for PR 7434 at commit [`9c589cf`](https://github.com/apache/spark/commit/9c589cf216ff5eb46031ed332a35e6e23d91d2fe).
[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7434#issuecomment-121802635 Merged build triggered.
[GitHub] spark pull request: [SPARK-9085][SQL] Remove LeafNode, UnaryNode, ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7434#issuecomment-121802697 Merged build started.
[GitHub] spark pull request: [SPARK-9018][MLLIB] add stopwatches
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7415#discussion_r34752606
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/StopwatchSuite.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.util
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+
+class StopwatchSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  private def testStopwatchOnDriver(sw: Stopwatch): Unit = {
+    assert(sw.name === "sw")
+    assert(sw.elapsed() === 0L)
+    assert(!sw.isRunning)
+    intercept[AssertionError] {
+      sw.stop()
+    }
+    sw.start()
+    Thread.sleep(50)
+    val duration = sw.stop()
+    assert(duration >= 50 && duration < 100) // using a loose upper bound
+    val elapsed = sw.elapsed()
+    assert(elapsed === duration)
+    sw.start()
+    Thread.sleep(50)
+    val duration2 = sw.stop()
+    assert(duration2 >= 50 && duration2 < 100)
+    val elapsed2 = sw.elapsed()
+    assert(elapsed2 == duration + duration2)
--- End diff --
Actually, @ericl pointed out that `==` and `===` are equivalent in this case (Long). Both produce the same error message. I will update it to be consistent.
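The contract that the quoted test exercises (zero elapsed time before the first run, an `AssertionError` when `stop()` is called on a stopped watch, and elapsed time accumulating across runs) can be sketched as follows. `SimpleStopwatch` is a hypothetical class written only for illustration; it is not the `Stopwatch` implementation from the pull request, and the millisecond unit is an assumption.

```scala
// Hypothetical minimal stopwatch; NOT the implementation under review.
class SimpleStopwatch(val name: String) {
  private var running = false
  private var startTime = 0L // nanoTime at the most recent start()
  private var total = 0L     // accumulated milliseconds over all finished runs

  def isRunning: Boolean = running

  def start(): Unit = {
    assert(!running, "start() called while already running")
    running = true
    startTime = System.nanoTime()
  }

  // Stops the watch and returns the duration of the run that just ended,
  // in milliseconds (the unit is an assumption of this sketch).
  def stop(): Long = {
    assert(running, "stop() called while not running")
    val duration = (System.nanoTime() - startTime) / 1000000L
    running = false
    total += duration
    duration
  }

  // Total time over all finished runs, in milliseconds.
  def elapsed(): Long = total
}

object SimpleStopwatchDemo {
  def main(args: Array[String]): Unit = {
    val sw = new SimpleStopwatch("sw")
    sw.start()
    Thread.sleep(50)
    val duration = sw.stop()
    println(s"${sw.name}: last run $duration ms, total ${sw.elapsed()} ms")
  }
}
```

Because the sketch uses `Predef.assert`, the checks would be skipped if assertions were disabled at compile time.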
[GitHub] spark pull request: [SPARK-8682][SQL][WIP] Range Join
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7379#issuecomment-121815160 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8682][SQL][WIP] Range Join
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7379#issuecomment-121815152
[Test build #37448 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37448/console) for PR 7379 at commit [`b405e45`](https://github.com/apache/spark/commit/b405e45d931fb04b914858e75e3fa3cb07bc0394).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class BroadcastRangeJoin(`
[GitHub] spark pull request: [SPARK-8682][SQL][WIP] Range Join
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7379#issuecomment-121820543 [Test build #37456 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37456/consoleFull) for PR 7379 at commit [`8204eae`](https://github.com/apache/spark/commit/8204eaed1b9399f17415afc6ce178c845f29746f).
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/7417#discussion_r34754893
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/CartesianProduct.scala ---
@@ -34,7 +34,15 @@ case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends BinaryNod
     val leftResults = left.execute().map(_.copy())
     val rightResults = right.execute().map(_.copy())

-    leftResults.cartesian(rightResults).mapPartitions { iter =>
+    val cartesianRdd = if (leftResults.partitions.size rightResults.partitions.size) {
+      rightResults.cartesian(leftResults).mapPartitions { iter =>
+        iter.map(tuple => (tuple._2, tuple._1))
+      }
+    } else {
+      leftResults.cartesian(rightResults)
+    }
+
+    cartesianRdd.mapPartitions { iter =>
       val joinedRow = new JoinedRow
--- End diff --
@hvanhovell Yes, using sizeInBytes is better, but it also has a problem: if leftResults has only one record and that record is large, while rightResults has many records whose total size is small, then performance will be worse in that scenario. The best approach would be to check the total number of records in the partitions, but we cannot get that at the moment.
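The swap-and-flip idea in the diff above can be illustrated without Spark. The following is a minimal sketch using plain Scala collections in place of RDDs; `cartesian` and `swappedCartesian` are hypothetical helpers written for this illustration, and the sketch makes no claim about which operand order performs better.

```scala
object CartesianSwapDemo {
  // Plain-collection stand-in for RDD.cartesian (hypothetical helper).
  def cartesian[A, B](xs: Seq[A], ys: Seq[B]): Seq[(A, B)] =
    for (x <- xs; y <- ys) yield (x, y)

  // Iterate with the operands swapped, then flip each tuple back so that
  // callers still see (left, right) pairs.
  def swappedCartesian[A, B](xs: Seq[A], ys: Seq[B]): Seq[(A, B)] =
    cartesian(ys, xs).map(tuple => (tuple._2, tuple._1))

  def main(args: Array[String]): Unit = {
    val left = Seq(1, 2, 3)
    val right = Seq("a", "b")

    val direct = cartesian(left, right)
    val swapped = swappedCartesian(left, right)

    // Same pairs either way; only the iteration order differs.
    assert(direct.size == swapped.size)
    assert(direct.toSet == swapped.toSet)
  }
}
```

Since both orders produce the same set of pairs, the choice of which side to iterate first only affects cost, which is why the thread debates what statistic (partition count, sizeInBytes, or record count) should drive it.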
[GitHub] spark pull request: [SPARK-8792] [ML] Add Python API for PCA trans...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7190#issuecomment-121825685 Merged build finished. Test PASSed.