[GitHub] spark issue #13651: [SPARK-15776][SQL] Divide Expression inside Aggregation ...
Github user Sephiroth-Lin commented on the issue: https://github.com/apache/spark/pull/13651 LGTM thank you --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13678: [SPARK-15824][SQL] Execute WITH .... INSERT ... statemen...
Github user Sephiroth-Lin commented on the issue: https://github.com/apache/spark/pull/13678 LGTM
[GitHub] spark issue #13524: [SPARK-15776][SQL] Type coercion incorrect
Github user Sephiroth-Lin commented on the issue: https://github.com/apache/spark/pull/13524 @rxin Done. Please help review, thank you.
[GitHub] spark pull request #13561: [SPARK-15824][SQL] Run 'with ... insert ... selec...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/13561

[SPARK-15824][SQL] Run 'with ... insert ... select' failed when use spark thriftserver

## What changes were proposed in this pull request?

Dataset.collect calls withNewExecutionId, and InsertIntoHadoopFsRelationCommand also calls withNewExecutionId, so the SQL below causes IllegalArgumentException("spark.sql.execution.id is already set"):

```sql
create table src(k int, v int);
create table src_parquet(k int, v int);
with v as (select 1, 2) insert into table src_parquet from src;
```

## How was this patch tested?

Will add UT later.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Sephiroth-Lin/spark SPARK-15824
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13561.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13561

commit 0f66322b748a95bfa4a122c832680c260f8da843
Author: Sephiroth-Lin
Date: 2016-06-08T12:19:14Z
Run 'with ... insert ... select' failed when use spark thriftserver
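The failure described in the pull request above can be illustrated with a simplified model. This is only a sketch: `ExecutionIdSketch` and its thread-local field are hypothetical stand-ins and are not Spark's actual implementation. It assumes only what the description states, namely that a second call to withNewExecutionId inside an active one is rejected with the quoted message.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical, simplified model of an execution-id guard (not Spark's code).
object ExecutionIdSketch {
  // Holds the id of the execution currently running on this thread, if any.
  private val currentId = new ThreadLocal[Option[Long]] {
    override def initialValue(): Option[Long] = None
  }
  private val nextId = new AtomicLong(0)

  // Runs `body` under a fresh id; rejects nesting, as the PR description reports.
  def withNewExecutionId[T](body: => T): T = {
    if (currentId.get().isDefined) {
      throw new IllegalArgumentException("spark.sql.execution.id is already set")
    }
    currentId.set(Some(nextId.getAndIncrement()))
    try body finally currentId.set(None)
  }
}
```

In this model, a collect-style outer call that wraps an insert-style inner call fails in the inner call, while two calls made one after the other both succeed.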
[GitHub] spark pull request #13524: [SPARK-15776] Type coercion incorrect
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/13524

[SPARK-15776] Type coercion incorrect

## What changes were proposed in this pull request?

Update the type coercion order. For details, see https://issues.apache.org/jira/browse/SPARK-15776

## How was this patch tested?

Will add later.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Sephiroth-Lin/spark SPARK-15776
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13524.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13524

commit 10b906a5d9d04df515675578df6d624c55b9ea41
Author: Sephiroth-Lin
Date: 2016-06-06T14:42:42Z
Type coercion incorrect
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7417#issuecomment-150096989 @cloud-fan OK.
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin closed the pull request at: https://github.com/apache/spark/pull/7417
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/7417#discussion_r41956822

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -274,12 +275,30 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
   }

   object CartesianProduct extends Strategy {
+    def getSmallSide(left: LogicalPlan, right: LogicalPlan): BuildSide = {
+      if (right.statistics.sizeInBytes < left.statistics.sizeInBytes) {
+        joins.BuildRight
+      } else {
+        joins.BuildLeft
+      }
+    }
--- End diff --

OK, no problem.
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/7417#discussion_r41855351

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/CartesianProduct.scala ---
@@ -28,9 +28,17 @@ import org.apache.spark.sql.execution.metric.SQLMetrics
  * :: DeveloperApi ::
  */
 @DeveloperApi
-case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends BinaryNode {
+case class CartesianProduct(
+    left: SparkPlan,
+    right: SparkPlan,
+    buildSide: BuildSide) extends BinaryNode {
--- End diff --

@yhuai buildSide is only used to tell which side is smaller, so that we can decide whether the order needs to be swapped.
[GitHub] spark pull request: [SPARK-9596][SQL]treat hadoop classes as share...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/7931#discussion_r41853694

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala ---
@@ -124,6 +124,7 @@ private[hive] class IsolatedClientLoader(
     name.contains("slf4j") ||
     name.contains("log4j") ||
     name.startsWith("org.apache.spark.") ||
+    (name.startsWith("org.apache.hadoop.") && !name.startsWith("org.apache.hadoop.hive.")) ||
--- End diff --

The scope is too broad: if the Hadoop classes are not reloaded, there will be only one FileSystem.Cache, which causes [SPARK-11083](https://issues.apache.org/jira/browse/SPARK-11083)
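To make the scope of the added condition concrete, here is a minimal standalone sketch of a predicate with the same shape as the quoted hunk. The object and method names (`SharedClassSketch`, `isSharedClass`) are hypothetical, and only the conditions visible in the diff are reproduced; the real predicate in IsolatedClientLoader may contain further clauses that the hunk does not show.

```scala
// Hypothetical stand-in for the predicate touched by the diff above.
object SharedClassSketch {
  // True when a class name would be treated as shared under the quoted conditions.
  def isSharedClass(name: String): Boolean =
    name.contains("slf4j") ||
    name.contains("log4j") ||
    name.startsWith("org.apache.spark.") ||
    // The clause added by the patch: every Hadoop class except the Hive ones.
    (name.startsWith("org.apache.hadoop.") && !name.startsWith("org.apache.hadoop.hive."))
}
```

Under these conditions `org.apache.hadoop.fs.FileSystem` counts as shared, which is the case the comment objects to, while `org.apache.hadoop.hive.conf.HiveConf` does not.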
[GitHub] spark pull request: [SPARK-9522][SQL] SparkSubmit process can not ...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7853#issuecomment-141082236 @andrewor14 I have set stopped to private[spark], @liancheng @yhuai any thoughts?
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7417#issuecomment-138554397 @scwf done. @zsxwing updated code.
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/7417#discussion_r38504238

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/CartesianProduct.scala ---
@@ -27,16 +27,27 @@ import org.apache.spark.sql.execution.{BinaryNode, SparkPlan}
  * :: DeveloperApi ::
  */
 @DeveloperApi
-case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends BinaryNode {
+case class CartesianProduct(
+    left: SparkPlan,
+    right: SparkPlan,
+    buildSide: BuildSide) extends BinaryNode {
   override def output: Seq[Attribute] = left.output ++ right.output

+  private val (small, big) = buildSide match {
+    case BuildRight => (left, right)
+    case BuildLeft => (right, left)
+  }
+
   protected override def doExecute(): RDD[InternalRow] = {
-    val leftResults = left.execute().map(_.copy())
-    val rightResults = right.execute().map(_.copy())
+    val leftResults = small.execute().map(_.copy())
+    val rightResults = big.execute().map(_.copy())
--- End diff --

@davies Sorry, I am not very clear on this. Do you mean that we use zipPartition() to get two iterators and then compute the cartesian product of these two iterators ourselves, instead of calling cartesian()?
[GitHub] spark pull request: [SPARK-9519][Yarn] Confirm stop sc successfull...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7846#issuecomment-127815406 @vanzin @srowen Updated, thank you!
[GitHub] spark pull request: [SPARK-9519][Yarn] Confirm stop sc successfull...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7846#issuecomment-12629

Yes, this change doesn't stop this sequence from happening. As the monitor thread is a daemon thread, we don't need to call interrupt during sc.stop(). I am not very clear about the two points below:
1. there's still a race condition
2. The thread can have a "stop" method that interrupts it only if it's blocked in monitorApplication

Thank you!
[GitHub] spark pull request: [SPARK-9522][SQL] SparkSubmit process can not ...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/7853

[SPARK-9522][SQL] SparkSubmit process can not exit if kill application when HiveThriftServer was starting

When HiveThriftServer starts, it starts SparkContext first and then HiveServer2. If the application is killed while HiveServer2 is starting, SparkContext stops successfully, but the SparkSubmit process cannot exit.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Sephiroth-Lin/spark SPARK-9522
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7853.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #7853

commit a48482c803a74b2e51a0257b8f2185ff9136559c
Author: linweizhong
Date: 2015-08-01T08:26:12Z
SparkSubmit process can not exit if kill application when HiveThriftServer was starting
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7417#issuecomment-126880682 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-9519][Yarn] Confirm stop sc successfull...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7846#issuecomment-126879883 @srowen We need to call interrupt in YarnClientSchedulerBackend.stop() (for details, see PR #5305 and PR #3143), so even if we call sc.stop() in the finally block of the monitor thread, it still cannot stop successfully.
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7417#issuecomment-126856094 @hvanhovell Good suggestion, thank you, updated.
[GitHub] spark pull request: [SPARK-9519][Yarn] Confirm stop sc successfull...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/7846

[SPARK-9519][Yarn] Confirm stop sc successfully when application was killed

Currently, when an application is killed on Yarn, sc.stop() is called from the Yarn application state monitor thread, and YarnClientSchedulerBackend.stop() then calls interrupt. This causes SparkContext not to stop fully, as stopping waits for the executors to exit.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Sephiroth-Lin/spark SPARK-9519
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7846.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #7846

commit 243d2c79b7587e33bf32d8df3c5adcbe6fa9b251
Author: linweizhong
Date: 2015-08-01T03:05:21Z
Confirm stop sc successfully when application was killed
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7417#issuecomment-123925858

@hvanhovell I tested with TPC-DS, using the SQL statement below:

```
with single_value as (
  select 1 tpcds_val from date_dim
)
select sum(ss_quantity * ss_sales_price) ssales, tpcds_val
from store_sales, single_value
group by tpcds_val
```

With this patch the query ran in 1h55min; without this patch, running half of the tasks took 16.7h.
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/7417#discussion_r35180395

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastCartesianProduct.scala ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.joins
+
+import scala.concurrent._
+import scala.concurrent.duration._
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Attribute, JoinedRow}
+import org.apache.spark.sql.execution.{BinaryNode, SparkPlan}
+import org.apache.spark.util.ThreadUtils
+
+/**
+ * :: DeveloperApi ::
+ */
+@DeveloperApi
+case class BroadcastCartesianProduct(
--- End diff --

BroadcastNestedLoopJoin is only used for outer joins, right? But this is used for cartesian products.
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/7417#discussion_r34754893

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/CartesianProduct.scala ---
@@ -34,7 +34,15 @@ case class CartesianProduct(left: SparkPlan, right: SparkPlan) extends BinaryNod
     val leftResults = left.execute().map(_.copy())
     val rightResults = right.execute().map(_.copy())

-    leftResults.cartesian(rightResults).mapPartitions { iter =>
+    val cartesianRdd = if (leftResults.partitions.size > rightResults.partitions.size) {
+      rightResults.cartesian(leftResults).mapPartitions { iter =>
+        iter.map(tuple => (tuple._2, tuple._1))
+      }
+    } else {
+      leftResults.cartesian(rightResults)
+    }
+
+    cartesianRdd.mapPartitions { iter =>
       val joinedRow = new JoinedRow
--- End diff --

@hvanhovell Yes, using sizeInBytes is better, but it also has a problem: if leftResults has only 1 record and that record is large, while rightResults has many records whose total size is small, then performance gets worse in this scenario. The best way would be to check the total number of records per partition, but we cannot get that at the moment.
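The reordering in the diff above relies on the fact that swapping the operands of a cartesian product and then flipping each tuple yields the same pairs. A minimal sketch with plain Scala collections follows; `CartesianOrderSketch` and its methods are hypothetical, `Seq` stands in for RDDs, and element count stands in for the partition count compared in the patch, so this illustrates the idea and is not the patch itself.

```scala
// Hypothetical plain-collection illustration of the swap-and-flip idea.
object CartesianOrderSketch {
  // Stand-in for RDD.cartesian: the first argument drives the outer loop.
  def cartesian[A, B](xs: Seq[A], ys: Seq[B]): Seq[(A, B)] =
    for (x <- xs; y <- ys) yield (x, y)

  // Puts the smaller input on the outer loop, then restores (left, right) order.
  def cartesianSmallOuter[A, B](left: Seq[A], right: Seq[B]): Seq[(A, B)] =
    if (left.size > right.size) {
      cartesian(right, left).map { case (r, l) => (l, r) }
    } else {
      cartesian(left, right)
    }
}
```

Both paths produce the same set of `(left, right)` pairs; only the iteration order differs, which is where the performance difference discussed in the thread would come from.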
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7417#issuecomment-121588200 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-9066][SQL] Improve cartesian performanc...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/7417

[SPARK-9066][SQL] Improve cartesian performance

See JIRA: https://issues.apache.org/jira/browse/SPARK-9066

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Sephiroth-Lin/spark SPARK-9066
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7417.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #7417

commit 0a620989e1e857ba9c84389493dc5f45a29450f6
Author: linweizhong
Date: 2015-07-15T09:17:01Z
Optimize cartesian order
[GitHub] spark pull request: [SPARK-8811][SQL] Read array struct data from ...
Github user Sephiroth-Lin closed the pull request at: https://github.com/apache/spark/pull/7209
[GitHub] spark pull request: [SPARK-8811][SQL] Read array struct data from ...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7209#issuecomment-119504817 @liancheng OK, no problem. Thank you!
[GitHub] spark pull request: [SPARK-8811][SQL] Read array struct data from ...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7209#issuecomment-119064179 @liancheng I have updated the code, please help to review, thank you!
[GitHub] spark pull request: [SPARK-8811][SQL] Read array struct data from ...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/7209#issuecomment-118699916 @liancheng OK, good, thank you.
[GitHub] spark pull request: [SPARK-8811][SQL] Read array struct data from ...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/7209

[SPARK-8811][SQL] Read array struct data from parquet error

JIRA: https://issues.apache.org/jira/browse/SPARK-8811

For example: we have a table:
```
t1(c1 string, c2 string, arr_c1 array>, arr_c2 array>)

we save data in parquet.
for select * from t1, we know in parquet the fileSchema may be:

message hive_schema {
  optional binary c1;
  optional binary c2;
  optional group arr_c1 (LIST) {
    repeated group bag {
      optional group array_element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
  optional group arr_c2 (LIST) {
    repeated group bag {
      optional group array_element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
}

but the requestSchema is:

message root {
  optional binary c1;
  optional binary c2;
  optional group arr_c1 (LIST) {
    repeated group bag {
      optional group element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
  optional group arr_c2 (LIST) {
    repeated group bag {
      optional group element {
        optional binary IN_C1;
        optional binary IN_C2;
      }
    }
  }
}
```

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Sephiroth-Lin/spark SPARK-8811
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7209.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #7209

commit ecd25477abd6735514ab48549a4a937bf6d00f42
Author: linweizhong
Date: 2015-07-03T07:55:00Z
Change schema for array type from element to array_element
[GitHub] spark pull request: [SPARK-8162][BUILD] Run spark-shell cause Null...
Github user Sephiroth-Lin closed the pull request at: https://github.com/apache/spark/pull/6704
[GitHub] spark pull request: [SPARK-8162][BUILD] Run spark-shell cause Null...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/6704#issuecomment-110231290

Closing this for now, since PR #6711 can fix the NPE. If we find the root cause of why the `@VisibleForTesting` annotation causes an NPE in the shell, we can reopen it.
[GitHub] spark pull request: [SPARK-8162][BUILD] Run spark-shell cause Null...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/6704#issuecomment-109965178

@srowen I built Spark with the command **`mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Phive -Phive-thriftserver -Psparkr -DskipTests package`** and ran spark-shell with the command **`./bin/spark-shell --master yarn-client`**

* Maven: 3.x
* JDK: 1.8.0_40
* OS: SUSE 11 SP3
[GitHub] spark pull request: Run spark-shell cause NullPointerException
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/6704

Run spark-shell cause NullPointerException

See JIRA: https://issues.apache.org/jira/browse/SPARK-8162

JDK: 1.8.0_40
Hadoop: 2.7.0

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Sephiroth-Lin/spark SPARK-8162

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6704.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6704

commit 9e8a3fec918d6d11dada6f1bf2db94df8b668537
Author: linweizhong
Date: 2015-06-08T11:07:17Z

    Add com.google.common.annotations.VisibleForTesting to assembly jar as we need it
[GitHub] spark pull request: [SPARK-7705][Yarn] Cleanup of .sparkStaging di...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/6409#issuecomment-109820635

@srowen @vanzin This PR cleans up correctly. I just mean that without this PR, even if we add a check for the KILLED status in ApplicationMaster, it cannot clean up when the application is killed.
[GitHub] spark pull request: [SPARK-7705][Yarn] Cleanup of .sparkStaging di...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/6409#issuecomment-108165847

@vanzin I have tested again, and below is the final status when we use YARN to kill the application:

\ | YARN UI | Driver Log | AppMaster Log
--- | --- | --- | ---
yarn-client | KILLED | KILLED | FAILED
yarn-cluster | KILLED | | UNDEFINED
[GitHub] spark pull request: [SPARK-7705][Yarn] Cleanup of .sparkStaging di...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/6409#discussion_r31500453

    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -91,51 +91,54 @@ private[spark] class Client(
        * available in the alpha API.
        */
       def submitApplication(): ApplicationId = {
    -    var appId: ApplicationId = null
    +    // Before we submit current application, we cleanup staging director as some old appStagingDir
    +    // can not be deleted when those old jobs are failed or killed and so on, please see SPARK-7705
    +    // and SPARK-7503 for details.
    +    cleanupStagingDir()
    +
    +    // Setup the credentials before doing anything else, so we have don't have issues at any point.
    +    setupCredentials()
    +    yarnClient.init(yarnConf)
    +    yarnClient.start()
    +
    +    logInfo("Requesting a new application from cluster with %d NodeManagers"
    +      .format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))
    +
    +    // Get a new application from our RM
    +    val newApp = yarnClient.createApplication()
    +    val newAppResponse = newApp.getNewApplicationResponse()
    +    val appId = newAppResponse.getApplicationId()
    +
    +    // Verify whether the cluster has enough resources for our AM
    +    verifyClusterResources(newAppResponse)
    +
    +    // Set up the appropriate contexts to launch our AM
    +    val containerContext = createContainerLaunchContext(newAppResponse)
    +    val appContext = createApplicationSubmissionContext(newApp, containerContext)
    +
    +    // Finally, submit and monitor the application
    +    logInfo(s"Submitting application ${appId.getId} to ResourceManager")
    +    yarnClient.submitApplication(appContext)
    +    appId
    +  }
    +
    +  /**
    +   * Cleanup all subdirectory of SPARK_STAGING directory.
    +   */
    +  private def cleanupStagingDir(): Unit = {
    +    val stagingDirPath = new Path(SPARK_STAGING)
    --- End diff --

I'm so sorry, thank you for pointing out my mistake.
[GitHub] spark pull request: [SPARK-7705][Yarn] Cleanup of .sparkStaging di...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/6409#discussion_r31490416

    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -825,6 +813,9 @@ private[spark] class Client(
        * throw an appropriate SparkException.
        */
       def run(): Unit = {
    +    // Cleanup staging director as some appStagingDir can not be deleted when job is failed or
    +    // killed, please see SPARK-7705 for details.
    +    cleanupStagingDir()
    --- End diff --

1. It cleans up old application staging directories before submitting the current application.
2. Yes, calling it in run() does not work in yarn-client mode (yarn-cluster is OK), so it is now called in submitApplication().
[GitHub] spark pull request: [SPARK-7705][Yarn] Cleanup of .sparkStaging di...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/6409#issuecomment-107399469

Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-7705][Yarn] Cleanup of .sparkStaging di...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/6409#discussion_r31397611

    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
    @@ -849,6 +852,27 @@ private[spark] class Client(
           }
         }
       }
    +
    +  private def cleanupStagingDir(): Unit = {
    --- End diff --

Yes, we need to refactor, thank you!
[GitHub] spark pull request: [SPARK-7026] [SQL] fix left semi join with equ...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/5643#discussion_r31304130

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala ---
    @@ -32,36 +32,59 @@ case class BroadcastLeftSemiJoinHash(
         leftKeys: Seq[Expression],
         rightKeys: Seq[Expression],
         left: SparkPlan,
    -    right: SparkPlan) extends BinaryNode with HashJoin {
    +    right: SparkPlan,
    +    condition: Option[Expression]) extends BinaryNode with HashJoin {
       override val buildSide: BuildSide = BuildRight
       override def output: Seq[Attribute] = left.output
    +  @transient private lazy val boundCondition =
    +    newPredicate(condition.getOrElse(Literal(true)), left.output ++ right.output)
    +
       protected override def doExecute(): RDD[Row] = {
         val buildIter= buildPlan.execute().map(_.copy()).collect().toIterator
    -    val hashSet = new java.util.HashSet[Row]()
    -    var currentRow: Row = null
    -    // Create a Hash set of buildKeys
    -    while (buildIter.hasNext) {
    -      currentRow = buildIter.next()
    -      val rowKey = buildSideKeyGenerator(currentRow)
    -      if (!rowKey.anyNull) {
    -        val keyExists = hashSet.contains(rowKey)
    -        if (!keyExists) {
    -          hashSet.add(rowKey)
    +    condition match {
    +      case None =>
    +        val hashSet = new java.util.HashSet[Row]()
    +        var currentRow: Row = null
    +
    +        // Create a Hash set of buildKeys
    +        while (buildIter.hasNext) {
    +          currentRow = buildIter.next()
    +          val rowKey = buildSideKeyGenerator(currentRow)
    +          if (!rowKey.anyNull) {
    +            val keyExists = hashSet.contains(rowKey)
    +            if (!keyExists) {
    +              hashSet.add(rowKey)
    +            }
    +          }
             }
    -      }
    -    }
    -    val broadcastedRelation = sparkContext.broadcast(hashSet)
    +        val broadcastedRelation = sparkContext.broadcast(hashSet)
    -    streamedPlan.execute().mapPartitions { streamIter =>
    -      val joinKeys = streamSideKeyGenerator()
    -      streamIter.filter(current => {
    -        !joinKeys(current).anyNull && broadcastedRelation.value.contains(joinKeys.currentValue)
    -      })
    +        streamedPlan.execute().mapPartitions { streamIter =>
    +          val joinKeys = streamSideKeyGenerator()
    +          streamIter.filter(current => {
    +            !joinKeys(current).anyNull && broadcastedRelation.value.contains(joinKeys.currentValue)
    +          })
    +        }
    +      case _ =>
    +        val hashRelation = HashedRelation(buildIter, buildSideKeyGenerator)
    +        val broadcastedRelation = sparkContext.broadcast(hashRelation)
    +
    +        streamedPlan.execute().mapPartitions { streamIter =>
    +          val joinKeys = streamSideKeyGenerator()
    +          val joinedRow = new JoinedRow
    +
    +          streamIter.filter(current => {
    +            val rowBuffer = broadcastedRelation.value.get(joinKeys.currentValue)
    --- End diff --

We need to apply joinKeys to the current row first, before we read currentValue; otherwise we will get null for the first row.
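The review comment above is about ordering: the key generator caches its last result, so the cached value is only meaningful after the generator has been applied to the current row. Here is a minimal Python sketch of that pattern. `KeyProjection` and `semi_join` are hypothetical stand-ins written for illustration, not Spark classes.

```python
class KeyProjection:
    """Hypothetical stateful key generator: calling it computes the key for a
    row and caches the result in `current_value`."""

    def __init__(self, key_index):
        self.key_index = key_index
        self.current_value = None  # nothing has been projected yet

    def __call__(self, row):
        self.current_value = row[self.key_index]
        return self.current_value


def semi_join(stream_rows, build_keys):
    """Keep the stream rows whose key appears in `build_keys`."""
    join_keys = KeyProjection(0)
    kept = []
    for row in stream_rows:
        join_keys(row)  # apply the projection first ...
        if join_keys.current_value in build_keys:  # ... then read the cached key
            kept.append(row)
    return kept


fresh = KeyProjection(0)
print(fresh.current_value)  # None: reading before applying yields no key
print(semi_join([("a", 1), ("b", 2), ("c", 3)], {"a", "c"}))
# [('a', 1), ('c', 3)]
```

If the lookup read `current_value` without the preceding call, the first row would be compared against `None`, which mirrors the null key described in the comment.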
[GitHub] spark pull request: [SPARK-7705][Yarn] Cleanup of .sparkStaging di...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/6409#issuecomment-106286738

@tgravescs Yes, it would be better if YARN did it, but right now it does not, so as @vanzin said, maybe we can do it when launching. Thank you!
[GitHub] spark pull request: [SPARK-7705][Yarn] Cleanup of .sparkStaging di...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/6409#issuecomment-105715873

@tgravescs I have tested the following: max retries left at the default, yarn -kill used to kill the application once it starts running, SparkPi run with parameter 2.

yarn-cluster:
YARN UI: KILLED
AppMaster Log (code added to print the final status at ApplicationMaster.scala line 127): FAILED

yarn-client:
YARN UI: KILLED
AppMaster Log: UNDEFINED

@vanzin Yes, this may break application retries. We need to consider this more, and I will try.

@srowen @tgravescs @vanzin thank you.
[GitHub] spark pull request: [SPARK-7339][PySpark] PySpark shuffle spill me...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5887#issuecomment-105516789

ping
[GitHub] spark pull request: [SPARK-7705][Yarn] Cleanup of .sparkStaging di...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/6409

[SPARK-7705][Yarn] Cleanup of .sparkStaging directory fails if application is killed

As I have tested, if we cancel or kill the app, the final status may be undefined, killed or succeeded, so clean up the staging directory when the AppMaster exits with any final application status.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Sephiroth-Lin/spark SPARK-7705

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6409.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6409

commit 95595c30cce6708bd0470f66b79b3ed9d66a5d03
Author: linweizhong
Date: 2015-05-26T12:27:43Z

    Cleanup of .sparkStaging directory when AppMaster exit at any final application status
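The idea in this description, cleaning up regardless of which final status the application ends with, can be sketched in Python with a `finally` block. This is only an illustration under stated assumptions: `run_with_staging` and the status strings are hypothetical, the sketch ignores application retries (which the later discussion on this PR flags as a concern), and a process that is killed outright would never reach the `finally` block.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_staging(app, staging_root):
    """Run `app` with a private staging directory and remove that directory
    afterwards, whatever final status the app reports (hypothetical helper)."""
    staging_dir = Path(tempfile.mkdtemp(prefix="app_", dir=staging_root))
    status = "UNDEFINED"
    try:
        status = app(staging_dir)
    except Exception:
        status = "FAILED"
    finally:
        # Runs for every outcome, not only for SUCCEEDED.
        shutil.rmtree(staging_dir, ignore_errors=True)
    return status, staging_dir


root = tempfile.mkdtemp()
status, used_dir = run_with_staging(lambda d: "KILLED", root)
print(status, used_dir.exists())  # KILLED False
```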
[GitHub] spark pull request: [SPARK-7339][PySpark] PySpark shuffle spill me...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5887#issuecomment-102756602

@andrewor14 what's your opinion?
[GitHub] spark pull request: [SPARK-7339][PySpark] PySpark shuffle spill me...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5887#issuecomment-101951195

@davies what's your opinion now?
[GitHub] spark pull request: [SPARK-7595][SQL] Window will cause resolve fa...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/6114#issuecomment-10184

@scwf @yhuai Done, thank you!
[GitHub] spark pull request: [SPARK-7595][SQL] Window will cause resolve fa...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/6114

[SPARK-7595][SQL] Window will cause resolve failed with self join

For example:

table: src(key string, value string)

sql:

    with v1 as(select key, count(value) over (partition by key) cnt_val from src),
    v2 as(select v1.key, v1_lag.cnt_val from v1, v1 v1_lag where v1.key = v1_lag.key)
    select * from v2 limit 5;

Analysis then fails when resolving conflicting references in Join:

    'Limit 5
     'Project [*]
      'Subquery v2
       'Project ['v1.key,'v1_lag.cnt_val]
        'Filter ('v1.key = 'v1_lag.key)
         'Join Inner, None
          Subquery v1
           Project [key#95,cnt_val#94L]
            Window [key#95,value#96], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#96) WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             Project [key#95,value#96]
              MetastoreRelation default, src, None
          Subquery v1_lag
           Subquery v1
            Project [key#97,cnt_val#94L]
             Window [key#97,value#98], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#98) WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
              Project [key#97,value#98]
               MetastoreRelation default, src, None

    Conflicting attributes: cnt_val#94L

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Sephiroth-Lin/spark spark-7595

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6114.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6114

commit dfe9169c10360417e705516f87bfed29d7eef01d
Author: linweizhong
Date: 2015-05-13T06:56:16Z

    Handle windowExpression with self join
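In the plan above, the right side of the self join received fresh ids for `key` and `value` (#97, #98) but still carries `cnt_val#94L`, the same id as on the left side. The Python sketch below shows the general idea of detecting such a shared id and giving the right side a fresh one. It is not Spark's analyzer code: `dedup_right` is a hypothetical helper, attributes are modeled as `(name, id)` pairs, and the fresh id 99 is arbitrary.

```python
from itertools import count


def dedup_right(left, right, fresh_ids):
    """Given two lists of (name, expr_id) attributes, replace every id on the
    right that also occurs on the left with a fresh id.
    Returns the rewritten right side and the list of conflicts found."""
    left_ids = {expr_id for _, expr_id in left}
    conflicts = [(name, expr_id) for name, expr_id in right if expr_id in left_ids]
    remap = {expr_id: next(fresh_ids) for _, expr_id in conflicts}
    new_right = [(name, remap.get(expr_id, expr_id)) for name, expr_id in right]
    return new_right, conflicts


# Output attributes of the two join sides, taken from the plan in the PR text.
left = [("key", 95), ("cnt_val", 94)]
right = [("key", 97), ("cnt_val", 94)]

new_right, conflicts = dedup_right(left, right, count(99))
print(conflicts)  # [('cnt_val', 94)]
print(new_right)  # [('key', 97), ('cnt_val', 99)]
```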
[GitHub] spark pull request: [SPARK-7526][SparkR] Specify ip of RBackend, M...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/6053#issuecomment-101105615

@shivaram Yes, I also think there should be no problems, as it is not system dependent. I will test this on Windows, thank you!
[GitHub] spark pull request: [SPARK-7526][SparkR] Specify ip of RBackend, M...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/6053

[SPARK-7526][SparkR] Specify ip of RBackend, MonitorServer and RRDD Socket server

These R processes are only used to communicate with the JVM process on the local machine, so binding to localhost is more reasonable than binding to the wildcard IP.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Sephiroth-Lin/spark spark-7526

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6053.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6053

commit 5303af767b21ddeb4e57faeb5774f3ebc498733c
Author: linweizhong
Date: 2015-05-11T12:54:51Z

    bind to localhost rather than wildcard ip
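The difference between the two bind addresses can be shown with Python's standard `socket` module. This is a generic sketch of the technique, not the SparkR or JVM code touched by the PR; `open_local_server` is a hypothetical name.

```python
import socket


def open_local_server():
    """Open a TCP listener that only accepts connections from this machine."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # "127.0.0.1" restricts the listener to the loopback interface.
    # Binding to "" or "0.0.0.0" (the wildcard) would expose it on every interface.
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    return srv


srv = open_local_server()
host, port = srv.getsockname()
print(host)  # 127.0.0.1
srv.close()
```

For a helper process that only ever talks to a peer on the same host, the loopback bind keeps the port unreachable from the network.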
[GitHub] spark pull request: [Minor][PySpark] Set PYTHONPATH to python/lib/...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/6047

[Minor][PySpark] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark

Since PR #5580 we create pyspark.zip during the build and set PYTHONPATH to python/lib/pyspark.zip, so update this for consistency.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Sephiroth-Lin/spark pyspark_pythonpath

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6047.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6047

commit 8cc3d96da953292ae9a34917008ff0536cfb4381
Author: linweizhong
Date: 2015-05-11T02:35:34Z

    Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark as PR#5580 we have create pyspark.zip on build
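Pointing PYTHONPATH at a zip file works because Python can import modules directly from zip archives listed on `sys.path`. The sketch below demonstrates that mechanism with a throwaway archive; the archive name `demo_lib.zip`, the module name and the helper `import_from_zip` are all made up for illustration and have nothing to do with the real pyspark.zip contents.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def import_from_zip(module_name, source):
    """Write `source` as <module_name>.py into a fresh zip archive, put the
    archive on sys.path, and import the module from it."""
    zip_path = os.path.join(tempfile.mkdtemp(), "demo_lib.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr(f"{module_name}.py", source)
    # Entries from PYTHONPATH end up on sys.path at interpreter startup;
    # inserting the archive here has the same effect for this process.
    sys.path.insert(0, zip_path)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


mod = import_from_zip("demo_zip_mod", "ANSWER = 42\n")
print(mod.ANSWER)  # 42
```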
[GitHub] spark pull request: [SPARK-7339][PySpark] PySpark shuffle spill me...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/5887#discussion_r29729314

    --- Diff: python/pyspark/shuffle.py ---
    @@ -362,7 +362,9 @@ def _spill(self):
             self.spills += 1
             gc.collect()  # release the memory as much as possible
    -        MemoryBytesSpilled += (used_memory - get_used_memory()) << 20
    +        memorySpilled = used_memory - get_used_memory()
    --- End diff --

Updated, thank you!
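The removed line adds the difference of two memory readings, shifted from megabytes to bytes, without checking its sign. If the second reading is larger than the first, the counter goes down. The following is a minimal sketch of a guarded version, assuming both readings are integers in megabytes; `bytes_spilled` is a hypothetical helper, not the code that was merged.

```python
def bytes_spilled(used_before_mb, used_after_mb):
    """Memory released by a spill, in bytes. Never negative.

    Both arguments are assumed to be integer megabyte readings taken
    before and after the spill."""
    freed_mb = used_before_mb - used_after_mb
    # `<< 20` multiplies by 2**20, converting megabytes to bytes.
    return freed_mb << 20 if freed_mb > 0 else 0


print(bytes_spilled(10, 4))  # 6291456
print(bytes_spilled(4, 10))  # 0 (the unguarded expression would give -6291456)
```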
[GitHub] spark pull request: [SPARK-7339][PySpark] PySpark shuffle spill me...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5887#issuecomment-98901992

Jenkins retest this please.
[GitHub] spark pull request: [SPARK-7339][PySpark] PySpark shuffle spill me...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/5887

[SPARK-7339][PySpark] PySpark shuffle spill memory sometimes are not correct

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Sephiroth-Lin/spark spark-7339

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5887.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #5887

commit d41672b70c44003ff8c1ad8f3703f6da52c824a4
Author: linweizhong
Date: 2015-05-04T12:28:28Z

    Update MemoryBytesSpilled when memorySpilled > 0
[GitHub] spark pull request: [SPARK-6869][PySpark] Add pyspark archives pat...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5580#issuecomment-97346388

If a user does not use make-distribution.sh and just compiles Spark with Maven or sbt, there is no pyspark.zip. So do we really not need to create the zip in the code?
[GitHub] spark pull request: [SPARK-6869][PySpark] Add pyspark archives pat...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5478#issuecomment-96867560

@tgravescs yes
[GitHub] spark pull request: [SPARK-6869][PySpark] Add pyspark archives pat...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5478#issuecomment-96145643

@andrewor14 @sryza what are your opinions? Thanks. @lianhuiwang please help me review this, thanks.
[GitHub] spark pull request: [PySpark][Minor] Update sql example, so that c...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/5684

[PySpark][Minor] Update sql example, so that can read file correctly

When running Spark, files are read from HDFS by default if we don't set the scheme.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Sephiroth-Lin/spark pyspark_example_minor

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5684.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #5684

commit 19fe145e7a00574080b91d311376b6d2cdb4254e
Author: linweizhong
Date: 2015-04-24T09:16:23Z

    Update example sql.py, so that can read file correctly
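A path without a URI scheme is resolved against whatever default filesystem the cluster is configured with, which is why a local example file may not be found. One way to make the intent explicit is to add a `file://` prefix when no scheme is present. The sketch below uses only the Python standard library; `with_file_scheme` is a hypothetical helper, and it assumes POSIX-style absolute paths (a Windows path such as `C:\data` would be misread as having scheme `c`).

```python
from urllib.parse import urlparse


def with_file_scheme(path):
    """Prefix a scheme-less absolute path with file://; leave URIs that
    already carry a scheme (hdfs://, file://, ...) untouched."""
    if urlparse(path).scheme:
        return path
    return "file://" + path


print(with_file_scheme("/tmp/people.json"))           # file:///tmp/people.json
print(with_file_scheme("hdfs://nn:8020/people.json")) # hdfs://nn:8020/people.json
```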
[GitHub] spark pull request: [SPARK-5689][Doc] Document what can be run in ...
Github user Sephiroth-Lin closed the pull request at: https://github.com/apache/spark/pull/5490
[GitHub] spark pull request: [SPARK-6869][PySpark] Add pyspark archives pat...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5478#issuecomment-95102969 @andrewor14 Sorry, I have been busy these days; I have now updated the code. ^-^
[GitHub] spark pull request: [SPARK-6869][PySpark] Add pyspark archives pat...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5478#issuecomment-94331295 @lianhuiwang OK.
[GitHub] spark pull request: [SPARK-6604][PySpark]Specify ip of python serv...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5256#issuecomment-93915251 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-6869][PySpark] Add pyspark archives pat...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5478#issuecomment-93724717 @andrewor14 @sryza @WangTaoTheTonic I have tested again: if Spark is installed on each node, we can set spark.executorEnv.PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip to pass PYTHONPATH to the executors. So this PR is an alternative way to run PySpark on YARN when Spark is not installed on each node.
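As a rough illustration of the setting quoted in this comment, the PYTHONPATH value can be assembled from the Spark install location. This is only a sketch: `ExecutorPythonPathSketch`, `pythonPathFor`, and the `/opt/spark` fallback are hypothetical names chosen for the example, and the py4j archive name is simply the one from the comment.

```scala
object ExecutorPythonPathSketch {
  // Configuration key quoted in the comment above.
  val ConfKey = "spark.executorEnv.PYTHONPATH"

  // Hypothetical helper: joins the two entries with ':' exactly as in the comment.
  def pythonPathFor(sparkHome: String, py4jZip: String = "py4j-0.8.2.1-src.zip"): String =
    Seq(s"$sparkHome/python", s"$sparkHome/python/lib/$py4jZip").mkString(":")

  def main(args: Array[String]): Unit = {
    // "/opt/spark" is an assumed fallback, used only when SPARK_HOME is unset.
    val sparkHome = sys.env.getOrElse("SPARK_HOME", "/opt/spark")
    println(s"$ConfKey=${pythonPathFor(sparkHome)}")
  }
}
```

The printed line has the same `key=value` shape as the setting in the comment; it only helps if the same Spark path exists on every node.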
[GitHub] spark pull request: [SPARK-6869][PySpark] Add pyspark archives pat...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5478#issuecomment-93705830 @andrewor14 @sryza Done, thanks.
[GitHub] spark pull request: [SPARK-6604][PySpark]Specify ip of python serv...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5256#issuecomment-93650104 @srowen OK, thanks. Jenkins, test this please.
[GitHub] spark pull request: [SPARK-6869][PySpark] Pass PYTHONPATH to execu...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5478#issuecomment-93270239 @andrewor14 @sryza Yes, assuming that the Python files will already be present on the slave machines is not very reasonable. But otherwise, users who want to use PySpark must compile Spark with JDK 1.6, and I think most users now use JDK 1.7+. Maybe a good solution is to package PySpark in a separate jar that YARN automatically ships to all containers, and to add this jar to PYTHONPATH along with the assembly jar.
[GitHub] spark pull request: [SPARK-5689][Doc] Document what can be run in ...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/5490 [SPARK-5689][Doc] Document what can be run in different YARN modes You can merge this pull request into a Git repository by running: $ git pull https://github.com/Sephiroth-Lin/spark SPARK-5689 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5490.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5490 commit 97ba6a8f7a91433f74fe81e1107d649203621192 Author: linweizhong Date: 2015-04-13T12:37:07Z Document what can be run in different YARN modes
[GitHub] spark pull request: [SPARK-6870][Yarn] Catch InterruptedException ...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/5479#discussion_r28231958
--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -128,10 +128,14 @@ private[spark] class YarnClientSchedulerBackend(
     assert(client != null && appId != null, "Application has not been submitted yet!")
     val t = new Thread {
       override def run() {
-        val (state, _) = client.monitorApplication(appId, logApplicationReport = false)
-        logError(s"Yarn application has already exited with state $state!")
-        sc.stop()
-        Thread.currentThread().interrupt()
+        try {
+          val (state, _) = client.monitorApplication(appId, logApplicationReport = false)
--- End diff --
We interrupt the monitor thread when we call stop(), so we don't need to call sc.stop() again. We add sc.stop() after client.monitorApplication returns just to make sure the SparkContext is stopped when the app has finished, failed, or been killed before we stop the SparkContext ourselves.
[GitHub] spark pull request: [SPARK-6870][Yarn] Catch InterruptedException ...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/5479#discussion_r28210698
--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -128,10 +128,14 @@ private[spark] class YarnClientSchedulerBackend(
     assert(client != null && appId != null, "Application has not been submitted yet!")
     val t = new Thread {
       override def run() {
-        val (state, _) = client.monitorApplication(appId, logApplicationReport = false)
-        logError(s"Yarn application has already exited with state $state!")
-        sc.stop()
-        Thread.currentThread().interrupt()
+        try {
+          val (state, _) = client.monitorApplication(appId, logApplicationReport = false)
--- End diff --
Yes, we don't need to call Thread.currentThread().interrupt() here, but I think we need to stop the SparkContext. If the user kills the app on YARN, then we need to stop the SparkContext, right?
[GitHub] spark pull request: [SPARK-6870][Yarn] Catch InterruptedException ...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/5479 [SPARK-6870][Yarn] Catch InterruptedException when yarn application state monitor thread been interrupted In PR #5305 we interrupt the monitor thread but forgot to catch the InterruptedException, so the stack trace is printed in the log; we need to catch it. You can merge this pull request into a Git repository by running: $ git pull https://github.com/Sephiroth-Lin/spark SPARK-6870 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5479.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5479 commit 3513fdb943c41e5242ad187dccddad65a9870288 Author: linweizhong Date: 2015-04-12T08:13:20Z Catch InterruptedException commit 0d8958a28addb68c9263679e898c286cbfdc9eff Author: linweizhong Date: 2015-04-12T08:16:16Z Update
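The idea described in this pull request can be sketched outside Spark as follows. This is not the code from the PR: `MonitorThreadSketch`, `waitForExit`, `onExit`, and `wasInterrupted` are hypothetical names, and `waitForExit` merely stands in for a blocking call such as client.monitorApplication.

```scala
import java.util.concurrent.atomic.AtomicBoolean

object MonitorThreadSketch {
  /**
   * Starts a daemon thread that runs `waitForExit` (a hypothetical stand-in
   * for a blocking monitor call) and passes the final state to `onExit`.
   * If the thread is interrupted, for example from stop(), the
   * InterruptedException is caught and only recorded in `wasInterrupted`,
   * so no stack trace reaches the log.
   */
  def start(waitForExit: () => String,
            onExit: String => Unit,
            wasInterrupted: AtomicBoolean): Thread = {
    val t = new Thread {
      override def run(): Unit = {
        try {
          onExit(waitForExit())
        } catch {
          case _: InterruptedException => wasInterrupted.set(true)
        }
      }
    }
    t.setDaemon(true)
    t.start()
    t
  }
}
```

In this sketch `onExit` runs only when the blocking call returns normally, which mirrors the point made in the review comments: the exit handler is for an application that ended on its own, not for a deliberate stop.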
[GitHub] spark pull request: [SPARK-6869][PySpark] Pass PYTHONPATH to execu...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/5478 [SPARK-6869][PySpark] Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node From SPARK-1920 and SPARK-1520 we know that PySpark on YARN does not work when the assembly jar is packaged with JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executors so that the executor Python process can read the pyspark files from the local file system rather than from the assembly jar. You can merge this pull request into a Git repository by running: $ git pull https://github.com/Sephiroth-Lin/spark SPARK-6869 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5478.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5478 commit 413fa25dde845146153a58793ca6b3ec3a820ea8 Author: linweizhong Date: 2015-04-12T08:02:43Z Pass PYTHONPATH to executor
[GitHub] spark pull request: [SPARK-4346][SPARK-3596][YARN] Commonize the m...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/5305#discussion_r27939765
--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -127,23 +127,11 @@ private[spark] class YarnClientSchedulerBackend(
     assert(client != null && appId != null, "Application has not been submitted yet!")
     val t = new Thread {
       override def run() {
-        while (!stopping) {
-          var state: YarnApplicationState = null
-          try {
-            val report = client.getApplicationReport(appId)
-            state = report.getYarnApplicationState()
-          } catch {
-            case e: ApplicationNotFoundException =>
-              state = YarnApplicationState.KILLED
-          }
-          if (state == YarnApplicationState.FINISHED ||
-            state == YarnApplicationState.KILLED ||
-            state == YarnApplicationState.FAILED) {
-            logError(s"Yarn application has already exited with state $state!")
-            sc.stop()
-            stopping = true
-          }
-          Thread.sleep(1000L)
+        val (state, _) = client.monitorApplication(appId, logApplicationReport = false)
+        if (!stopping) {
--- End diff --
Right. We need to interrupt the thread in stop().
[GitHub] spark pull request: [SPARK-4346][SPARK-3596][YARN] Commonize the m...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/5305#discussion_r2372
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -559,50 +560,56 @@ private[spark] class Client(
     var lastState: YarnApplicationState = null
     while (true) {
       Thread.sleep(interval)
-      val report = getApplicationReport(appId)
-      val state = report.getYarnApplicationState
-
-      if (logApplicationReport) {
-        logInfo(s"Application report for $appId (state: $state)")
-        val details = Seq[(String, String)](
-          ("client token", getClientToken(report)),
-          ("diagnostics", report.getDiagnostics),
-          ("ApplicationMaster host", report.getHost),
-          ("ApplicationMaster RPC port", report.getRpcPort.toString),
-          ("queue", report.getQueue),
-          ("start time", report.getStartTime.toString),
-          ("final status", report.getFinalApplicationStatus.toString),
-          ("tracking URL", report.getTrackingUrl),
-          ("user", report.getUser)
-        )
-
-        // Use more loggable format if value is null or empty
-        val formattedDetails = details
-          .map { case (k, v) =>
-            val newValue = Option(v).filter(_.nonEmpty).getOrElse("N/A")
-            s"\n\t $k: $newValue" }
-          .mkString("")
-
-        // If DEBUG is enabled, log report details every iteration
-        // Otherwise, log them every time the application changes state
-        if (log.isDebugEnabled) {
-          logDebug(formattedDetails)
-        } else if (lastState != state) {
-          logInfo(formattedDetails)
+      try {
+        val report = getApplicationReport(appId)
--- End diff --
Done. Thank you!!!
[GitHub] spark pull request: [SPARK-3596][YARN]Support changing the yarn cl...
Github user Sephiroth-Lin closed the pull request at: https://github.com/apache/spark/pull/5292
[GitHub] spark pull request: [SPARK-3596][YARN]Support changing the yarn cl...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/5292#discussion_r27647902
--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -125,6 +125,7 @@ private[spark] class YarnClientSchedulerBackend(
    */
   private def asyncMonitorApplication(): Unit = {
     assert(client != null && appId != null, "Application has not been submitted yet!")
+    val interval = conf.getLong("spark.yarn.client.progress.pollinterval", 1000)
--- End diff --
OK
[GitHub] spark pull request: [SPARK-4346][SPARK-3596][YARN] Commonize the m...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5305#issuecomment-88838919 @srowen the unit tests failed while running a Python app in yarn-cluster mode. I don't think this was caused by this PR; please ask Jenkins to retest, thank you.
[GitHub] spark pull request: [SPARK-3596][YARN]Support changing the yarn cl...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/5292#discussion_r27642711
--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -125,6 +125,7 @@ private[spark] class YarnClientSchedulerBackend(
    */
   private def asyncMonitorApplication(): Unit = {
     assert(client != null && appId != null, "Application has not been submitted yet!")
+    val interval = conf.getLong("spark.yarn.client.progress.pollinterval", 1000)
--- End diff --
@srowen Yes, #5305 solves this issue; maybe we can close this PR first.
[GitHub] spark pull request: [SPARK-3596][YARN]Support changing the yarn cl...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/5292#discussion_r27636093
--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -125,6 +125,7 @@ private[spark] class YarnClientSchedulerBackend(
    */
   private def asyncMonitorApplication(): Unit = {
     assert(client != null && appId != null, "Application has not been submitted yet!")
+    val interval = conf.getLong("spark.yarn.client.progress.pollinterval", 1000)
--- End diff --
Yeah, you are right. In PR #5305 I use client.monitorApplication, so we can use "spark.yarn.report.interval" to change the YARN client monitor interval. Thank you.
[GitHub] spark pull request: [SPARK-4346][SPARK-3596][YARN] Commonize the m...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5305#issuecomment-88752700 Jenkins, retest please
[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/5294#issuecomment-88708773 @tgravescs @srowen @sryza I have retested: if we don't populate the Hadoop classpath, it doesn't work in any case. This PR can't solve this issue; I will close it later, thank you.
[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...
Github user Sephiroth-Lin closed the pull request at: https://github.com/apache/spark/pull/5294
[GitHub] spark pull request: [SPARK-4346][SPARK-3596][YARN] Commonize the m...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/5305 [SPARK-4346][SPARK-3596][YARN] Commonize the monitor logic 1. YarnClientSchedulerBackend.asyncMonitorApplication uses Client.monitorApplication, so that the monitor logic is shared 2. Support changing the yarn client monitor interval, see #5292 You can merge this pull request into a Git repository by running: $ git pull https://github.com/Sephiroth-Lin/spark SPARK-4346_3596 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5305.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5305 commit 568f46f6cd4ed38ddf8a018d8d532f9be2228045 Author: unknown Date: 2015-04-01T05:50:25Z YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication commit 6b47ff7c21daf0db42e9a7f3233daf90bb70ee63 Author: unknown Date: 2015-04-01T06:17:14Z Update code
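Shared monitor logic of this kind boils down to polling the application state at a configurable interval until a terminal state is reached. The following is a minimal sketch under that assumption, not Spark's implementation: `AppState`, `fetchState`, and `intervalMs` are hypothetical names, and in Spark the interval would come from a configuration value rather than a method argument.

```scala
object PollUntilTerminalSketch {
  // Simplified stand-ins for the YARN application states mentioned in this thread.
  sealed trait AppState
  case object Running extends AppState
  case object Finished extends AppState
  case object Failed extends AppState
  case object Killed extends AppState

  val terminal: Set[AppState] = Set(Finished, Failed, Killed)

  /** Polls `fetchState` every `intervalMs` milliseconds and returns the first terminal state. */
  def monitor(fetchState: () => AppState, intervalMs: Long): AppState = {
    var state = fetchState()
    while (!terminal.contains(state)) {
      Thread.sleep(intervalMs)
      state = fetchState()
    }
    state
  }
}
```

Because both callers would go through one `monitor` function, a single interval setting controls the polling rate everywhere, which is the benefit this pull request is after.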
[GitHub] spark pull request: [SPARK-3596][YARN]Support changing the yarn cl...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/5292#discussion_r27540657
--- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala ---
@@ -125,6 +125,7 @@ private[spark] class YarnClientSchedulerBackend(
    */
   private def asyncMonitorApplication(): Unit = {
     assert(client != null && appId != null, "Application has not been submitted yet!")
+    val interval = conf.getLong("spark.yarn.client.progress.pollinterval", 1000)
--- End diff --
Thank you, but since it is the client that gets the application report from the RM, maybe "spark.yarn.client.progress.pollinterval" is better.
[GitHub] spark pull request: [SPARK-1502][YARN]Add config option to not inc...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/5294 [SPARK-1502][YARN]Add config option to not include yarn/mapred cluster classpath You can merge this pull request into a Git repository by running: $ git pull https://github.com/Sephiroth-Lin/spark SPARK-1502 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5294.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5294 commit 96aa689b8b65ce73e13e4f48a49b85a5f8ed751a Author: unknown Date: 2015-03-31T11:31:13Z Add config option to not include yarn/mapred cluster classpath
[GitHub] spark pull request: [SPARK-3596][YARN]Support changing the yarn cl...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/5292 [SPARK-3596][YARN]Support changing the yarn client monitor interval You can merge this pull request into a Git repository by running: $ git pull https://github.com/Sephiroth-Lin/spark SPARK-3596 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5292.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5292 commit 7d6c4746986f78a37f31a12b92e0cf14332a01a4 Author: unknown Date: 2015-03-31T11:08:26Z Support changing the yarn client monitor interval
[GitHub] spark pull request: Specify ip of python server scoket
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/5256 Specify ip of python server scoket The driver currently starts a server socket bound to a wildcard IP; using 127.0.0.0 is more reasonable, as the socket is only used by the local Python process. /cc @davies You can merge this pull request into a Git repository by running: $ git pull https://github.com/Sephiroth-Lin/spark SPARK-6604 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5256.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5256 commit c88bee9819eef5a8091357d6a239e9ab61da0050 Author: unknown Date: 2015-03-30T06:21:07Z Specify ip of python server scoket
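For illustration, restricting a server socket to local connections on the JVM can look like the sketch below. It is not the change made in this PR: `LoopbackServerSketch` and `open` are hypothetical names, and the loopback address is obtained from `InetAddress.getLoopbackAddress` rather than written as a literal IP.

```scala
import java.net.{InetAddress, ServerSocket}

object LoopbackServerSketch {
  /**
   * Opens a server socket on an ephemeral port (port 0) with a backlog of 1,
   * bound to the loopback interface only, so that only processes on the same
   * host can connect.
   */
  def open(): ServerSocket =
    new ServerSocket(0, 1, InetAddress.getLoopbackAddress)
}
```

A socket created without a bind address listens on the wildcard address instead, which makes it reachable from other hosts.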
[GitHub] spark pull request: [SPARK-5830][Core]Don't create unnecessary dir...
Github user Sephiroth-Lin closed the pull request at: https://github.com/apache/spark/pull/4620
[GitHub] spark pull request: [SPARK-5830][Core]Don't create unnecessary dir...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/4620#issuecomment-76895189 @srowen OK, please help close this.
[GitHub] spark pull request: [SPARK-5830][Core]Don't create unnecessary dir...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/4620#issuecomment-75919413 @srowen since PR #4747 will cache the local root directories, we can close this PR first. For PR #4747, I think we also need to remove the local root directories after the application exits or the SparkContext is stopped; otherwise too many empty directories will still be created.
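The combination suggested in this comment, caching the directory and removing it again at stop time, can be sketched as follows. This is an assumption-laden illustration rather than the code in PR #4747: `LocalRootDirCacheSketch`, `getOrCreate`, `clear`, and the `spark-sketch-` prefix are all hypothetical.

```scala
import java.io.File
import java.nio.file.Files

object LocalRootDirCacheSketch {
  // Directory created on first use and then reused by later calls.
  private var cached: Option[File] = None

  /** Returns the cached directory, creating it only on the first call. */
  def getOrCreate(): File = synchronized {
    cached.getOrElse {
      val dir = Files.createTempDirectory("spark-sketch-").toFile
      cached = Some(dir)
      dir
    }
  }

  /** Deletes the cached directory (if it is empty) and forgets it. */
  def clear(): Unit = synchronized {
    cached.foreach(_.delete())
    cached = None
  }
}
```

With the cache in place, repeated calls stop creating new directories; calling `clear()` when the application stops keeps the single cached directory from being left behind.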
[GitHub] spark pull request: [SPARK-5801] [core] Avoid creating nested dire...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/4747#discussion_r25322767
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -728,6 +746,11 @@ private[spark] object Utils extends Logging {
     localDirs
   }
+  /** Used by unit tests. Do not call from other places. */
+  private[spark] def clearLocalRootDirs(): Unit = {
--- End diff --
Maybe we can call this function to delete the local root directory in non-YARN mode when the application exits or the SparkContext is stopped.
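The suggestion above (cache the local root directories, then delete them when the application exits or the SparkContext is stopped) can be sketched roughly as follows. This is only an illustration under assumptions: `LocalRootDirsSketch`, `getOrCreate`, and `clearAndDelete` are hypothetical names, not the code in PR #4747 or in `Utils.scala`, and the recursive delete does not special-case symbolic links.

```scala
import java.io.{File, IOException}
import java.nio.file.Files

// Hypothetical sketch, not Spark's Utils: cache created root dirs and
// delete them again when asked to clear the cache.
object LocalRootDirsSketch {
  // None until the directories have been created once.
  private var cached: Option[Seq[File]] = None

  // Create one "spark-*" directory under each parent on first use,
  // then keep returning the same directories.
  def getOrCreate(parents: Seq[File]): Seq[File] = synchronized {
    cached.getOrElse {
      val dirs = parents.map(p => Files.createTempDirectory(p.toPath, "spark-").toFile)
      cached = Some(dirs)
      dirs
    }
  }

  // Delete a file or directory tree (simplified: symlinks are not handled).
  private def deleteRecursively(f: File): Unit = {
    Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    if (!f.delete() && f.exists()) throw new IOException(s"Failed to delete $f")
  }

  // Remove the cached directories from disk and forget them.
  def clearAndDelete(): Unit = synchronized {
    cached.foreach(_.foreach(deleteRecursively))
    cached = None
  }
}
```

One possible way to cover the "application exits" case is to register the cleanup as a JVM shutdown hook, for example `sys.addShutdownHook(LocalRootDirsSketch.clearAndDelete())`, and to call `clearAndDelete()` explicitly from whatever code stops the context.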
[GitHub] spark pull request: [SPARK-5830][Core]Don't create unnecessary dir...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/4620#issuecomment-74864163 @srowen OK, thank you. If this subdirectory is really needed, maybe we can add code to delete it after the JVM exits or after sc.stop().
[GitHub] spark pull request: [SPARK-5830][Core]Don't create unnecessary dir...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/4620#issuecomment-74860104 @srowen The function "getOrCreateLocalRootDirs" creates a subdirectory under the root local dir, so calling "getLocalDir", which calls getOrCreateLocalRootDirs directly, also creates a subdirectory under the root local dir. In the current master branch, creating a tmp dir calls getLocalDir first, so it creates nested directories. In standalone mode, a tmp dir is also created first when launching an executor, so in total it creates 4 levels of directories; in other modes it creates 2 levels of directories for every tmp dir.
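A minimal sketch of how the nesting described above can arise, assuming a helper that always creates a fresh "spark-UUID" directory under whatever parent it is given. `NestedDirSketch`, `createSparkDir`, and `depthBelow` are hypothetical names used only for illustration; they are not the actual Spark functions.

```scala
import java.io.{File, IOException}
import java.util.UUID

// Hypothetical sketch: each call adds one more "spark-<UUID>" level.
object NestedDirSketch {
  // Stand-in for a helper that creates "spark-<UUID>" under `parent`.
  def createSparkDir(parent: File): File = {
    val dir = new File(parent, "spark-" + UUID.randomUUID.toString)
    if (!dir.mkdirs()) throw new IOException(s"Failed to create $dir")
    dir
  }

  // Number of path components between `base` and `dir`.
  def depthBelow(base: File, dir: File): Int =
    base.toPath.relativize(dir.toPath).getNameCount
}
```

If the local root dir is itself created with `createSparkDir(base)` and each tmp dir is then created with `createSparkDir(root)`, every tmp dir ends up two levels below `base`. Repeating the same pair of calls starting from an already nested directory, as the comment describes for standalone mode, would give four levels.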
[GitHub] spark pull request: [SPARK-5830][Core]Don't create unnecessary dir...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/4620#issuecomment-74604408 @srowen Yes, this is the same as SPARK-5801. In standalone mode, the worker creates temp directories for the executor, so if we create an unnecessary directory for the local root directory, then creating a temp directory produces too many nested directories. @srowen @andrewor14 According to the CI report, the test failure is not caused by this PR. Can it be retested?
[GitHub] spark pull request: [SPARK-5830][Core]Don't create unnecessary dir...
GitHub user Sephiroth-Lin opened a pull request: https://github.com/apache/spark/pull/4620
[SPARK-5830][Core]Don't create unnecessary directory for local root dir
Currently an unnecessary directory is created for the local root directory, and this directory is not deleted after the application exits. For example: previously a tmp dir was created like "/tmp/spark-UUID"; now it is created like "/tmp/spark-UUID/spark-UUID", so the dir "/tmp/spark-UUID", being a local root directory, is not deleted.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Sephiroth-Lin/spark SPARK-5830
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4620.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4620
commit 916fa04408e4d14b10734537402603b92763ca6d Author: Sephiroth-Lin Date: 2015-02-16T06:36:53Z Don't create unnecessary directory for local root dir
commit 26670d83fee7c3bc0681ca775ab9f0dbc3da9d2d Author: Sephiroth-Lin Date: 2015-02-16T06:37:37Z Don't create unnecessary directory for local root dir
[GitHub] spark pull request: [SPARK-5644] [Core]Delete tmp dir when sc is s...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/4412#issuecomment-73682144 @srowen thank you, please help to check again.
[GitHub] spark pull request: [SPARK-5644] [Core]Delete tmp dir when sc is s...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/4412#discussion_r24404265
--- Diff: core/src/main/scala/org/apache/spark/HttpFileServer.scala ---
@@ -50,6 +50,15 @@ private[spark] class HttpFileServer(
   def stop() {
     httpServer.stop()
+
+    // If we only stop sc, but the driver process still run as a services then we need to delete
+    // the tmp dir, if not, it will create too many tmp dirs
+    try {
+      Utils.deleteRecursively(baseDir)
+    } catch {
+      case e: Exception =>
+        logWarning("Exception while deleting Spark temp dir: " + baseDir.getAbsolutePath, e)
--- End diff --
OK.
[GitHub] spark pull request: [SPARK-5644] [Core]Delete tmp dir when sc is s...
Github user Sephiroth-Lin commented on the pull request: https://github.com/apache/spark/pull/4412#issuecomment-73655978 @srowen Thank you. I have now added a member to store a reference to the tmp dir if it was created. Please help check again.
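The idea in the comment above, remembering the tmp dir only if this process actually created it and deleting exactly that directory on stop, might look roughly like the sketch below. It is an assumption-laden illustration, not the change made in PR #4412: `FilesDirOwnerSketch`, `isDriver`, `filesDir`, and `createdTmpDir` are hypothetical names, and errors are printed to stderr here instead of going through Spark's logging.

```scala
import java.io.{File, IOException}
import java.nio.file.Files

// Hypothetical sketch: only a dir created by this instance is deleted on stop().
class FilesDirOwnerSketch(isDriver: Boolean) {
  // Set only when this instance created the temp dir itself.
  // Declared before filesDir so its initializer does not overwrite the value.
  private var createdTmpDir: Option[File] = None

  // Driver: a fresh temp dir. Executor: the current working dir, never deleted.
  val filesDir: File =
    if (isDriver) {
      val dir = Files.createTempDirectory("spark-").toFile
      createdTmpDir = Some(dir)
      dir
    } else {
      new File(".")
    }

  def stop(): Unit = {
    createdTmpDir.foreach { dir =>
      try FilesDirOwnerSketch.deleteRecursively(dir)
      catch {
        case e: Exception =>
          Console.err.println(s"Exception while deleting temp dir ${dir.getAbsolutePath}: $e")
      }
    }
    createdTmpDir = None
  }
}

object FilesDirOwnerSketch {
  // Delete a file or directory tree (simplified: symlinks are not handled).
  def deleteRecursively(f: File): Unit = {
    Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    if (!f.delete() && f.exists()) throw new IOException(s"Failed to delete $f")
  }
}
```

With this shape, `stop()` does not need to compare executor IDs: an instance that never created a temp dir has nothing recorded and therefore deletes nothing.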
[GitHub] spark pull request: [SPARK-5644] [Core]Delete tmp dir when sc is s...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/4412#discussion_r24306573
--- Diff: core/src/main/scala/org/apache/spark/SparkEnv.scala ---
@@ -93,6 +93,19 @@ class SparkEnv (
     // actorSystem.awaitTermination()
     // Note that blockTransferService is stopped by BlockManager since it is started by it.
+
+    // If we only stop sc, but the driver process still run as a services then we need to delete
+    // the tmp dir, if not, it will create too many tmp dirs.
+    // We only need to delete the tmp dir create by driver, because sparkFilesDir is point to the
+    // current working dir in executor which we do not need to delete.
+    if (SparkContext.DRIVER_IDENTIFIER == executorId) {
--- End diff --
@srowen Thank you. If we want to make this much more intimately bound, maybe we can check sparkFilesDir directly; otherwise, we need to add a parameter to the SparkEnv class.
[GitHub] spark pull request: [SPARK-5644] [Core]Delete tmp dir when sc is s...
Github user Sephiroth-Lin commented on a diff in the pull request: https://github.com/apache/spark/pull/4412#discussion_r24305832
--- Diff: core/src/main/scala/org/apache/spark/SparkEnv.scala ---
@@ -93,6 +93,19 @@ class SparkEnv (
     // actorSystem.awaitTermination()
     // Note that blockTransferService is stopped by BlockManager since it is started by it.
+
+    // If we only stop sc, but the driver process still run as a services then we need to delete
+    // the tmp dir, if not, it will create too many tmp dirs.
+    // We only need to delete the tmp dir create by driver, because sparkFilesDir is point to the
+    // current working dir in executor which we do not need to delete.
+    if (SparkContext.DRIVER_IDENTIFIER == executorId) {
--- End diff --
@srowen Sorry, I'm not very clear on this. Do you mean we cannot use the executorId to distinguish the driver from an executor?