[GitHub] spark pull request #21526: [SPARK-24515][CORE] No need to warning when outpu...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/21526
[GitHub] spark issue #21790: [SPARK-24544][SQL] Print actual failure cause when look ...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21790 I will fix the test and add a unit test as soon as possible, @maropu. I have been too busy this past month; sorry for the late reply. Thanks for your comments and your precious time again!
[GitHub] spark issue #21790: [SPARK-24544][SQL] Print actual failure cause when look ...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21790 retest this please
[GitHub] spark issue #21790: [SPARK-24544][SQL] Print actual failure cause when look ...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21790 @maropu Could you help review this? Thanks. Since I used `git merge` in https://github.com/apache/spark/pull/21552, I reopened it as this PR.
[GitHub] spark pull request #21790: [SPARK-24544][SQL] Print actual failure cause whe...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/21790 [SPARK-24544][SQL] Print actual failure cause when look up function f… ## What changes were proposed in this pull request? When we operate as below:
```
0: jdbc:hive2://xxx/> create function funnel_analysis as 'com.xxx.hive.extend.udf.UapFunnelAnalysis';
0: jdbc:hive2://xxx/> select funnel_analysis(1,",",1,'');
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'funnel_analysis'. This function is neither a registered temporary function nor a permanent function registered in the database 'xxx'.; line 1 pos 7 (state=,code=0)
0: jdbc:hive2://xxx/> describe function funnel_analysis;
+---------------------------------------------------+
| function_desc                                     |
+---------------------------------------------------+
| Function: xxx.funnel_analysis                     |
| Class: com.xxx.hive.extend.udf.UapFunnelAnalysis  |
| Usage: N/A.                                       |
+---------------------------------------------------+
```
We can see that `describe function` returns the right information, but when we actually use this function we get an "undefined function" exception, which is really misleading. The real cause is below:
```
No handler for Hive UDF 'com.xxx.xxx.hive.extend.udf.UapFunnelAnalysis': java.lang.IllegalStateException: Should not be called directly;
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDTF.initialize(GenericUDTF.java:72)
  at org.apache.spark.sql.hive.HiveGenericUDTF.outputInspector$lzycompute(hiveUDFs.scala:204)
  at org.apache.spark.sql.hive.HiveGenericUDTF.outputInspector(hiveUDFs.scala:204)
  at org.apache.spark.sql.hive.HiveGenericUDTF.elementSchema$lzycompute(hiveUDFs.scala:212)
  at org.apache.spark.sql.hive.HiveGenericUDTF.elementSchema(hiveUDFs.scala:212)
```
This patch prints the actual failure cause for quick debugging. ## How was this patch tested? UT You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/print-warning1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21790.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21790 commit 690035a877d21de75310011eeecc80f2ff87b4bf Author: zhoukang Date: 2018-07-17T10:07:21Z [SPARK-24544][SQL] Print actual failure cause when look up function failed
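The pattern the patch follows, per the diffs later in this digest, is to log the underlying exception and attach it as the cause of the error that reaches the user. A minimal, self-contained sketch of that pattern with hypothetical names (not the PR's actual code):
```
import scala.util.{Failure, Success, Try}

// Resolve a function, but preserve the real root cause on failure instead of
// replacing it with a generic "undefined function" error.
def lookupFunction[T](name: String)(resolve: String => T): T =
  Try(resolve(name)) match {
    case Success(fn) => fn
    case Failure(error) =>
      // Attaching `error` as the cause keeps the real stack trace visible.
      throw new RuntimeException(s"Undefined function: '$name'", error)
  }
```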
[GitHub] spark pull request #21552: [SPARK-24544][SQL] Print actual failure cause whe...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/21552
[GitHub] spark issue #21664: [SPARK-24687][CORE] NoClassDefFoundError will not be cat...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21664 I think this patch will fix the 'job hanging' problem @jerryshao @jiangxb1987
[GitHub] spark pull request #21263: [SPARK-24084][ThriftServer] Add job group id for ...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/21263
[GitHub] spark pull request #21526: [SPARK-24515][CORE] No need to warning when outpu...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/21526#discussion_r202006002 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
```
@@ -357,6 +357,11 @@ package object config {
       .intConf
       .createWithDefault(256)

+  private[spark] val HADOOP_OUTPUTCOMMITCOORDINATION_ENABLED =
+    ConfigBuilder("spark.hadoop.outputCommitCoordination.enabled")
+      .booleanConf
```
--- End diff -- Done @cloud-fan Thanks!
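For context, a hedged sketch of what the completed config entry plausibly looks like once a default and doc string are added (the doc wording is an assumption, not the PR's final text):
```
private[spark] val HADOOP_OUTPUTCOMMITCOORDINATION_ENABLED =
  ConfigBuilder("spark.hadoop.outputCommitCoordination.enabled")
    .doc("Whether to coordinate Hadoop output commits through the " +
      "OutputCommitCoordinator, preventing duplicate commits from speculative tasks.")
    .booleanConf
    .createWithDefault(true)
```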
[GitHub] spark pull request #21526: [SPARK-24515][CORE] No need to warning when outpu...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/21526#discussion_r202005902 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
```
@@ -357,6 +357,11 @@ package object config {
       .intConf
       .createWithDefault(256)

+  private[spark] val HADOOP_OUTPUTCOMMITCOORDINATION_ENABLED =
+    ConfigBuilder("spark.hadoop.outputCommitCoordination.enabled")
+      .booleanConf
```
--- End diff -- Done @cloud-fan
[GitHub] spark issue #21664: [SPARK-24687][CORE] NoClassDefFoundError will not be cat...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21664 The problem is that the job hangs, and the user will be misled since no more information is printed.
[GitHub] spark pull request #21552: [SPARK-24544][SQL] Print actual failure cause whe...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/21552#discussion_r201970972 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala ---
```
@@ -131,6 +132,8 @@ private[sql] class HiveSessionCatalog(
     Try(super.lookupFunction(funcName, children)) match {
       case Success(expr) => expr
       case Failure(error) =>
+        logWarning(s"Encounter a failure during looking up function:" +
+          s" ${Utils.exceptionString(error)}")
         if (functionRegistry.functionExists(funcName)) {
```
--- End diff -- I have updated it to set the root cause for `NoSuchFunctionException`.
[GitHub] spark issue #21664: [SPARK-24687][CORE] NoClassDefFoundError will not be cat...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21664 @jerryshao I think we should at least print a warning, not let the job hang? Thanks
[GitHub] spark pull request #21526: [SPARK-24515][CORE] No need to warning when outpu...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/21526#discussion_r201963160 --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
```
@@ -1053,7 +1053,10 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     // users that they may loss data if they are using a direct output committer.
     val speculationEnabled = self.conf.getBoolean("spark.speculation", false)
     val outputCommitterClass = hadoopConf.get("mapred.output.committer.class", "")
-    if (speculationEnabled && outputCommitterClass.contains("Direct")) {
+    val outputCommitCoordinationEnabled = self.conf.getBoolean(
+      "spark.hadoop.outputCommitCoordination.enabled", true)
+    if (speculationEnabled && outputCommitterClass.contains("Direct")
+      && !outputCommitCoordinationEnabled) {
       val warningMessage =
```
--- End diff -- Also modified `HiveFileFormat`. @cloud-fan @jiangxb1987 The reason I did not refactor this into a common function is that I can't find a good place to put it. Any suggestion?
[GitHub] spark pull request #21526: [SPARK-24515][CORE] No need to warning when outpu...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/21526#discussion_r201962635 --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
```
@@ -1053,7 +1053,10 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     // users that they may loss data if they are using a direct output committer.
     val speculationEnabled = self.conf.getBoolean("spark.speculation", false)
     val outputCommitterClass = hadoopConf.get("mapred.output.committer.class", "")
-    if (speculationEnabled && outputCommitterClass.contains("Direct")) {
+    val outputCommitCoordinationEnabled = self.conf.getBoolean(
+      "spark.hadoop.outputCommitCoordination.enabled", true)
```
--- End diff -- Updated @cloud-fan Thanks
[GitHub] spark issue #21664: [SPARK-24687][CORE] NoClassDefFoundError will not be cat...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21664 @jiangxb1987 Thanks, I updated the description of this PR. Stage 33 is never scheduled, and the job does not abort.
[GitHub] spark issue #21664: [SPARK-24687][CORE] NoClassDefFoundError will not be cat...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21664 I have posted an image on the JIRA @jiangxb1987
[GitHub] spark pull request #21664: [SPARK-24678][CORE] NoClassDefFoundError will not...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/21664#discussion_r200828384 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
```
@@ -1049,6 +1049,13 @@ class DAGScheduler(
         abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
         runningStages -= stage
         return
+
+      case e: NoClassDefFoundError =>
```
--- End diff -- Actually, it will cause the job to hang since the state is never updated.
[GitHub] spark issue #21552: [SPARK-24544][SQL] Print actual failure cause when look ...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21552 @maropu Maybe I will do this check, as @cloud-fan mentioned?
[GitHub] spark pull request #21552: [SPARK-24544][SQL] Print actual failure cause whe...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/21552#discussion_r200828333 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala ---
```
@@ -131,6 +132,8 @@ private[sql] class HiveSessionCatalog(
     Try(super.lookupFunction(funcName, children)) match {
       case Success(expr) => expr
       case Failure(error) =>
+        logWarning(s"Encounter a failure during looking up function:" +
+          s" ${Utils.exceptionString(error)}")
         if (functionRegistry.functionExists(funcName)) {
```
--- End diff -- @viirya Thanks, I will set the cause for `NoSuchFunctionException` later
[GitHub] spark issue #21664: [SPARK-24678][CORE] NoClassDefFoundError will not be cat...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21664 It was caused by shading and a missing jar. I will post an example later. Thanks @jerryshao
[GitHub] spark pull request #21552: [SPARK-24544][SQL] Print actual failure cause whe...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/21552#discussion_r200157051 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala ---
```
@@ -131,6 +132,8 @@ private[sql] class HiveSessionCatalog(
     Try(super.lookupFunction(funcName, children)) match {
       case Success(expr) => expr
       case Failure(error) =>
+        logWarning(s"Encounter a failure during looking up function:" +
+          s" ${Utils.exceptionString(error)}")
         if (functionRegistry.functionExists(funcName)) {
```
--- End diff -- @maropu Any more suggestions? Thanks
[GitHub] spark issue #21664: [SPARK-24678][CORE] NoClassDefFoundError will not be cat...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21664 cc @jiangxb1987 @jerryshao Could you help review this? Thanks
[GitHub] spark pull request #21664: [SPARK-24678][CORE] NoClassDefFoundError will not...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/21664 [SPARK-24678][CORE] NoClassDefFoundError will not be catch up which will ca… …use job hang ## What changes were proposed in this pull request? When a NoClassDefFoundError is thrown, it causes the job to hang.
```
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: Lcom/xxx/data/recommend/aggregator/queue/QueueName;
  at java.lang.Class.getDeclaredFields0(Native Method)
  at java.lang.Class.privateGetDeclaredFields(Class.java:2436)
  at java.lang.Class.getDeclaredField(Class.java:1946)
  at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
  at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
  at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
  at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
  at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
  at java.io.ObjectOutputStream.writeClass(ObjectOutputStream.java:1212)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1119)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
  at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
  at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
```
It is caused by NoClassDefFoundError not being caught during task serialization:
```
var taskBinary: Broadcast[Array[Byte]] = null
try {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  // For ResultTask, serialize and broadcast (rdd, func).
  val taskBinaryBytes: Array[Byte] = stage match {
    case stage: ShuffleMapStage =>
      JavaUtils.bufferToArray(
        closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
    case stage: ResultStage =>
      JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
  }

  taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
  // In the case of a failure during serialization, abort the stage.
  case e: NotSerializableException =>
    abortStage(stage, "Task not serializable: " + e.toString, Some(e))
    runningStages -= stage
    // Abort execution
    return
  case NonFatal(e) =>
    abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
    runningStages -= stage
    return
}
```
## How was this patch tested? UT You can merge this pull request into a Git re
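Why the existing handlers miss it: `NoClassDefFoundError` extends `java.lang.LinkageError`, which `scala.util.control.NonFatal` deliberately treats as fatal, so the `case NonFatal(e)` branch above never matches; the error escapes the try block and kills the dag-scheduler-event-loop thread while the job waits forever. A small self-contained check of that claim:
```
import scala.util.control.NonFatal

object NonFatalCheck extends App {
  val e = new NoClassDefFoundError("com/example/Missing")
  // LinkageError (the parent of NoClassDefFoundError) is classified as fatal,
  // so the DAGScheduler's `case NonFatal(e)` handler is skipped for it.
  println(NonFatal(e)) // prints: false
}
```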
[GitHub] spark pull request #21552: [SPARK-24544][SQL] Print actual failure cause whe...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/21552#discussion_r196378408 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala ---
```
@@ -131,6 +132,8 @@ private[sql] class HiveSessionCatalog(
     Try(super.lookupFunction(funcName, children)) match {
       case Success(expr) => expr
       case Failure(error) =>
+        logWarning(s"Encounter a failure during looking up function:" +
+          s" ${Utils.exceptionString(error)}")
         if (functionRegistry.functionExists(funcName)) {
```
--- End diff -- Actually, this patch is intended to print the actual exception. As the description mentions, this exception is helpful for troubleshooting. Thanks @maropu
[GitHub] spark issue #21552: [SPARK-24544][SQL] Print actual failure cause when look ...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21552 cc @cloud-fan Could you help review this? Thanks
[GitHub] spark pull request #21552: [SPARK-24544][SQL] Print actual failure cause whe...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/21552 [SPARK-24544][SQL] Print actual failure cause when look up function failed ## What changes were proposed in this pull request? When we operate as below:
```
0: jdbc:hive2://xxx/> create function funnel_analysis as 'com.xxx.hive.extend.udf.UapFunnelAnalysis';
0: jdbc:hive2://xxx/> select funnel_analysis(1,",",1,'');
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'funnel_analysis'. This function is neither a registered temporary function nor a permanent function registered in the database 'xxx'.; line 1 pos 7 (state=,code=0)
0: jdbc:hive2://xxx/> describe function funnel_analysis;
+---------------------------------------------------+
| function_desc                                     |
+---------------------------------------------------+
| Function: mifi.funnel_analysis                    |
| Class: com.xxx.hive.extend.udf.UapFunnelAnalysis  |
| Usage: N/A.                                       |
+---------------------------------------------------+
```
We can see that `describe function` returns the right information, but when we actually use this function we get an "undefined function" exception, which is really misleading. The real cause is below:
```
No handler for Hive UDF 'com.xiaomi.mifi.hive.extend.udf.UapFunnelAnalysis': java.lang.IllegalStateException: Should not be called directly;
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDTF.initialize(GenericUDTF.java:72)
  at org.apache.spark.sql.hive.HiveGenericUDTF.outputInspector$lzycompute(hiveUDFs.scala:204)
  at org.apache.spark.sql.hive.HiveGenericUDTF.outputInspector(hiveUDFs.scala:204)
  at org.apache.spark.sql.hive.HiveGenericUDTF.elementSchema$lzycompute(hiveUDFs.scala:212)
  at org.apache.spark.sql.hive.HiveGenericUDTF.elementSchema(hiveUDFs.scala:212)
```
This patch prints the actual failure cause for quick debugging. ## How was this patch tested? UT You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/print-warning Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21552.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21552 commit 15d018a1e72ddf6cd29d6942359b9e6bd5547f67 Author: zhoukang Date: 2018-06-13T10:23:01Z [SPARK][SQL] Print actual failure cause when look up function failed
[GitHub] spark pull request #21526: [SPARK-24515][CORE] No need to warning when outpu...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/21526 [SPARK-24515][CORE] No need to warning when output commit coordination enabled ## What changes were proposed in this pull request? There is no need to warn the user when output commit coordination is enabled
```
// When speculation is on and output committer class name contains "Direct", we should warn
// users that they may loss data if they are using a direct output committer.
val speculationEnabled = self.conf.getBoolean("spark.speculation", false)
val outputCommitterClass = hadoopConf.get("mapred.output.committer.class", "")
if (speculationEnabled && outputCommitterClass.contains("Direct")) {
  val warningMessage =
    s"$outputCommitterClass may be an output committer that writes data directly to " +
      "the final location. Because speculation is enabled, this output committer may " +
      "cause data loss (see the case in SPARK-10063). If possible, please use an output " +
      "committer that does not have this behavior (e.g. FileOutputCommitter)."
  logWarning(warningMessage)
}
```
## How was this patch tested? UT You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/fix-warning Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21526.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21526 commit 6bac1531929e914764d980e4eb4228a10436876b Author: zhoukang Date: 2018-06-11T08:57:11Z [SPARK][CORE] No need to warning when output commit coordination enabled
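As refined during review (see the diffs on #21526 earlier in this digest), the fix gates the warning on the new coordination flag, since the output commit coordinator already prevents duplicate commits from speculative tasks. A sketch of the gated check, assuming the config key proposed in this PR:
```
val speculationEnabled = self.conf.getBoolean("spark.speculation", false)
val outputCommitterClass = hadoopConf.get("mapred.output.committer.class", "")
val outputCommitCoordinationEnabled = self.conf.getBoolean(
  "spark.hadoop.outputCommitCoordination.enabled", true)
// Only warn when coordination is disabled; with it enabled, speculative
// duplicate commits are already prevented, so the warning is just noise.
if (speculationEnabled && outputCommitterClass.contains("Direct") &&
    !outputCommitCoordinationEnabled) {
  logWarning(warningMessage) // same message text as quoted above
}
```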
[GitHub] spark pull request #21263: [SPARK-24084][ThriftServer] Add job group id for ...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/21263 [SPARK-24084][ThriftServer] Add job group id for sql statement through spark-sql ## What changes were proposed in this pull request? Add a job group id for spark-sql mode ## How was this patch tested? Existing UT You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/add-jobgroupid Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21263.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21263 commit 172d86b166c381e8cdb0b9853e5d6176bae9ccf8 Author: zhoukang Date: 2018-04-25T11:44:13Z [SPARK][Thrift Server] Add job group id for sql statement
[GitHub] spark pull request #21151: [SPARK-24083][YARN] Log stacktrace for uncaught e...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/21151 [SPARK-24083][YARN] Log stacktrace for uncaught exception ## What changes were proposed in this pull request? Log the stack trace for uncaught exceptions ## How was this patch tested? UT and manual test You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/log-stacktrace Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21151.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21151 commit 805475c004ae1b30c37a38e5de1bdab888fdfd37 Author: zhoukang Date: 2018-04-25T07:36:20Z Log stacktrace for uncaught exception
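The PR's diff is not included in this digest; purely as an illustration of the technique, a minimal handler that logs the full stack trace rather than just the message (all names here are hypothetical):
```
import org.slf4j.LoggerFactory

object LoggingUncaughtExceptionHandler extends Thread.UncaughtExceptionHandler {
  private val log = LoggerFactory.getLogger(getClass)

  override def uncaughtException(t: Thread, e: Throwable): Unit = {
    // Passing the Throwable itself makes the logger print the whole stack trace,
    // not just e.getMessage.
    log.error(s"Uncaught exception in thread ${t.getName}", e)
  }
}
```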
[GitHub] spark pull request #21127: [SPARK-24052][CORE][UI] Add spark version informa...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/21127
[GitHub] spark issue #21127: [SPARK-24052][CORE][UI] Add spark version information on...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21127 Ok, I will close this PR. Thanks for your time @srowen @vanzin
[GitHub] spark issue #21127: [SPARK-24052][CORE][UI] Add spark version information on...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21127 How about the other information? As mentioned, the build info @vanzin
[GitHub] spark pull request #21128: [SPARK-24053][CORE] Support add subdirectory name...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/21128
[GitHub] spark issue #21128: [SPARK-24053][CORE] Support add subdirectory named as us...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21128 Got it, thanks @vanzin
[GitHub] spark issue #21127: [SPARK-24052][CORE][UI] Add spark version information on...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21127 It is the version for the history server @vanzin Maybe I can reuse SparkBuildInfo? @srowen @vanzin Thanks!
[GitHub] spark issue #21128: [SPARK-24053][CORE] Support add subdirectory named as us...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/21128 @vanzin Thanks! Since we set the common staging dir as
```
hdfs://xxx/spark/xxx/staging
```
this patch will give us a directory for a given user, like:
```
hdfs://xxx/spark/xxx/staging/u_zhoukang
```
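An illustrative sketch of how such a per-user subdirectory can be derived (the `u_` prefix matches the example above; the remaining names are assumptions, not the PR's actual code):
```
import org.apache.hadoop.fs.Path
import org.apache.hadoop.security.UserGroupInformation

// Append a per-user subdirectory under the shared staging directory.
val stagingBase = new Path("hdfs://xxx/spark/xxx/staging")
val shortUser = UserGroupInformation.getCurrentUser.getShortUserName
val userStagingDir = new Path(stagingBase, s"u_$shortUser")
```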
[GitHub] spark pull request #21128: [SPARK-24053][CORE] Support add subdirectory name...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/21128 [SPARK-24053][CORE] Support add subdirectory named as user name on staging directory ## What changes were proposed in this pull request? When we have multiple users on the same cluster, we can support adding a subdirectory named after the login user ## How was this patch tested? Existing UT You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/support-userlevel Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21128.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21128 commit b3190525367e750c26d56c1e61cfd3473e6a109a Author: zhoukang Date: 2018-04-23T10:45:20Z [SPARK] Support add subdirectory named as user name on staging directory
[GitHub] spark pull request #21127: [SPARK-24052][CORE][UI] Add spark version informa...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/21127 [SPARK-24052][CORE][UI] Add spark version information on environment page ## What changes were proposed in this pull request? Since we may have multiple versions in a production cluster, we can show some information on the environment page like below: ![environment-page](https://user-images.githubusercontent.com/26762018/39121222-8b0bba5a-4723-11e8-8d52-b1a5ede9b0e7.png) ## How was this patch tested? Existing UT You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/add-sparkversion Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21127.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21127 commit a9c6563693e598f9f8babf899790751a6b73fd4c Author: zhoukang Date: 2018-04-23T10:17:50Z [SPARK][CORE][UI] Add spark version information on environment page
[GitHub] spark pull request #20833: [SPARK-23692][SQL]Print metadata of files when in...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/20833
[GitHub] spark issue #20845: [SPARK-23708][CORE] Correct comment for function addShut...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20845 Got it! @jerryshao Thanks!
[GitHub] spark issue #20845: [SPARK-23708][CORE] Correct comment for function addShut...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20845 @jiangxb1987 @Ngone51 Thanks! Any more comments?
[GitHub] spark issue #20845: [SPARK-23708][CORE] Correct comment for function addShut...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20845 cc @cloud-fan @jerryshao
[GitHub] spark pull request #20845: [SPARK-23708][CORE] Correct comment for function ...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/20845 [SPARK-23708][CORE] Correct comment for function addShutDownHook in ShutdownHookManager ## What changes were proposed in this pull request? The comment below is not right.
```
/**
 * Adds a shutdown hook with the given priority. Hooks with lower priority values run
 * first.
 *
 * @param hook The code to run during shutdown.
 * @return A handle that can be used to unregister the shutdown hook.
 */
def addShutdownHook(priority: Int)(hook: () => Unit): AnyRef = {
  shutdownHooks.add(priority, hook)
}
```
## How was this patch tested? UT You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/fix-shutdowncomment Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20845.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20845 commit 0199a30783dcef86c5e643fdbcf8b31d4675326b Author: zhoukang Date: 2018-03-16T09:21:58Z Correct comment for function addShutDownHook in ShutdownHookManager
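In Spark's `ShutdownHookManager`, hooks with higher priority values run first, so the corrected doc comment should read along these lines:
```
/**
 * Adds a shutdown hook with the given priority. Hooks with higher priority
 * values run first.
 *
 * @param hook The code to run during shutdown.
 * @return A handle that can be used to unregister the shutdown hook.
 */
def addShutdownHook(priority: Int)(hook: () => Unit): AnyRef = {
  shutdownHooks.add(priority, hook)
}
```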
[GitHub] spark pull request #20833: [SPARK-23692][SQL]Print metadata of files when in...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/20833 [SPARK-23692][SQL] Print metadata of files when infer schema failed ## What changes were proposed in this pull request? A trivial modification. Currently, when we have no input files to infer a schema from, we throw the exception below, which some users may find misleading. Printing the files' metadata would make it much clearer.
```
Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
  at com.xiaomi.matrix.pipeline.jobspark.importer.MatrixAdEventDailyImportJob.<init>(MatrixAdEventDailyImportJob.scala:18)
```
## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/modify-log Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20833.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20833 commit 91e53d87b0f5503ba7e9c9bb6a7258ef30f87c9d Author: zhoukang Date: 2018-03-15T08:53:06Z Print metadata of files when infer schema failed
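A minimal sketch of the idea with illustrative names only (not the PR's actual code): build the error message from the file listing that was considered, so an empty or mistyped input path becomes obvious.
```
import org.apache.hadoop.fs.FileStatus

// Compose an error message that includes the metadata of the files that were
// scanned while trying to infer the schema.
def inferSchemaError(format: String, files: Seq[FileStatus]): String = {
  val listing = files.take(10).map(f => s"${f.getPath} (${f.getLen} bytes)")
  s"Unable to infer schema for $format. It must be specified manually. " +
    s"Files considered: ${listing.mkString(", ")}"
}
```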
[GitHub] spark pull request #20508: [SPARK-23335][SQL] Should not convert to double w...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/20508
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20667 Thanks @cloud-fan @jiangxb1987 @kiszk @Ngone51
[GitHub] spark pull request #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case bl...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20667#discussion_r171121778 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala ---
```
@@ -132,10 +133,15 @@ private[spark] object BlockManagerId {
     getCachedBlockManagerId(obj)
   }

-  val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
+  val blockManagerIdCache = CacheBuilder.newBuilder()
+    .maximumSize(500)
```
--- End diff -- Thanks @jiangxb1987, I have updated the comment
[GitHub] spark pull request #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case bl...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20667#discussion_r170881808 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala ---
```
@@ -132,10 +133,15 @@ private[spark] object BlockManagerId {
     getCachedBlockManagerId(obj)
   }

-  val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
+  val blockManagerIdCache = CacheBuilder.newBuilder()
+    .maximumSize(500)
```
--- End diff -- Actually, I think 500 executors can handle most applications, and for the history server there is no need to cache too many `BlockManagerId`s. If we set this number to 500, the max size of the cache will be below 30KB. Do you agree with that? @jiangxb1987 If ok, I will update the documentation.
[GitHub] spark pull request #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case bl...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20667#discussion_r170864215 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala ---
```
@@ -132,10 +133,15 @@ private[spark] object BlockManagerId {
     getCachedBlockManagerId(obj)
   }

-  val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
+  val blockManagerIdCache = CacheBuilder.newBuilder()
```
--- End diff -- I think it is thread-safe, which I refer from: https://google.github.io/guava/releases/22.0/api/docs/com/google/common/cache/LoadingCache.html and https://stackoverflow.com/questions/11124856/using-guava-for-high-performance-thread-safe-caching
[GitHub] spark pull request #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case bl...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20667#discussion_r170817107 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala ---
```
@@ -132,10 +133,15 @@ private[spark] object BlockManagerId {
     getCachedBlockManagerId(obj)
   }

-  val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
+  val blockManagerIdCache = CacheBuilder.newBuilder()
+    .maximumSize(500)
```
--- End diff -- Here I set 500 since a `BlockManagerId` is about `48B` per object. I do not use a Spark conf since it is not convenient to get the Spark conf for the history server when using BlockManagerId
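A self-contained sketch of the bounded identity cache from the diff above; the key and value types are simplified to String here for illustration (the real code caches `BlockManagerId` instances):
```
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

val idCache: LoadingCache[String, String] =
  CacheBuilder.newBuilder()
    .maximumSize(500) // ~48B per entry, so the bound keeps the cache tiny
    .build(new CacheLoader[String, String] {
      // Identity loader: the key itself is the canonical cached instance.
      override def load(id: String): String = id
    })

// Returns the cached instance, computing (i.e. caching) it on first access;
// once 500 entries are exceeded, old entries are evicted instead of leaking.
val cached = idCache.get("BlockManagerId(exec-1, host-1, 7337)")
```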
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20667 Updated @jiangxb1987
[GitHub] spark issue #20667: [SPARK-23508][CORE] Use softreference for BlockmanagerId...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20667 Nice @jiangxb1987 @cloud-fan I will modify it later. Thanks!
[GitHub] spark pull request #20667: [SPARK-23508][CORE] Use timeStampedHashMap for Bl...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20667#discussion_r170516775 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala ---
```
@@ -132,10 +133,15 @@ private[spark] object BlockManagerId {
     getCachedBlockManagerId(obj)
   }

-  val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
+  val blockManagerIdCache = new TimeStampedHashMap[BlockManagerId, BlockManagerId](true)

-  def getCachedBlockManagerId(id: BlockManagerId): BlockManagerId = {
+  def getCachedBlockManagerId(id: BlockManagerId, clearOldValues: Boolean = false): BlockManagerId =
+  {
     blockManagerIdCache.putIfAbsent(id, id)
-    blockManagerIdCache.get(id)
+    val blockManagerId = blockManagerIdCache.get(id)
+    if (clearOldValues) {
+      blockManagerIdCache.clearOldValues(System.currentTimeMillis - Utils.timeStringAsMs("10d"))
```
--- End diff -- @Ngone51 Thanks. I also thought about removing entries when we delete a block. In this case, it is history replaying that triggers the problem, and we do not actually delete any block. Maybe using a `weakreference` is better, as @jiangxb1987 mentioned? WDYT? Thanks again!
[GitHub] spark issue #20667: [SPARK-23508][CORE] Use timeStampedHashMap for Blockmana...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20667 @cloud-fan I just found the commit log below: `Modified StorageLevel and BlockManagerId to cache common objects and use cached object while deserializing.` I can't figure out why we need the cache, since I think the cache miss rate may be high?
[GitHub] spark pull request #20667: [SPARK-23508][CORE] Use timeStampedHashMap for Bl...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/20667 [SPARK-23508][CORE] Use timeStampedHashMap for BlockmanagerId in case blockManagerIdCache… …cause oom ## What changes were proposed in this pull request? blockManagerIdCache in BlockManagerId does not remove old values, which may cause OOM
```
val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
```
since whenever we apply a new BlockManagerId, it is put into this map. This patch uses TimeStampedHashMap instead for `JsonProtocol`. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/fix-history Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20667.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20667 commit fc1b6a0169c123a825a253defb021c73aebf1c98 Author: zhoukang Date: 2018-02-24T10:13:01Z Use timeStampedHashMap for BlockmanagerId in case blockManagerIdCache cause oom
[GitHub] spark pull request #20508: [SPARK-23335][SQL] Should not convert to double w...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20508#discussion_r166839312 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
```
@@ -327,6 +327,14 @@ object TypeCoercion {
     // Skip nodes who's children have not been resolved yet.
     case e if !e.childrenResolved => e

+    // For integralType should not convert to double which will cause precision loss.
+    case a @ BinaryArithmetic(left @ StringType(), right @ IntegralType()) =>
```
--- End diff -- @wangyum Sorry for bothering you; I will take some time to fix this later.
[GitHub] spark pull request #20508: [SPARK-23335][SQL] Should not convert to double w...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20508#discussion_r166182782 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
```
@@ -327,6 +327,14 @@ object TypeCoercion {
     // Skip nodes who's children have not been resolved yet.
     case e if !e.childrenResolved => e

+    // For integralType should not convert to double which will cause precision loss.
+    case a @ BinaryArithmetic(left @ StringType(), right @ IntegralType()) =>
```
--- End diff -- Thanks @wangyum, it will return `NULL`. I modified it to use `DecimalType.SYSTEM_DEFAULT` instead. I considered checking the value, but I think `DecimalType.SYSTEM_DEFAULT` is enough. What do you think about this?
[GitHub] spark pull request #20508: [SPARK-23335][SQL] Should not convert to double w...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/20508 [SPARK-23335][SQL] Should not convert to double when there is an Integra… …l value in BinaryArithmetic which will loss precison ## What changes were proposed in this pull request? For the expression below: `select conv('',16,10) % 2;` it will return 0.
```
0: jdbc:hive2://xxx:16> select conv('',16,10) % 2;
+------------------------------------------------------------------------------+--+
| (CAST(conv(, 16, 10) AS DOUBLE) % CAST(CAST(2 AS DECIMAL(20,0)) AS DOUBLE))  |
+------------------------------------------------------------------------------+--+
| 0.0                                                                          |
+------------------------------------------------------------------------------+--+
```
It is caused by:
```
case a @ BinaryArithmetic(left @ StringType(), right) =>
  a.makeCopy(Array(Cast(left, DoubleType), right))
case a @ BinaryArithmetic(left, right @ StringType()) =>
  a.makeCopy(Array(left, Cast(right, DoubleType)))
```
This patch fixes this by adding a rule check: when there is an integral type in a BinaryArithmetic operator, we should not convert the value to double. Result as below:
```
0: jdbc:hive2://xxx:16> select conv('',16,10) % 2;
+--------------------------------------------------------------------------------------------------------------------+--+
| (CAST(CAST(conv(, 16, 10) AS DECIMAL(38,0)) AS DECIMAL(38,0)) % CAST(CAST(2 AS DECIMAL(38,0)) AS DECIMAL(38,0)))    |
+--------------------------------------------------------------------------------------------------------------------+--+
| 1                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------+--+
```
## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/fix-castasdouble Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20508.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20508 commit 1a2c62f6e2725cbbdc44c464c7fc0b9358e064b2 Author: zhoukang Date: 2018-02-05T10:52:40Z [SPARK-MI][SQL] Should not convert to double when there is an Integral value in BinaryArithmetic which will loss precison
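A sketch of the proposed coercion cases based on the PR's diff: the string side is cast to `DecimalType.SYSTEM_DEFAULT` rather than `DoubleType` when the other side is integral, so no precision is lost (this is a fragment of a Catalyst rule, not a complete file; the mirrored second case is an assumption):
```
// Inside TypeCoercion's rule for promoting string operands:
case a @ BinaryArithmetic(left @ StringType(), right @ IntegralType()) =>
  a.makeCopy(Array(Cast(left, DecimalType.SYSTEM_DEFAULT), right))
case a @ BinaryArithmetic(left @ IntegralType(), right @ StringType()) =>
  a.makeCopy(Array(left, Cast(right, DecimalType.SYSTEM_DEFAULT)))
```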
[GitHub] spark issue #20292: [SPARK-23129][CORE] Make deserializeStream of DiskMapIte...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20292 Thanks for your time! @cloud-fan @jerryshao @jiangxb1987
[GitHub] spark pull request #20292: [SPARK-23129][CORE] Make deserializeStream of Dis...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20292#discussion_r163160055 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala ---
```
@@ -463,21 +463,21 @@ class ExternalAppendOnlyMap[K, V, C](

     // An intermediate stream that reads from exactly one batch
     // This guards against pre-fetching and other arbitrary behavior of higher level streams
-    private var deserializeStream = nextBatchStream()
+    private var deserializeStream: Option[DeserializationStream] = None
```
--- End diff -- Ok, I will fix it later @jiangxb1987 @cloud-fan Thanks for your precious time
[GitHub] spark pull request #20292: [SPARK-23129][CORE] Make deserializeStream of Dis...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20292#discussion_r163146577 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala ---
```
@@ -463,21 +463,21 @@ class ExternalAppendOnlyMap[K, V, C](

     // An intermediate stream that reads from exactly one batch
     // This guards against pre-fetching and other arbitrary behavior of higher level streams
-    private var deserializeStream = nextBatchStream()
+    private var deserializeStream: Option[DeserializationStream] = None
```
--- End diff -- Thanks @jiangxb1987 But this may still initialize deserializeStream when spilling generates the DiskMapIterator
[GitHub] spark issue #20292: [SPARK-23129][CORE] Make deserializeStream of DiskMapIte...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20292 Ping @jiangxb1987 Could you help review this? Thanks so much!
[GitHub] spark pull request #20292: [SPARK-23129][CORE] Make deserializeStream of Dis...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20292#discussion_r162235896 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala ---
```
@@ -509,8 +509,8 @@ class ExternalAppendOnlyMap[K, V, C](
      */
     private def readNextItem(): (K, C) = {
       try {
-        val k = deserializeStream.readKey().asInstanceOf[K]
-        val c = deserializeStream.readValue().asInstanceOf[C]
+        val k = deserializeStream.get.readKey().asInstanceOf[K]
```
--- End diff -- Here it may throw NoSuchElementException. I will fix this. Thanks @jiangxb1987
[GitHub] spark pull request #20292: [SPARK-23129][CORE] Make deserializeStream of Dis...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20292#discussion_r162235297 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala ---
```
@@ -509,8 +509,8 @@ class ExternalAppendOnlyMap[K, V, C](
      */
     private def readNextItem(): (K, C) = {
       try {
-        val k = deserializeStream.readKey().asInstanceOf[K]
-        val c = deserializeStream.readValue().asInstanceOf[C]
+        val k = deserializeStream.get.readKey().asInstanceOf[K]
```
--- End diff -- @jiangxb1987 I think this does not change the original semantics, since `cleanup` is only called when `EOFException` is thrown. Have I missed something?
[GitHub] spark pull request #20292: [SPARK-23129][CORE] Make deserializeStream of Dis...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/20292 [SPARK-23129][CORE] Make deserializeStream of DiskMapIterator init lazily ## What changes were proposed in this pull request? Currently, the deserializeStream in ExternalAppendOnlyMap#DiskMapIterator is initialized when the DiskMapIterator instance is created. This causes memory overhead when ExternalAppendOnlyMap spills too many times. We can avoid this by initializing deserializeStream the first time it is used. This patch makes deserializeStream initialize lazily. ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/lay-diskmapiterator Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20292.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20292 commit d2bbbe1677202ae73046f12573d96a07e3deeb31 Author: zhoukang Date: 2018-01-17T09:33:07Z [SPARK][CORE] Make deserializeStream of DiskMapIterator init lazily
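A self-contained sketch of the lazy-initialization pattern the patch applies to `DiskMapIterator.deserializeStream` (simplified: the real code wraps a `DeserializationStream`, not a plain iterator):
```
// The underlying stream is opened on first use instead of at construction
// time, so spilled-but-not-yet-read iterators hold no stream resources.
class LazyStreamIterator[T](open: () => Iterator[T]) extends Iterator[T] {
  private var underlying: Option[Iterator[T]] = None

  private def stream: Iterator[T] = underlying.getOrElse {
    val s = open()
    underlying = Some(s)
    s
  }

  override def hasNext: Boolean = stream.hasNext
  override def next(): T = stream.next()
}
```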
[GitHub] spark pull request #19764: [SPARK-22539][SQL] Add second order for rangepart...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/19764
[GitHub] spark issue #19764: [SPARK-22539][SQL] Add second order for rangepartitioner...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19764 Thanks @hvanhovell. IIUC, you mean that adding a second ordering will break the contract of the range partitioner generated by `ShuffleExchange`? If so, I agree with that opinion. And sorry, I can't figure out why `select * from tbl_x order by col1` would get an invalid result? Maybe I missed some other knowledge. Thanks again!
[GitHub] spark issue #19764: [SPARK-22539][SQL] Add second order for rangepartitioner...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19764 Ping @srowen Can you help review this? Thanks so much
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19942#discussion_r156697267 --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
```
@@ -564,6 +564,14 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria
     val encryptionEnabled = get(NETWORK_ENCRYPTION_ENABLED) || get(SASL_ENCRYPTION_ENABLED)
     require(!encryptionEnabled || get(NETWORK_AUTH_ENABLED),
       s"${NETWORK_AUTH_ENABLED.key} must be enabled when enabling encryption.")
+
+    val executorTimeoutThreshold = getTimeAsSeconds("spark.network.timeout", "120s")
+    val executorHeartbeatInterval = getTimeAsSeconds("spark.executor.heartbeatInterval", "10s")
+    // If spark.executor.heartbeatInterval bigger than spark.network.timeout,
+    // it will almost always cause ExecutorLostFailure. See SPARK-22754.
+    require(executorTimeoutThreshold > executorHeartbeatInterval, "The value of " +
+      s"'spark.network.timeout=${executorTimeoutThreshold}' must be no less than the value of " +
```
--- End diff -- Agree with you @srowen Does the new message make sense?
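For illustration, the invariant this check enforces, restated as a self-contained snippet (the constants stand in for the parsed config values; the message wording follows the reviewed diff, not necessarily the merged text):
```
// An executor must be able to send several heartbeats before the network
// timeout can expire, otherwise ExecutorLostFailure is almost guaranteed.
val executorTimeoutThresholdSec = 120L // spark.network.timeout default
val executorHeartbeatIntervalSec = 10L // spark.executor.heartbeatInterval default
require(executorTimeoutThresholdSec > executorHeartbeatIntervalSec,
  "The value of 'spark.network.timeout' must be greater than the value of " +
  "'spark.executor.heartbeatInterval'.")
```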
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19942#discussion_r156627509 --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
```
@@ -564,6 +564,15 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria
     val encryptionEnabled = get(NETWORK_ENCRYPTION_ENABLED) || get(SASL_ENCRYPTION_ENABLED)
     require(!encryptionEnabled || get(NETWORK_AUTH_ENABLED),
       s"${NETWORK_AUTH_ENABLED.key} must be enabled when enabling encryption.")
+
+    val executorTimeoutThreshold = Utils.timeStringAsSeconds(get("spark.network.timeout", "120s"))
```
--- End diff -- Done
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19942#discussion_r156627486

--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -564,6 +564,15 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria
     val encryptionEnabled = get(NETWORK_ENCRYPTION_ENABLED) || get(SASL_ENCRYPTION_ENABLED)
     require(!encryptionEnabled || get(NETWORK_AUTH_ENABLED),
       s"${NETWORK_AUTH_ENABLED.key} must be enabled when enabling encryption.")
+
+    val executorTimeoutThreshold = Utils.timeStringAsSeconds(get("spark.network.timeout", "120s"))
+    val executorHeartbeatInterval = Utils.timeStringAsSeconds(
+      get("spark.executor.heartbeatInterval", "10s"))
+    // If spark.executor.heartbeatInterval bigger than spark.network.timeout,
+    // it will almost always cause ExecutorLostFailure.See SPARK-22754.
--- End diff --

Done
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19942#discussion_r156272847

--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -564,6 +564,13 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria
     val encryptionEnabled = get(NETWORK_ENCRYPTION_ENABLED) || get(SASL_ENCRYPTION_ENABLED)
     require(!encryptionEnabled || get(NETWORK_AUTH_ENABLED),
       s"${NETWORK_AUTH_ENABLED.key} must be enabled when enabling encryption.")
+
+    val executorTimeoutThreshold = Utils.timeStringAsSeconds(get("spark.network.timeout", "120s"))
+    val executorHeartbeatInterval = Utils.timeStringAsSeconds(
+      get("spark.executor.heartbeatInterval", "10s"))
+    require(executorHeartbeatInterval > executorTimeoutThreshold, s"The value of" +
--- End diff --

Done
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19942#discussion_r156272829

--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -564,6 +564,13 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria
     val encryptionEnabled = get(NETWORK_ENCRYPTION_ENABLED) || get(SASL_ENCRYPTION_ENABLED)
     require(!encryptionEnabled || get(NETWORK_AUTH_ENABLED),
       s"${NETWORK_AUTH_ENABLED.key} must be enabled when enabling encryption.")
+
+    val executorTimeoutThreshold = Utils.timeStringAsSeconds(get("spark.network.timeout", "120s"))
+    val executorHeartbeatInterval = Utils.timeStringAsSeconds(
+      get("spark.executor.heartbeatInterval", "10s"))
+    require(executorHeartbeatInterval > executorTimeoutThreshold, s"The value of" +
+      s"spark.network.timeout' must be no less than the value of" +
--- End diff --

Done
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19942#discussion_r156272812

--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -564,6 +564,13 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria
     val encryptionEnabled = get(NETWORK_ENCRYPTION_ENABLED) || get(SASL_ENCRYPTION_ENABLED)
     require(!encryptionEnabled || get(NETWORK_AUTH_ENABLED),
       s"${NETWORK_AUTH_ENABLED.key} must be enabled when enabling encryption.")
+
+    val executorTimeoutThreshold = Utils.timeStringAsSeconds(get("spark.network.timeout", "120s"))
+    val executorHeartbeatInterval = Utils.timeStringAsSeconds(
+      get("spark.executor.heartbeatInterval", "10s"))
+    require(executorHeartbeatInterval > executorTimeoutThreshold, s"The value of" +
+      s"spark.network.timeout' must be no less than the value of" +
+      s" 'spark.executor.heartbeatInterval'.")
--- End diff --

Done
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19942#discussion_r156270767

--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -564,6 +564,13 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria
     val encryptionEnabled = get(NETWORK_ENCRYPTION_ENABLED) || get(SASL_ENCRYPTION_ENABLED)
     require(!encryptionEnabled || get(NETWORK_AUTH_ENABLED),
       s"${NETWORK_AUTH_ENABLED.key} must be enabled when enabling encryption.")
+
+    val executorTimeoutThreshold = Utils.timeStringAsSeconds(get("spark.network.timeout", "120s"))
+    val executorHeartbeatInterval = Utils.timeStringAsSeconds(
+      get("spark.executor.heartbeatInterval", "10s"))
+    require(executorHeartbeatInterval > executorTimeoutThreshold, s"The value of" +
+      s"spark.network.timeout' must be no less than the value of" +
+      s" 'spark.executor.heartbeatInterval'.")
--- End diff --

Sorry about that, I will check more carefully!
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19942#discussion_r156268382

--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -564,6 +564,18 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria
     val encryptionEnabled = get(NETWORK_ENCRYPTION_ENABLED) || get(SASL_ENCRYPTION_ENABLED)
     require(!encryptionEnabled || get(NETWORK_AUTH_ENABLED),
       s"${NETWORK_AUTH_ENABLED.key} must be enabled when enabling encryption.")
+
+    val executorTimeoutThreshold = Utils.timeStringAsSeconds(get("spark.network.timeout", "120s"))
+    val executorHeartbeatInterval = Utils.timeStringAsSeconds(
+      get("spark.executor.heartbeatInterval", "10s"))
+    if (executorHeartbeatInterval > executorTimeoutThreshold) {
--- End diff --

Done, @jiangxb1987. Thanks!
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19942#discussion_r156264833

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -291,6 +291,18 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
     if (proxyUser != null && principal != null) {
       SparkSubmit.printErrorAndExit("Only one of --proxy-user or --principal can be provided.")
     }
+
+    val executorTimeoutThreshold = Utils.timeStringAsSeconds(
+      sparkProperties.getOrElse("spark.network.timeout", "120s"))
+    val executorHeartbeatInterval = Utils.timeStringAsSeconds(
+      sparkProperties.getOrElse("spark.executor.heartbeatInterval", "10s"))
--- End diff --

Done, @vanzin. Thanks!
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19942#discussion_r156250974

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -291,6 +291,18 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
     if (proxyUser != null && principal != null) {
       SparkSubmit.printErrorAndExit("Only one of --proxy-user or --principal can be provided.")
     }
+
+    val executorTimeoutThreshold = Utils.timeStringAsSeconds(
+      sparkProperties.getOrElse("spark.network.timeout", "120s"))
+    val executorHeartbeatInterval = Utils.timeStringAsSeconds(
+      sparkProperties.getOrElse("spark.executor.heartbeatInterval", "10s"))
--- End diff --

I thought about adding a config constant, but it would affect a lot of other code, so I simply changed it here. @srowen @vanzin
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19942#discussion_r156105135

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -291,6 +291,18 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
     if (proxyUser != null && principal != null) {
       SparkSubmit.printErrorAndExit("Only one of --proxy-user or --principal can be provided.")
     }
+
+    val executorTimeoutThreshold = Utils.timeStringAsSeconds(
+      sparkProperties.getOrElse("spark.network.timeout", "120s"))
+    val executorHeartbeatInterval = Utils.timeStringAsSeconds(
+      sparkProperties.getOrElse("spark.executor.heartbeatInterval", "10s"))
--- End diff --

Someone may set spark.executor.heartbeatInterval without setting spark.network.timeout, so I read both with their default values. Does this make sense? We may not need the check when neither is set. Thanks!
[GitHub] spark pull request #19942: [SPARK-22754][DEPLOY] Check whether spark.executo...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/19942

[SPARK-22754][DEPLOY] Check whether spark.executor.heartbeatInterval is bigger than spark.network.timeout or not

## What changes were proposed in this pull request?

If spark.executor.heartbeatInterval is bigger than spark.network.timeout, it will almost always cause the exception below.

`Job aborted due to stage failure: Task 4763 in stage 3.0 failed 4 times, most recent failure: Lost task 4763.3 in stage 3.0 (TID 22383, executor id: 4761, host: c3-hadoop-prc-st2966.bj): ExecutorLostFailure (executor 4761 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 154022 ms`

Many users do not realize this and set spark.executor.heartbeatInterval incorrectly. This patch checks for this case when applications are submitted.

## How was this patch tested?

Tested in a cluster.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/caneGuy/spark zhoukang/check-heartbeat

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19942.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19942

commit 891a092ac3a95f32cb2f9e1c215b1c8324c98971
Author: zhoukang
Date: 2017-12-11T12:48:32Z

    [SPARK][DEPLOY] Check whether spark.executor.heartbeatInterval bigger than spark.network.timeout or not
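As a standalone illustration, the guard boils down to comparing the two durations at validation time. This is a hedged sketch with hypothetical names (`HeartbeatCheck`, the simplified `seconds` parser), not the patch's actual code, which uses Spark's own Utils.timeStringAsSeconds:

```scala
// Hypothetical sketch of the validation idea in plain Scala.
object HeartbeatCheck {
  // Stand-in for a real duration parser; handles only the "<n>s" form.
  def seconds(s: String): Long = s.stripSuffix("s").toLong

  def validate(conf: Map[String, String]): Unit = {
    // Read both values with their defaults, so the check also fires when
    // only one of the two settings was configured explicitly.
    val timeout = seconds(conf.getOrElse("spark.network.timeout", "120s"))
    val heartbeat = seconds(conf.getOrElse("spark.executor.heartbeatInterval", "10s"))
    require(timeout > heartbeat,
      s"spark.network.timeout ($timeout s) must be greater than " +
      s"spark.executor.heartbeatInterval ($heartbeat s); see SPARK-22754.")
  }
}
```

Failing fast at submit time turns a confusing runtime ExecutorLostFailure into an immediate, self-explanatory configuration error.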
[GitHub] spark issue #19764: [SPARK-22539][SQL] Add second order for rangepartitioner...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19764

Ping
[GitHub] spark issue #19764: [SPARK-22539][SQL] Add second order for rangepartitioner...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19764

@gczsjdy Added a simple example.
[GitHub] spark issue #19764: [SPARK-22539][SQL] Add second order for rangepartitioner...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19764

@gczsjdy OK, I will post it later.
[GitHub] spark issue #19764: [SPARK-22539][SQL] Add second order for rangepartitioner...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19764

Ping. Could any admin help review this? Thanks.
[GitHub] spark issue #19764: [SPARK-22539][SQL] Add second order for rangepartitioner...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19764

First, thanks very much @hvanhovell, and sorry for the late reply; I had some other things to handle recently. On your question: I think the ordering will not be broken. I did not modify `RangePartitioner` itself, but optimized the partitioning strategy passed from SQL. This does not break the `Partitioner` contract, since the same key still maps to the same partition. Note that the key here includes the attribute added by the second order.
[GitHub] spark pull request #19764: [SPARK-22539][SQL] Add second order for rangepart...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19764#discussion_r151375208

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala ---
@@ -193,14 +193,30 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR
 /**
  * A lazily generated row ordering comparator.
  */
-class LazilyGeneratedOrdering(val ordering: Seq[SortOrder])
+class LazilyGeneratedOrdering(
+    val ordering: Seq[SortOrder], secondOrdering: => Seq[SortOrder] = Seq.empty)
   extends Ordering[InternalRow] with KryoSerializable {

   def this(ordering: Seq[SortOrder], inputSchema: Seq[Attribute]) =
     this(ordering.map(BindReferences.bindReference(_, inputSchema)))

   @transient
-  private[this] var generatedOrdering = GenerateOrdering.generate(ordering)
+  private[this] var generatedOrdering = {
+    var generatedOrdering = GenerateOrdering.generate(ordering)
+    if (!secondOrdering.isEmpty) {
+      secondOrdering.foreach { order =>
+        try {
+          GenerateOrdering.generate(Seq(order))
+          ordering ++ Seq(order)
--- End diff --

Updated, @srowen. Thanks.
[GitHub] spark pull request #19764: [SPARK-22539][SQL] Add second order for rangepart...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19764#discussion_r151356810

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala ---
@@ -193,14 +193,30 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR
 /**
  * A lazily generated row ordering comparator.
  */
-class LazilyGeneratedOrdering(val ordering: Seq[SortOrder])
+class LazilyGeneratedOrdering(
+    val ordering: Seq[SortOrder], secondOrdering: => Seq[SortOrder] = Seq.empty)
   extends Ordering[InternalRow] with KryoSerializable {

   def this(ordering: Seq[SortOrder], inputSchema: Seq[Attribute]) =
     this(ordering.map(BindReferences.bindReference(_, inputSchema)))

   @transient
-  private[this] var generatedOrdering = GenerateOrdering.generate(ordering)
+  private[this] var generatedOrdering = {
+    var generatedOrdering = GenerateOrdering.generate(ordering)
+    if (!secondOrdering.isEmpty) {
+      secondOrdering.foreach { order =>
--- End diff --

The logic here filters out input orders that cannot generate an ordering normally.
[GitHub] spark pull request #19764: [SPARK-22539][SQL] Add second order for rangepart...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/19764

[SPARK-22539][SQL] Add second order for rangepartitioner since partition number may be small if the specified key is skewed

## What changes were proposed in this pull request?

The range partitioner generated by shuffle exchange may cause partition skew if the sort key is skewed. This patch adds a second order to the range partitioner to avoid this situation. This is an improvement drawn from a real-world case.

## How was this patch tested?

Manual test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/caneGuy/spark zhoukang/add-secondorder

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19764.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19764

commit 29d2c869ffc6dd4d1a3cf7606cde5d03e72fa171
Author: zhoukang
Date: 2017-11-15T07:24:59Z

    [SPARK][SQL] Add second order for rangepartitioner since partition number may be small if the specified key is skewed
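To see why a skewed sort key defeats range partitioning and how a secondary order helps, here is a user-level demo that reproduces the motivation. It is hypothetical (the names `SecondOrderDemo`, the 90%-zero key distribution, and the output path are invented) and illustrates the effect by adding a second sort column by hand rather than showing the patch's internal change:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// ~90% of rows share key 0, so `orderBy($"key")` alone gives the range
// partitioner few distinct sample values and one hot partition. Sorting by
// (key, payload) lets the sampled range boundaries fall *inside* the hot key,
// spreading it over several partitions, while the result stays ordered by key.
object SecondOrderDemo extends App {
  val spark = SparkSession.builder().master("local[4]").appName("demo").getOrCreate()
  import spark.implicits._

  val df = spark.range(0, 100000)
    .select(when($"id" % 10 === 0, $"id").otherwise(lit(0L)).as("key"),
            $"id".as("payload"))

  val skewed = df.orderBy($"key")              // most rows land in one partition
  val spread = df.orderBy($"key", $"payload")  // ties split across partitions

  // Print per-partition row counts to compare the two layouts.
  println(skewed.rdd.glom().map(_.length).collect().mkString(","))
  println(spread.rdd.glom().map(_.length).collect().mkString(","))
}
```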
[GitHub] spark issue #19625: [SPARK-22407][WEB-UI] Add rdd id column on storage page ...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19625

@HyukjinKwon @jiangxb1987 @srowen Any more problems? Thanks.
[GitHub] spark pull request #19625: [SPARK-22407][WEB-UI] Add rdd id column on storag...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19625#discussion_r148427237

--- Diff: core/src/main/scala/org/apache/spark/ui/storage/StoragePage.scala ---
@@ -49,6 +49,7 @@ private[ui] class StoragePage(parent: StorageTab) extends WebUIPage("") {

   /** Header fields for the RDD table */
   private val rddHeader = Seq(
+    "RDD ID",
--- End diff --

Thanks @srowen. I will update right now.
[GitHub] spark pull request #19625: [SPARK-22407][WEB-UI] Add rdd id column on storag...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/19625

[SPARK-22407][WEB-UI] Add rdd id column on storage page to speed up navigating

## What changes were proposed in this pull request?

Add an RDD ID column on the storage page to speed up navigating. An example is attached to [SPARK-22407](https://issues.apache.org/jira/browse/SPARK-22407).

## How was this patch tested?

Existing unit tests, plus manually deploying a history server for testing.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/caneGuy/spark zhoukang/add-rddid

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19625.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19625

commit fdb13985e5d42a7d1090cc6438c4c34f7b75a7e7
Author: zhoukang
Date: 2017-11-01T12:05:49Z

    Add rdd id on storage page to speed up navigating
[GitHub] spark issue #19482: [SPARK-22264][DEPLOY] Add timeout for eventlog replaying...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19482

OK, thanks @vanzin.
[GitHub] spark pull request #19482: [SPARK-22264][DEPLOY] Add timeout for eventlog re...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/19482
[GitHub] spark issue #19482: [SPARK-22264][DEPLOY] Add timeout for eventlog replaying...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19482

Could you help review this, @vanzin? Thanks.
[GitHub] spark pull request #19482: [SPARK-22264][DEPLOY] Add timeout for eventlog re...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/19482

[SPARK-22264][DEPLOY] Add timeout for eventlog replaying to avoid time-consuming replaying which will cause historyserver unavailable

## What changes were proposed in this pull request?

The history server becomes unavailable if there is an event log file so large that replaying it takes too long. We can fix this by adding a timeout for event log replaying. More details are attached in [SPARK-22264](https://issues.apache.org/jira/browse/SPARK-22264).

## How was this patch tested?

Existing unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/caneGuy/spark zhoukang/replay-timeout

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19482.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19482

commit a2423dda5a90741691926a40d99b085795e81fbb
Author: zhoukang
Date: 2017-10-12T09:47:32Z

    [SPARK][DEPLOY] Add timeout for eventlog replaying to avoid time-consuming replaying which will cause historyserver unavailable
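One common way to implement such a timeout is to run the replay on a separate thread and stop waiting after a deadline. This is a minimal sketch of that idea (the object name, the `doReplay` stand-in, and the thread-pool choice are assumptions, not the actual patch):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future, TimeoutException}
import scala.concurrent.duration._

// Hypothetical sketch: run the (potentially huge) replay off the caller's
// thread and stop waiting after `timeout`, so one oversized event log cannot
// block the history server's listing loop. Note that Await only stops
// *waiting*; a production version would also interrupt the worker thread.
object ReplayWithTimeout {
  private implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newCachedThreadPool())

  // `doReplay` is a stand-in for the event-log replaying work.
  def replay(doReplay: () => Unit, timeout: FiniteDuration): Boolean =
    try {
      Await.result(Future(doReplay()), timeout)
      true
    } catch {
      case _: TimeoutException => false // caller skips this log and moves on
    }
}
```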
[GitHub] spark pull request #19399: [SPARK-22175][WEB-UI] Add status column to histor...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/19399
[GitHub] spark issue #19399: [SPARK-22175][WEB-UI] Add status column to history page
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19399

All right, thanks for the comments; I agree with you. I will close this one and try other solutions. @jerryshao @ajbozarth @vanzin
[GitHub] spark issue #19399: [SPARK-22175][WEB-UI] Add status column to history page
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19399

OK, I will wait for SPARK-18085 and think about the log status more carefully. @squito @ajbozarth Thanks.
[GitHub] spark pull request #19399: [SPARK-22175][WEB-UI] Add status column to histor...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19399#discussion_r143114938

--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -850,6 +869,18 @@ private[history] class AppListingListener(log: FileStatus, clock: Clock) extends
       fileSize)
   }
+
+  def applicationStatus : Option[String] = {
+    if (startTime.getTime == -1) {
+      Some("")
+    } else if (endTime.getTime == -1) {
+      Some("")
+    } else if (jobToStatus.isEmpty || jobToStatus.exists(_._2 != "Succeeded")) {
--- End diff --

Yes, I agree with you @squito. I actually thought about computing an accurate status, but finally chose to use the existing events to do this.