[jira] [Commented] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095124#comment-17095124 ]

angerszhu commented on SPARK-31602:
-----------------------------------

cc [~cloud_fan]

> memory leak of JobConf
> ----------------------
>
>                 Key: SPARK-31602
>                 URL: https://issues.apache.org/jira/browse/SPARK-31602
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: angerszhu
>            Priority: Major
>         Attachments: image-2020-04-29-14-34-39-496.png, image-2020-04-29-14-35-55-986.png
>
> !image-2020-04-29-14-34-39-496.png!
> !image-2020-04-29-14-35-55-986.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31602:
------------------------------
    Description: 
!image-2020-04-29-14-34-39-496.png!
!image-2020-04-29-14-35-55-986.png!

  was:
!image-2020-04-29-14-34-39-496.png!
Screen Shot 2020-04-29 at 2.08.28 PM
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31602:
------------------------------
    Description: 
!image-2020-04-29-14-34-39-496.png!
Screen Shot 2020-04-29 at 2.08.28 PM

  was:
!image-2020-04-29-14-30-46-213.png!
!image-2020-04-29-14-30-55-964.png!
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31602:
------------------------------
    Attachment: image-2020-04-29-14-35-55-986.png
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31602:
------------------------------
    Attachment: image-2020-04-29-14-34-39-496.png
[jira] [Commented] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095122#comment-17095122 ]

angerszhu commented on SPARK-31602:
-----------------------------------

In HadoopRDD, if you don't set spark.hadoop.cloneConf=true, each new JobConf is put into the cached metadata map and is never removed. Maybe we should add a clear method?

{code:java}
// Returns a JobConf that will be used on slaves to obtain input splits for Hadoop reads.
protected def getJobConf(): JobConf = {
  val conf: Configuration = broadcastedConf.value.value
  if (shouldCloneJobConf) {
    // Hadoop Configuration objects are not thread-safe, which may lead to various problems if
    // one job modifies a configuration while another reads it (SPARK-2546). This problem occurs
    // somewhat rarely because most jobs treat the configuration as though it's immutable. One
    // solution, implemented here, is to clone the Configuration object. Unfortunately, this
    // clone can be very expensive. To avoid unexpected performance regressions for workloads and
    // Hadoop versions that do not suffer from these thread-safety issues, this cloning is
    // disabled by default.
    HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
      logDebug("Cloning Hadoop Configuration")
      val newJobConf = new JobConf(conf)
      if (!conf.isInstanceOf[JobConf]) {
        initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
      }
      newJobConf
    }
  } else {
    if (conf.isInstanceOf[JobConf]) {
      logDebug("Re-using user-broadcasted JobConf")
      conf.asInstanceOf[JobConf]
    } else {
      Option(HadoopRDD.getCachedMetadata(jobConfCacheKey))
        .map { conf =>
          logDebug("Re-using cached JobConf")
          conf.asInstanceOf[JobConf]
        }
        .getOrElse {
          // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in
          // the local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
          // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary
          // objects. Synchronize to prevent ConcurrentModificationException (SPARK-1097,
          // HADOOP-10456).
          HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
            logDebug("Creating new JobConf and caching it for later re-use")
            val newJobConf = new JobConf(conf)
            initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
            HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
            newJobConf
          }
        }
    }
  }
}
{code}

There is no removal path for this cached job metadata:

{code:java}
/**
 * The three methods below are helpers for accessing the local map, a property of the SparkEnv of
 * the local process.
 */
def getCachedMetadata(key: String): Any = SparkEnv.get.hadoopJobMetadata.get(key)

private def putCachedMetadata(key: String, value: Any): Unit =
  SparkEnv.get.hadoopJobMetadata.put(key, value)
{code}

For SQL over Hive data, each partition generates one JobConf, which is heavy.
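One possible direction for the "clear method" suggested above is to bound the metadata map with LRU eviction instead of letting it grow forever. The sketch below is illustrative only, not Spark's actual implementation; the class name and the bound are hypothetical:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (not Spark code): an LRU-bounded metadata cache, so
// cached JobConf-like values are evicted instead of accumulating for the
// lifetime of the process.
public class BoundedMetadataCache {
    private final Map<String, Object> cache;

    public BoundedMetadataCache(final int maxEntries) {
        this.cache = Collections.synchronizedMap(
            new LinkedHashMap<String, Object>(16, 0.75f, /* accessOrder= */ true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                    // Evict the least-recently-used entry once the bound is exceeded.
                    return size() > maxEntries;
                }
            });
    }

    public Object get(String key) { return cache.get(key); }

    public void put(String key, Object value) { cache.put(key, value); }

    public int size() { return cache.size(); }

    // The explicit "clear method" the comment above asks for.
    public void clear() { cache.clear(); }
}
```

With a bound of 2, inserting a third key evicts the least-recently-used one, so old JobConfs would no longer be retained indefinitely.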
[jira] [Created] (SPARK-31602) memory leak of JobConf
angerszhu created SPARK-31602:
---------------------------------

             Summary: memory leak of JobConf
                 Key: SPARK-31602
                 URL: https://issues.apache.org/jira/browse/SPARK-31602
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: angerszhu

!image-2020-04-29-14-30-46-213.png!
!image-2020-04-29-14-30-55-964.png!
[jira] [Issue Comment Deleted] (SPARK-31334) Use agg column in Having clause behaves differently with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31334:
------------------------------
    Comment: was deleted

(was: cc [~cloud_fan] [~yumwang] )

> Use agg column in Having clause behaves differently with column type
> ---------------------------------------------------------------------
>
>                 Key: SPARK-31334
>                 URL: https://issues.apache.org/jira/browse/SPARK-31334
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: angerszhu
>            Priority: Major
>
> {code:java}
> test("") {
>   Seq(
>     (1, 3), (2, 3), (3, 6), (4, 7), (5, 9), (6, 9)
>   ).toDF("a", "b").createOrReplaceTempView("testData")
>   val x = sql(
>     """
>       | SELECT b, sum(a) as a
>       | FROM testData
>       | GROUP BY b
>       | HAVING sum(a) > 3
>     """.stripMargin)
>   x.explain()
>   x.show()
> }
>
> [info] - *** FAILED *** (508 milliseconds)
> [info] org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. Attribute(s) with the same name appear in the operation: a. Please check if the right attribute(s) are used.;;
> [info] Project [b#181, a#184]
> [info] +- Filter (sum(a#184)#188 > cast(3 as double))
> [info]    +- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]
> [info]       +- SubqueryAlias `testdata`
> [info]          +- Project [_1#177 AS a#180, _2#178 AS b#181]
> [info]             +- LocalRelation [_1#177, _2#178]
> {code}
> {code:java}
> test("") {
>   Seq(
>     ("1", "3"), ("2", "3"), ("3", "6"), ("4", "7"), ("5", "9"), ("6", "9")
>   ).toDF("a", "b").createOrReplaceTempView("testData")
>   val x = sql(
>     """
>       | SELECT b, sum(a) as a
>       | FROM testData
>       | GROUP BY b
>       | HAVING sum(a) > 3
>     """.stripMargin)
>   x.explain()
>   x.show()
> }
>
> == Physical Plan ==
> *(2) Project [b#181, a#184L]
> +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 as bigint))#197L > 3))
>    +- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))])
>       +- Exchange hashpartitioning(b#181, 5)
>          +- *(1) HashAggregate(keys=[b#181], functions=[partial_sum(cast(a#180 as bigint))])
>             +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181]
>                +- LocalTableScan [_1#177, _2#178]
> {code}
> I spent a lot of time but could not find which analyzer rule causes the difference. When the column type is double, it fails.
[jira] [Commented] (SPARK-31334) Use agg column in Having clause behaves differently with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074381#comment-17074381 ]

angerszhu commented on SPARK-31334:
-----------------------------------

I have found the reason. When the logical plan

{code:java}
'Filter ('sum('a) > 3)
+- Aggregate [b#181], [b#181, sum(a#180) AS a#184L]
   +- SubqueryAlias `testdata`
      +- Project [_1#177 AS a#180, _2#178 AS b#181]
         +- LocalRelation [_1#177, _2#178]
{code}

comes into ResolveAggregateFunctions, `a` is of String type, so the aggregate expression is still unresolved and ResolveAggregateFunctions makes no change to the plan above. The `sum(a)` in the Filter condition is then resolved later by ResolveReferences, and its `a` is resolved against the aggregate's output column `a` rather than the input column; that is where the error happens.
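If the reading above is right, one possible workaround at the query level (an untested sketch, not a verified fix) is to avoid reusing the input column name as the aggregate alias, so the reference inside HAVING cannot collide with the aggregate's output attribute:

```sql
-- Untested workaround sketch: alias the aggregate to a name (sum_a) that does
-- not shadow the input column "a", and cast explicitly so both branches
-- resolve to the same type.
SELECT b, SUM(CAST(a AS DOUBLE)) AS sum_a
FROM testData
GROUP BY b
HAVING SUM(CAST(a AS DOUBLE)) > 3
```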
[jira] [Commented] (SPARK-31334) Use agg column in Having clause behaves differently with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074296#comment-17074296 ]

angerszhu commented on SPARK-31334:
-----------------------------------

cc [~cloud_fan] [~yumwang]
[jira] [Updated] (SPARK-31334) Use agg column in Having clause behaves differently with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31334:
------------------------------
    Description: 
{code:java}
test("") {
  Seq(
    (1, 3), (2, 3), (3, 6), (4, 7), (5, 9), (6, 9)
  ).toDF("a", "b").createOrReplaceTempView("testData")
  val x = sql(
    """
      | SELECT b, sum(a) as a
      | FROM testData
      | GROUP BY b
      | HAVING sum(a) > 3
    """.stripMargin)
  x.explain()
  x.show()
}

[info] - *** FAILED *** (508 milliseconds)
[info] org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. Attribute(s) with the same name appear in the operation: a. Please check if the right attribute(s) are used.;;
[info] Project [b#181, a#184]
[info] +- Filter (sum(a#184)#188 > cast(3 as double))
[info]    +- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]
[info]       +- SubqueryAlias `testdata`
[info]          +- Project [_1#177 AS a#180, _2#178 AS b#181]
[info]             +- LocalRelation [_1#177, _2#178]
{code}

{code:java}
test("") {
  Seq(
    ("1", "3"), ("2", "3"), ("3", "6"), ("4", "7"), ("5", "9"), ("6", "9")
  ).toDF("a", "b").createOrReplaceTempView("testData")
  val x = sql(
    """
      | SELECT b, sum(a) as a
      | FROM testData
      | GROUP BY b
      | HAVING sum(a) > 3
    """.stripMargin)
  x.explain()
  x.show()
}

== Physical Plan ==
*(2) Project [b#181, a#184L]
+- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 as bigint))#197L > 3))
   +- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))])
      +- Exchange hashpartitioning(b#181, 5)
         +- *(1) HashAggregate(keys=[b#181], functions=[partial_sum(cast(a#180 as bigint))])
            +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181]
               +- LocalTableScan [_1#177, _2#178]
{code}

I spent a lot of time but could not find which analyzer rule causes the difference. When the column type is double, it fails.
[jira] [Created] (SPARK-31334) Use agg column in Having clause behaves differently with column type
angerszhu created SPARK-31334:
---------------------------------

             Summary: Use agg column in Having clause behaves differently with column type
                 Key: SPARK-31334
                 URL: https://issues.apache.org/jira/browse/SPARK-31334
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0, 3.0.0
            Reporter: angerszhu
[jira] [Updated] (SPARK-31298) validate CTAS table path in SPARK-19724 seems to conflict and External table also needs to check non-empty
[ https://issues.apache.org/jira/browse/SPARK-31298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31298:
------------------------------
    Summary: validate CTAS table path in SPARK-19724 seems to conflict and External table also needs to check non-empty  (was: validate CTAS table path in SPARK-19724 seems to conflict)

> validate CTAS table path in SPARK-19724 seems to conflict and External table also needs to check non-empty
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31298
>                 URL: https://issues.apache.org/jira/browse/SPARK-31298
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: angerszhu
>            Priority: Major
>
> In SessionCatalog.validateTableLocation():
> {code:java}
> val tableLocation =
>   new Path(table.storage.locationUri.getOrElse(defaultTablePath(table.identifier)))
> {code}
> But in CreateDataSourceTableAsSelect, the table location uses defaultTablePath:
> {code:java}
> assert(table.schema.isEmpty)
> sparkSession.sessionState.catalog.validateTableLocation(table)
> val tableLocation = if (table.tableType == CatalogTableType.MANAGED) {
>   Some(sessionState.catalog.defaultTablePath(table.identifier))
> } else {
>   table.storage.locationUri
> }
> {code}
[jira] [Updated] (SPARK-31298) validate CTAS table path in SPARK-19724 seems to conflict
[ https://issues.apache.org/jira/browse/SPARK-31298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31298:
------------------------------
    Summary: validate CTAS table path in SPARK-19724 seems to conflict  (was: validate External path in SPARK-19724 seems to conflict)
[jira] [Updated] (SPARK-31298) validate External path in SPARK-19724 seems to conflict
[ https://issues.apache.org/jira/browse/SPARK-31298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31298:
------------------------------
    Description: 
In SessionCatalog.validateTableLocation():

{code:java}
val tableLocation =
  new Path(table.storage.locationUri.getOrElse(defaultTablePath(table.identifier)))
{code}

But in CreateDataSourceTableAsSelect, the table location uses defaultTablePath:

{code:java}
assert(table.schema.isEmpty)
sparkSession.sessionState.catalog.validateTableLocation(table)
val tableLocation = if (table.tableType == CatalogTableType.MANAGED) {
  Some(sessionState.catalog.defaultTablePath(table.identifier))
} else {
  table.storage.locationUri
}
{code}
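The conflict in the two snippets above can be shown with a toy model (an illustrative sketch with hypothetical names, not Spark code): for a MANAGED table whose catalog entry carries a locationUri, validation checks one path while CTAS writes to another.

```java
import java.util.Optional;

// Illustrative sketch (not Spark code): the path that gets *validated* vs.
// the path CTAS actually *uses* can differ for a MANAGED table whose catalog
// entry carries a locationUri.
public class CtasPathSketch {
    // Stand-in for SessionCatalog.defaultTablePath (hypothetical).
    static String defaultTablePath(String table) {
        return "/warehouse/" + table;
    }

    // Mirrors validateTableLocation(): locationUri, falling back to the default.
    static String validatedPath(Optional<String> locationUri, String table) {
        return locationUri.orElse(defaultTablePath(table));
    }

    // Mirrors CreateDataSourceTableAsSelect: MANAGED tables always use the default path.
    static String ctasPath(boolean managed, Optional<String> locationUri, String table) {
        return managed ? defaultTablePath(table)
                       : locationUri.orElse(defaultTablePath(table));
    }
}
```

With locationUri = /custom/t on a MANAGED table, validatedPath returns /custom/t while ctasPath returns /warehouse/t, i.e. the checked path and the written path disagree.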
[jira] [Created] (SPARK-31298) validate External path in SPARK-19724 seems conflict
angerszhu created SPARK-31298:
---------------------------------

             Summary: validate External path in SPARK-19724 seems to conflict
                 Key: SPARK-31298
                 URL: https://issues.apache.org/jira/browse/SPARK-31298
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: angerszhu
[jira] [Commented] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval
[ https://issues.apache.org/jira/browse/SPARK-31268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069939#comment-17069939 ]

angerszhu commented on SPARK-31268:
-----------------------------------

[https://github.com/apache/spark/pull/28034]

> TaskEnd event with zero Executor Metrics when task duration less than poll interval
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-31268
>                 URL: https://issues.apache.org/jira/browse/SPARK-31268
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: angerszhu
>            Priority: Major
>         Attachments: screenshot-1.png
>
> TaskEnd event with zero Executor Metrics when task duration less than poll interval
[jira] [Commented] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval
[ https://issues.apache.org/jira/browse/SPARK-31268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067661#comment-17067661 ]

angerszhu commented on SPARK-31268:
-----------------------------------

I will raise a PR soon.
[jira] [Commented] (SPARK-31270) Expose executor memory metrics at the task detail, in the Stages tab
[ https://issues.apache.org/jira/browse/SPARK-31270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067657#comment-17067657 ]

angerszhu commented on SPARK-31270:
-----------------------------------

I will raise a PR soon.

> Expose executor memory metrics at the task detail, in the Stages tab
> --------------------------------------------------------------------
>
>                 Key: SPARK-31270
>                 URL: https://issues.apache.org/jira/browse/SPARK-31270
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: angerszhu
>            Priority: Major
[jira] [Created] (SPARK-31270) Expose executor memory metrics at the task detail, in the Stages tab
angerszhu created SPARK-31270:
---------------------------------

             Summary: Expose executor memory metrics at the task detail, in the Stages tab
                 Key: SPARK-31270
                 URL: https://issues.apache.org/jira/browse/SPARK-31270
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: angerszhu
[jira] [Updated] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval
[ https://issues.apache.org/jira/browse/SPARK-31268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31268:
------------------------------
    Attachment: screenshot-1.png
[jira] [Created] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval
angerszhu created SPARK-31268:
---------------------------------

             Summary: TaskEnd event with zero Executor Metrics when task duration less than poll interval
                 Key: SPARK-31268
                 URL: https://issues.apache.org/jira/browse/SPARK-31268
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: angerszhu
[jira] [Updated] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval
[ https://issues.apache.org/jira/browse/SPARK-31268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31268:
------------------------------
    Description: The TaskEnd event carries zero Executor Metrics when the task duration is less than the poll interval.
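The symptom described above can be sketched abstractly (hypothetical names, not Spark's implementation): if executor metrics are only sampled once per poll interval, a task that finishes sooner is never sampled, so its TaskEnd metrics read zero unless an extra sample is taken at task end.

```java
// Illustrative sketch (not Spark code) of why short tasks report zero metrics:
// a task is only covered by the periodic poller if it lives at least one poll
// interval, or if an explicit sample is taken when the task ends.
public class MetricsPollSketch {
    static long peakAtTaskEnd(long taskDurationMs, long pollIntervalMs,
                              long observedPeak, boolean pollOnTaskEnd) {
        // The periodic poller samples the task only if it survives one full interval.
        boolean sampled = taskDurationMs >= pollIntervalMs || pollOnTaskEnd;
        // Never sampled => the TaskEnd event carries zero metrics.
        return sampled ? observedPeak : 0L;
    }
}
```

A 100 ms task under a 1000 ms poll interval reports 0 without a task-end poll, and its real peak with one; a 5000 ms task is covered either way.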
[jira] [Commented] (SPARK-26341) Expose executor memory metrics at the stage level, in the Stages tab
[ https://issues.apache.org/jira/browse/SPARK-26341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066572#comment-17066572 ] angerszhu commented on SPARK-26341: --- I have done this in our own version and will raise a PR in the next few days. > Expose executor memory metrics at the stage level, in the Stages tab > > > Key: SPARK-26341 > URL: https://issues.apache.org/jira/browse/SPARK-26341 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 2.4.0 >Reporter: Edward Lu >Priority: Major > > Sub-task SPARK-23431 will add stage level executor memory metrics (peak > values for each stage, and peak values for each executor for the stage). This > information should also be exposed in the web UI, so that users can see > which stages are memory intensive.
[jira] [Created] (SPARK-31226) SizeBasedCoalesce logic error
angerszhu created SPARK-31226: - Summary: SizeBasedCoalesce logic error Key: SPARK-31226 URL: https://issues.apache.org/jira/browse/SPARK-31226 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu
[jira] [Updated] (SPARK-31226) SizeBasedCoalesce logic error
[ https://issues.apache.org/jira/browse/SPARK-31226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31226: -- Description: In Spark's unit tests, SizeBasedCoalesce's logic is wrong > SizeBasedCoalesce logic error > - > > Key: SPARK-31226 > URL: https://issues.apache.org/jira/browse/SPARK-31226 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Minor > > In Spark's unit tests, > SizeBasedCoalesce's logic is wrong
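For context: size-based coalescing groups consecutive input partitions until a cumulative size target would be exceeded, and the usual ways such logic goes wrong are boundary bugs, e.g. dropping the trailing group or mishandling a single partition larger than the target. A hypothetical sketch of the intended grouping (this is not the `SizeBasedCoalesce` helper from the Spark UT; names and the greedy strategy are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

public class SizeBasedCoalesceSketch {
    // Greedily groups consecutive partition sizes so each group's total
    // stays at or under maxGroupSize. A partition larger than the limit
    // gets a group of its own rather than being dropped.
    public static List<List<Long>> coalesceBySize(long[] sizes, long maxGroupSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentTotal = 0L;
        for (long size : sizes) {
            if (!current.isEmpty() && currentTotal + size > maxGroupSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentTotal = 0L;
            }
            current.add(size);
            currentTotal += size;
        }
        if (!current.isEmpty()) {
            groups.add(current); // keep the trailing group -- a classic off-by-one bug is to lose it
        }
        return groups;
    }
}
```

For example, sizes {40, 40, 40, 100, 10} with a 100-byte target should yield four groups: {40, 40}, {40}, {100}, {10}.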
[jira] [Commented] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code
[ https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063909#comment-17063909 ] angerszhu commented on SPARK-27097: --- [~irashid] To be honest, I ran into this problem recently. [~dbtsai] I have a question. We run a self-developed thrift server program that uses Spark as the compute engine, with the JVM options below: {code:java} -Xmx64g -Djava.library.path=/home/hadoop/hadoop/lib/native -Djavax.security.auth.useSubjectCredsOnly=false -Dcom.sun.management.jmxremote.port=9021 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:MaxPermSize=1024m -XX:PermSize=256m -XX:MaxDirectMemorySize=8192m -XX:-TraceClassUnloading -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC -XX:+PrintTenuringDistribution -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 -Xnoclassgc -XX:+PrintGCDetails -XX:+PrintGCDateStamps {code} With these options, Platform.BYTE_ARRAY_OFFSET is 24; when we start a normal Spark thrift server, the value is 16. This mismatch caused strange data corruption. After a few days of investigation, I traced the problem to Spark *codegen*, and this PR fixes our problem, but I can't find evidence for why Platform.BYTE_ARRAY_OFFSET is 24 under the options above, since I tested locally that with -XX:+UseCompressedOops (pointer compression on) the offset is 16, and with -XX:-UseCompressedOops (pointer compression off) it is 24.
It is easy to understand why those two offsets differ, but I don't know why the options above yield 24, since I am no expert on the JVM internals involved. Can you give me some advice or pointers on how to understand and find the root cause? > Avoid embedding platform-dependent offsets literally in whole-stage generated > code > -- > > Key: SPARK-27097 > URL: https://issues.apache.org/jira/browse/SPARK-27097 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.3, 2.2.3, 2.3.4, 2.4.0 >Reporter: Xiao Li >Assignee: Kris Mok >Priority: Critical > Labels: correctness > Fix For: 2.4.1 > > > Avoid embedding platform-dependent offsets literally in whole-stage generated > code. > Spark SQL performs whole-stage code generation to speed up query execution. > There are two steps to it: > Java source code is generated from the physical query plan on the driver. A > single version of the source code is generated from a query plan, and sent to > all executors. > It's compiled to bytecode on the driver to catch compilation errors before > sending to executors, but currently only the generated source code gets sent > to the executors. The bytecode compilation is for fail-fast only. > Executors receive the generated source code and compile to bytecode, then the > query runs like a hand-written Java program. > In this model, there's an implicit assumption about the driver and executors > being run on similar platforms. Some code paths accidentally embedded > platform-dependent object layout information into the generated code, such as: > {code:java} > Platform.putLong(buffer, /* offset */ 24, /* value */ 1); > {code} > This code expects a field to be at offset +24 of the buffer object, and sets > a value to that field. > But whole-stage code generation generally uses platform-dependent information > from the driver.
If the object layout is significantly different on the > driver and executors, the generated code can be reading/writing to wrong > offsets on the executors, causing all kinds of data corruption. > One code pattern that leads to such problem is the use of Platform.XXX > constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET. > Bad: > {code:java} > val baseOffset = Platform.BYTE_ARRAY_OFFSET > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into > the generated code. > {code} > Good: > {code:java} > val baseOffset = "Platform.BYTE_ARRAY_OFFSET" > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > This will generate the offset symbolically -- Platform.putLong(buffer, >
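The Bad/Good contrast quoted above can be made concrete with the string templates themselves: the bad template bakes the driver's numeric offset into the generated source, while the good one emits the symbolic constant so each executor resolves its own value when it compiles the code. A simplified sketch (here `DRIVER_BYTE_ARRAY_OFFSET` is a stand-in for the driver-side `Platform.BYTE_ARRAY_OFFSET`; 16 is the typical value with compressed oops, as discussed in the comment above):

```java
public class CodegenOffsetSketch {
    // Stand-in for the driver's Platform.BYTE_ARRAY_OFFSET: 16 with
    // -XX:+UseCompressedOops, 24 without, on typical 64-bit HotSpot.
    static final long DRIVER_BYTE_ARRAY_OFFSET = 16L;

    // Bad: the driver's numeric value is frozen into the generated source,
    // which an executor with a different object layout will still run.
    public static String badTemplate(String buffer, String value) {
        return "Platform.putLong(" + buffer + ", " + DRIVER_BYTE_ARRAY_OFFSET
                + ", " + value + ");";
    }

    // Good: the offset is emitted symbolically and resolved on each executor.
    public static String goodTemplate(String buffer, String value) {
        return "Platform.putLong(" + buffer + ", Platform.BYTE_ARRAY_OFFSET, "
                + value + ");";
    }
}
```

With mismatched layouts, code produced by `badTemplate` reads and writes at the wrong offset on the executor; code from `goodTemplate` is layout-portable.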
[jira] [Commented] (SPARK-30707) Lead/Lag window function throws AnalysisException without ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-30707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055642#comment-17055642 ] angerszhu commented on SPARK-30707: --- Added a PR: [https://github.com/apache/spark/pull/27861] > Lead/Lag window function throws AnalysisException without ORDER BY clause > - > > Key: SPARK-30707 > URL: https://issues.apache.org/jira/browse/SPARK-30707 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > Lead/Lag window function throws AnalysisException without ORDER BY clause: > {code:java} > SELECT lead(ten, four + 1) OVER (PARTITION BY four), ten, four > FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s > org.apache.spark.sql.AnalysisException > Window function lead(ten#x, (four#x + 1), null) requires window to be > ordered, please add ORDER BY clause. For example SELECT lead(ten#x, (four#x + > 1), null)(value_expr) OVER (PARTITION BY window_partition ORDER BY > window_ordering) from table; > {code} > > Maybe we need to fix this issue.
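The AnalysisException message itself points at the workaround: give the window an ORDER BY clause. A hedged rewrite of the failing query above (ordering by {{ten}} is one reasonable choice for illustration; the issue itself is about whether Spark should require an ORDER BY here at all):

```sql
SELECT lead(ten, four + 1) OVER (PARTITION BY four ORDER BY ten), ten, four
FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten) s;
```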
[jira] [Commented] (SPARK-30707) Lead/Lag window function throws AnalysisException without ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-30707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1709#comment-1709 ] angerszhu commented on SPARK-30707: --- Our production environment hits this problem too when Hive SQL runs on the Spark engine. I am working on this and will raise a PR soon > Lead/Lag window function throws AnalysisException without ORDER BY clause > - > > Key: SPARK-30707 > URL: https://issues.apache.org/jira/browse/SPARK-30707 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > Lead/Lag window function throws AnalysisException without ORDER BY clause: > {code:java} > SELECT lead(ten, four + 1) OVER (PARTITION BY four), ten, four > FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s > org.apache.spark.sql.AnalysisException > Window function lead(ten#x, (four#x + 1), null) requires window to be > ordered, please add ORDER BY clause. For example SELECT lead(ten#x, (four#x + > 1), null)(value_expr) OVER (PARTITION BY window_partition ORDER BY > window_ordering) from table; > {code} > > Maybe we need to fix this issue.
[jira] [Created] (SPARK-30694) If an exception occurs while fetching blocks via ExternalBlockClient, fail early when the External Shuffle Service is not alive
angerszhu created SPARK-30694: - Summary: If an exception occurs while fetching blocks via ExternalBlockClient, fail early when the External Shuffle Service is not alive Key: SPARK-30694 URL: https://issues.apache.org/jira/browse/SPARK-30694 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: angerszhu
[jira] [Created] (SPARK-30538) A not very elegant way to control output of small files
angerszhu created SPARK-30538: - Summary: A not very elegant way to control output of small files Key: SPARK-30538 URL: https://issues.apache.org/jira/browse/SPARK-30538 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu
[jira] [Created] (SPARK-30435) Update the Spark SQL guide's Supported Hive Features section
angerszhu created SPARK-30435: - Summary: Update the Spark SQL guide's Supported Hive Features section Key: SPARK-30435 URL: https://issues.apache.org/jira/browse/SPARK-30435 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 3.0.0 Reporter: angerszhu
[jira] [Updated] (SPARK-29800) Rewrite non-correlated EXISTS subquery to use ScalarSubquery to optimize perf
[ https://issues.apache.org/jira/browse/SPARK-29800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29800: -- Summary: Rewrite non-correlated EXISTS subquery to use ScalarSubquery to optimize perf (was: Rewrite non-correlated subquery to use ScalarSubquery to optimize perf) > Rewrite non-correlated EXISTS subquery to use ScalarSubquery to optimize perf > - > > Key: SPARK-29800 > URL: https://issues.apache.org/jira/browse/SPARK-29800 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major >
[jira] [Updated] (SPARK-29800) Rewrite non-correlated subquery to use ScalarSubquery to optimize perf
[ https://issues.apache.org/jira/browse/SPARK-29800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29800: -- Summary: Rewrite non-correlated subquery to use ScalarSubquery to optimize perf (was: Plan Exists 's subquery in PlanSubqueries) > Rewrite non-correlated subquery to use ScalarSubquery to optimize perf > -- > > Key: SPARK-29800 > URL: https://issues.apache.org/jira/browse/SPARK-29800 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major >
[jira] [Updated] (SPARK-29018) Build Spark thrift server on its own code based on protocol v11
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: h2. Background With the development of Spark and Hive, in the current sql/hive-thriftserver module we need to do a lot of work to resolve code conflicts between the different built-in Hive versions. It is annoying, unending work under the current approach, and these issues have limited our ability and convenience to develop new features for Spark's thrift server. We propose to implement a new thrift server and JDBC driver based on Hive's latest v11 TCLIService.thrift protocol. The new thrift server will have the features below: # Build a new module spark-service as Spark's thrift server # No need for as much reflection and inherited code as the `hive-thriftserver` module # Support all functions the current `sql/hive-thriftserver` supports # Use code maintained entirely by Spark itself, with no dependency on Hive # Support the original functions in Spark's own way, not limited by Hive's code # Support running with or without a Hive metastore # Support user impersonation via multi-tenant split Hive authentication and DFS authentication # Support session hooks with Spark's own code # Add a new JDBC driver spark-jdbc, with Spark's own connection url “jdbc:spark::/” # Support both hive-jdbc and spark-jdbc clients, so we can support most clients and BI platforms h2. How to start? We can start this new thrift server with the shell script *sbin/start-spark-thriftserver.sh* and stop it with *sbin/stop-spark-thriftserver.sh*. We don't need HiveConf's configurations to determine the characteristics of the Spark thrift server; we have implemented all needed configuration in Spark itself in `org.apache.spark.sql.service.internal.ServiceConf`, and hive-site.xml is only used to connect to the Hive metastore. We can write all the conf we need in *conf/spark-defaults.conf* or pass it in the startup command via *--conf* h2. How to connect through jdbc?
Now we support both hive-jdbc and spark-jdbc; users can choose whichever they like h3. spark-jdbc # Use `SparkDriver` as the JDBC driver class # Connection url `jdbc:spark://:,:/dbName;sess_var_list?conf_list#var_list`, mostly the same as Hive's but with Spark's special url prefix `jdbc:spark` # For proxying, SparkDriver users should set the proxy conf `spark.sql.thriftserver.proxy.user=username` h3. hive-jdbc # Use `HiveDriver` as the JDBC driver class # Connection string jdbc:hive2://:,:/dbName;sess_var_list?conf_list#var_list as before # For proxying, HiveDriver users should set the proxy conf hive.server2.proxy.user=username; the current server supports both configs h2. How is it done today, and what are the limits of current practice? h3. Current practice We have completed the two modules `spark-service` & `spark-jdbc`; they run well, we have ported the original UTs to them, and the UTs pass. For impersonation, we have written the code and tested it in our kerberized environment; it works well and awaits review. Now we will raise PRs to the apache/spark master branch step by step. h3. Here are some known changes: # No Hive code is used in the `spark-service` and `spark-jdbc` modules # In the current service, the default rc-file suffix `.hiverc` is replaced by `.sparkrc` # When using SparkDriver as the JDBC driver class, the url should be jdbc:spark://:,:/dbName;sess_var_list?conf_list#var_list # When using SparkDriver as the JDBC driver class, the proxy conf should be `spark.sql.thriftserver.proxy.user=proxy_user_name` # Support `hiveconf` and `hivevar` session conf through a hive-jdbc connection h2. What are the risks? This is a totally new module and won't change other modules' code except to support impersonation. Apart from impersonation, we have added many UTs adapted (to fit the grammar without Hive) from the original UTs, and all of them pass. For impersonation, I have tested it in our kerberized environment, but it still needs detailed review since it changes a lot. h2. How long will it take?
We have done all this work in our own repo; now we plan to merge our code into master step by step. # Phase1: PR to build the new module *spark-service* in folder *sql/service* # Phase2: PR with the thrift protocol and the generated thrift protocol Java code # Phase3: PR with all *spark-service* module code and a description of the design, plus Unit Tests # Phase4: PR to build the new module *spark-jdbc* in folder *sql/jdbc* # Phase5: PR with all *spark-jdbc* module code and Unit Tests # Phase6: PR to support thrift server impersonation # Phase7: PR to build Spark's own beeline client *spark-beeline* # Phase8: PR with Spark's own CLI client code to support the *Spark SQL CLI* module named *spark-cli* h3. Appendix A. Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. Compared to the current `sql/hive-thriftserver`, the corresponding API changes are as below: # Add a new
[jira] [Updated] (SPARK-29018) Build Spark thrift server on its own code based on protocol v11
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: SPIP:Build Spark thrift server based on thrift protocol v11 h2. Background With the development of Spark and Hive,in current sql/hive-thriftserver module, we need to do a lot of work to solve code conflicts for different built-in hive versions. It's an annoying and unending work in current ways. And these issues have limited our ability and convenience to develop new features for Spark’s thrift server. We suppose to implement a new thrift server and JDBC driver based on Hive’s latest v11 TCLService.thrift thrift protocol. Finally, the new thrift server have below feature: # Build new module spark-service as spark’s thrift server # Don't need as much reflection and inherited code as `hive-thriftser` modules # Support all functions current `sql/hive-thriftserver` support # Use all code maintained by spark itself, won’t depend on Hive # Support origin functions use spark’s own way, won't limited by Hive's code # Support running without hive metastore or with hive metastore # Support user impersonation by Multi-tenant splited hive authentication and DFS authentication # Support session hook for with spark’s own code # Add a new jdbc driver spark-jdbc, with spark’s own connection url “jdbc:spark::/” # Support both hive-jdbc and spark-jdbc client, then we can support most clients and BI platform h2. How to start? We can start this new thrift server by shell *sbin/start-spark-thriftserver.sh* and stop it by *sbin/stop-spark-thriftserver.sh*. Don’t need HiveConf ’s configurations to determine the characteristics of the current spark thrift server service, we have implemented all need configuration by spark itself in `org.apache.spark.sql.service.internal.ServiceConf`, hive-site.xml only used to connect to hive metastore. 
We can write all we needed conf in *conf/spark-default.conf* or in startup command *--conf* h2. How to connect through jdbc? Now we support both hive-jdbc and spark-jdbc, user can choose which one he likes h3. spark-jdbc # use `SparkDriver` as jdbc driver class # Connection url `jdbc:spark://:,:/dbName;sess_var_list?conf_list#var_list` most samse as hive but with spark’s special url prefix `jdbc:spark` # For proxy, use SparkDriver should set proxy conf `spark.sql.thriftserver.proxy.user=username` h3. hive-jdbc # use `HiveDriver` as jdbc driver class # connection str jdbc:hive2://:,:/dbName;sess_var_list?conf_list#var_list as origin # For proxy, use HiveDriver should set proxy conf hive.server2.proxy.user=username, current server support both config h2. How is it done today, and what are the limits of current practice? h3. Current practice We have completed two modules `spark-service` & `spark-jdbc` now, it can run well and we have add origin UT to it these two module and it can pass the UT, for impersonation, we have write the code and test it in our kerberized environment, it can work well and wait for review. Now we will raise pr to apace/spark master branch step by step. h3. Here are some known changes: # Don’t use any hive code in `spark-service` `spark-jdbc` module # In current service, default rcfile suffix `.hiverc` was replaced by `.sparkrc` # When use SparkDriver as jdbc driver class, url should use jdbc:spark://:,:/dbName;sess_var_list?conf_list#var_list # When use SparkDriver as jdbc driver class, proxy conf should be `spark.sql.thriftserver.proxy.user=proxy_user_name` # Support `hiveconf` `hivevar` session conf through hive-jdbc connection # h2. What are the risks? Totally new module, won’t change other module’s code except for supporting impersonation. Except impersonation, we have added a lot of UT changed (fit grammar without hive) from origin UT, and all pass it. 
For impersonation I have test it in our kerberized environment but still need detail review since change a lot. h2. How long will it take? We have done all these works in our own repo, now we plan merge our code into the master step by step. # Phase1 pr about build new module *spark-service* on folder *sql/service* # Phase2 pr thrift protocol and generated thrift protocol java code # Phase3 pr with all *spark-service* module code and description about design, also Unnit Test # Phase4 pr about build new module *spark-jdbc* on folder *sql/jdbc* # Phase5 pr with all *spark-jdbc* module code and Unit Tests # Phase6 pr about support thriftserver Impersonation # Phase7 pr about build spark's own beeline client *spark-beeline* # Phase8 pr about spark's own CLI client code to support *Spark SQL CLI* module named *spark-cli* h3. Appendix A. Proposed API Changes. Optional section defining APIs changes, if any. Backward and forward compatibility must be taken into account. Compared to
[jira] [Updated] (SPARK-29018) Build Spark thrift server on its own code based on protocol v11
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: With the development of Spark and Hive,in current sql/hive-thriftserver module, we need to do a lot of work to solve code conflicts for different built-in hive versions. It's an annoying and unending work in current ways. And these issues have limited our ability and convenience to develop new features for Spark’s thrift server. We suppose to implement a new thrift server and JDBC driver based on Hive’s latest v11 TCLService.thrift thrift protocol. Finally, the new thrift server have below feature: # Build new module spark-service as spark’s thrift server # Don't need as much reflection and inherited code as `hive-thriftser` modules # Support all functions current `sql/hive-thriftserver` support # Use all code maintained by spark itself, won’t depend on Hive # Support origin functions use spark’s own way, won't limited by Hive's code # Support running without hive metastore or with hive metastore # Support user impersonation by Multi-tenant splited hive authentication and DFS authentication # Support session hook for with spark’s own code # Add a new jdbc driver spark-jdbc, with spark’s own connection url “jdbc:spark::/” # Support both hive-jdbc and spark-jdbc client, then we can support most clients and BI platform was: With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . 
Finally, the new thrift server have below feature: # support all functions current `hive-thriftserver` support # use all code maintained by spark itself # realize origin function fit to spark’s own feature, won't limited by hive's code # support running without hive metastore or with hive metastore # support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication # add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support # support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. # *phase1* pr about build new module `spark-service` on folder `sql/service` 2. *phase2* pr thrift protocol and generated thrift protocol java code 3. *phase3* pr with all `spark-service` module code and description about design, also UT 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. *phase5* pr with all `spark-jdbc` module code and UT 6. *phase6* pr about support thriftserver Impersonation 7. *phase7* pr about build spark's own beeline client `spark-beeline` 8. *phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` Google Doc : [https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#heading=h.im79wwkzycsr] > Build spark thrift server on it's own code based on protocol v11 > > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > With the development of Spark and Hive,in current sql/hive-thriftserver > module, we need to do a lot of work to solve code conflicts for different > built-in hive versions. It's an annoying and unending work in current ways. 
> And these issues have limited our ability and convenience to develop new > features for Spark’s thrift server. > We suppose to implement a new thrift server and JDBC driver based on > Hive’s latest v11 TCLService.thrift thrift protocol. Finally, the new thrift > server have below feature: > # Build new module spark-service as spark’s thrift server > # Don't need as much reflection and inherited code as `hive-thriftser` > modules > # Support all functions current `sql/hive-thriftserver` support > # Use all code maintained by spark itself, won’t depend on Hive > # Support origin functions use spark’s own way, won't limited by Hive's code > # Support running without hive metastore or with hive metastore > # Support user impersonation by Multi-tenant splited hive authentication and > DFS authentication > # Support session hook for with spark’s own code > # Add a new jdbc driver spark-jdbc, with spark’s own connection url >
[jira] [Updated] (SPARK-29018) Build Spark thrift server on its own code based on protocol v11
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Summary: Build spark thrift server on it's own code based on protocol v11 (was: Spark ThriftServer change to it's own API) > Build spark thrift server on it's own code based on protocol v11 > > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > With the development of Spark and Hive,in current `sql/hive-thriftserver`, we > need to do a lot of work to solve code conflicts between different hive > versions. It's an annoying and unending work in current ways. And these > issues are troubling us when we develop new features for the > SparkThriftServer2. We suppose to implement a new thrift server based on > latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's > own code to get rid of hive code . > Finally, the new thrift server have below feature: > # support all functions current `hive-thriftserver` support > # use all code maintained by spark itself > # realize origin function fit to spark’s own feature, won't limited by > hive's code > # support running without hive metastore or with hive metastore > # support user impersonation by Multi-tenant authority separation, hive > authentication and DFS authentication > # add a new module `spark-jdbc`, with connection url > `jdbc:spark::/`, all `hive-jdbc` support we will all support > # support both `hive-jdbc` and `spark-jdbc` client for compatibility with > most clients > We have done all these works in our repo, now we plan merge our code into the > master step by step. > # *phase1* pr about build new module `spark-service` on folder `sql/service` > 2. *phase2* pr thrift protocol and generated thrift protocol java code > 3. 
*phase3* pr with all `spark-service` module code and description about > design, also UT > 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` > 5. *phase5* pr with all `spark-jdbc` module code and UT > 6. *phase6* pr about support thriftserver Impersonation > 7. *phase7* pr about build spark's own beeline client `spark-beeline` > 8. *phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` > > > Google Doc : > [https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#heading=h.im79wwkzycsr]
[jira] [Created] (SPARK-30293) HiveThriftServer2 will remove the wrong statement
angerszhu created SPARK-30293: - Summary: HiveThriftServer2 will remove the wrong statement Key: SPARK-30293 URL: https://issues.apache.org/jira/browse/SPARK-30293 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu HiveThriftServer2 will remove the wrong statement
[jira] [Updated] (SPARK-29018) Spark ThriftServer change to its own API
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: With the development of Spark and Hive, in the current `sql/hive-thriftserver` module we have to do a lot of work to resolve code conflicts between different Hive versions. It is annoying, unending work under the current approach, and these issues trouble us whenever we develop new features for the Spark Thrift Server. We propose to implement a new thrift server based on the latest v11 `TCLService.thrift` protocol, implementing all APIs in Spark's own code to get rid of the Hive code. The new thrift server will have the following features: # support all functions the current `hive-thriftserver` supports # use code maintained entirely by Spark itself # adapt the original functionality to Spark's own features, no longer limited by Hive's code # support running with or without a Hive metastore # support user impersonation through multi-tenant authority separation, Hive authentication, and DFS authentication # add a new module `spark-jdbc` with connection url `jdbc:spark::/`; everything `hive-jdbc` supports, we will support as well # support both `hive-jdbc` and `spark-jdbc` clients for compatibility with most clients We have done all this work in our repo, and now we plan to merge our code into master step by step. # *phase1* PR to build the new module `spark-service` under `sql/service` 2. *phase2* PR with the thrift protocol and the generated thrift protocol Java code 3. *phase3* PR with all `spark-service` module code, a description of the design, and UTs 4. *phase4* PR to build the new module `spark-jdbc` under `sql/jdbc` 5. *phase5* PR with all `spark-jdbc` module code and UTs 6. *phase6* PR to support thrift server impersonation 7. *phase7* PR to build Spark's own beeline client `spark-beeline` 8. 
*phase8* PR to build Spark's own CLI client (`Spark SQL CLI`) `spark-cli` Google Doc: [https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#heading=h.im79wwkzycsr] was: With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: # support all functions current `hive-thriftserver` support # use all code maintained by spark itself # realize origin function fit to spark’s own feature, won't limited by hive's code # support running without hive metastore or with hive metastore # support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication # add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support # support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. *phase1* pr about build new module `spark-service` on folder `sql/service` 2. *phase2* pr thrift protocol and generated thrift protocol java code 3. *phase3* pr with all `spark-service` module code and description about design, also UT 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. *phase5* pr with all `spark-jdbc` module code and UT 6. *phase6* pr about support thriftserver Impersonation 7. *phase7* pr about build spark's own beeline client `spark-beeline` 8. 
*phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` > Spark ThriftServer change to it's own API > - > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > With the development of Spark and Hive,in current `sql/hive-thriftserver`, we > need to do a lot of work to solve code conflicts between different hive > versions. It's an annoying and unending work in current ways. And these > issues are troubling us when we develop new features for the > SparkThriftServer2. We suppose to implement a new thrift server based on > latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's > own code to get rid of hive code . > Finally, the new thrift server have below feature: > # support all functions current `hive-thriftserver` support > # use all code maintained by spark itself > # realize origin function fit to spark’s own feature, won't
[jira] [Updated] (SPARK-30287) Add new module spark-service as thrift server module
[ https://issues.apache.org/jira/browse/SPARK-30287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-30287: -- Description: Add a new module of `sql/service` as spark thrift server module (was: Add a new module of thriftserver) > Add new module spark-service as thrift server module > > > Key: SPARK-30287 > URL: https://issues.apache.org/jira/browse/SPARK-30287 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > Add a new module of `sql/service` as spark thrift server module -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29018) Spark ThriftServer change to its own API
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: # support all functions current `hive-thriftserver` support # use all code maintained by spark itself # realize origin function fit to spark’s own feature, won't limited by hive's code # support running without hive metastore or with hive metastore # support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication # add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support # support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. *phase1* pr about build new module `spark-service` on folder `sql/service` 2. *phase2* pr thrift protocol and generated thrift protocol java code 3. *phase3* pr with all `spark-service` module code and description about design, also UT 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. *phase5* pr with all `spark-jdbc` module code and UT 6. *phase6* pr about support thriftserver Impersonation 7. *phase7* pr about build spark's own beeline client `spark-beeline` 8. 
*phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` was: With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: #. support all functions current `hive-thriftserver` support #. use all code maintained by spark itself 3. realize origin function fit to spark’s own feature, won't limited by hive's code 4. support running without hive metastore or with hive metastore 5. support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication 6. add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support 7. support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. *phase1* pr about build new module `spark-service` on folder `sql/service` 2. *phase2* pr thrift protocol and generated thrift protocol java code 3. *phase3* pr with all `spark-service` module code and description about design, also UT 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. *phase5* pr with all `spark-jdbc` module code and UT 6. *phase6* pr about support thriftserver Impersonation 7. *phase7* pr about build spark's own beeline client `spark-beeline` 8. 
*phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` > Spark ThriftServer change to it's own API > - > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > With the development of Spark and Hive,in current `sql/hive-thriftserver`, we > need to do a lot of work to solve code conflicts between different hive > versions. It's an annoying and unending work in current ways. And these > issues are troubling us when we develop new features for the > SparkThriftServer2. We suppose to implement a new thrift server based on > latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's > own code to get rid of hive code . > Finally, the new thrift server have below feature: > # support all functions current `hive-thriftserver` support > # use all code maintained by spark itself > # realize origin function fit to spark’s own feature, won't limited by hive's > code > # support running without hive metastore or with hive metastore > # support user
[jira] [Updated] (SPARK-29018) Spark ThriftServer change to its own API
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: #. support all functions current `hive-thriftserver` support #. use all code maintained by spark itself 3. realize origin function fit to spark’s own feature, won't limited by hive's code 4. support running without hive metastore or with hive metastore 5. support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication 6. add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support 7. support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. *phase1* pr about build new module `spark-service` on folder `sql/service` 2. *phase2* pr thrift protocol and generated thrift protocol java code 3. *phase3* pr with all `spark-service` module code and description about design, also UT 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. *phase5* pr with all `spark-jdbc` module code and UT 6. *phase6* pr about support thriftserver Impersonation 7. *phase7* pr about build spark's own beeline client `spark-beeline` 8. *phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` was: ### What changes were proposed in this pull request? 
With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: 1. support all functions current `hive-thriftserver` support 2. use all code maintained by spark itself 3. realize origin function fit to spark’s own feature, won't limited by hive's code 4. support running without hive metastore or with hive metastore 5. support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication 6. add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support 7. support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. **phase1** pr about build new module `spark-service` on folder `sql/service` 2. **phase2** pr thrift protocol and generated thrift protocol java code 3. **phase3** pr with all `spark-service` module code and description about design, also UT 4. **phase4** pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. **phase5** pr with all `spark-jdbc` module code and UT 6. **phase6** pr about support thriftserver Impersonation 7. **phase7** pr about build spark's own beeline client `spark-beeline` 8. **phase8** pr about spark's own Cli client `Spark SQL CLI` `spark-cli` ### Why are the changes needed? Build a totally new thrift server base on spark's own code and feature.Don't rely on hive code anymore ### Does this PR introduce any user-facing change? ### How was this patch tested? 
Not need UT now > Spark ThriftServer change to it's own API > - > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > With the development of Spark and Hive,in current `sql/hive-thriftserver`, we > need to do a lot of work to solve code conflicts between different hive > versions. It's an annoying and unending work in current ways. And these > issues are troubling us when we develop new features for the > SparkThriftServer2. We suppose to implement a new thrift server based on > latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's > own code to get rid of hive code . > Finally, the
[jira] [Updated] (SPARK-29018) Spark ThriftServer change to its own API
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: ### What changes were proposed in this pull request? With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: 1. support all functions current `hive-thriftserver` support 2. use all code maintained by spark itself 3. realize origin function fit to spark’s own feature, won't limited by hive's code 4. support running without hive metastore or with hive metastore 5. support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication 6. add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support 7. support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. **phase1** pr about build new module `spark-service` on folder `sql/service` 2. **phase2** pr thrift protocol and generated thrift protocol java code 3. **phase3** pr with all `spark-service` module code and description about design, also UT 4. **phase4** pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. **phase5** pr with all `spark-jdbc` module code and UT 6. **phase6** pr about support thriftserver Impersonation 7. **phase7** pr about build spark's own beeline client `spark-beeline` 8. 
**phase8** pr about spark's own Cli client `Spark SQL CLI` `spark-cli` ### Why are the changes needed? Build a totally new thrift server base on spark's own code and feature.Don't rely on hive code anymore ### Does this PR introduce any user-facing change? ### How was this patch tested? Not need UT now was: Current SparkThriftServer rely on HiveServer2 too much, when Hive version changed, we should change a lot to fit for Hive code change. We would best just use Hive's thrift interface to implement it 's own API for SparkThriftServer. And remove unused code logical [for Spark Thrift Server]. > Spark ThriftServer change to it's own API > - > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > ### What changes were proposed in this pull request? > With the development of Spark and Hive,in current `sql/hive-thriftserver`, we > need to do a lot of work to solve code conflicts between different hive > versions. It's an annoying and unending work in current ways. And these > issues are troubling us when we develop new features for the > SparkThriftServer2. We suppose to implement a new thrift server based on > latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's > own code to get rid of hive code . > Finally, the new thrift server have below feature: > 1. support all functions current `hive-thriftserver` support > 2. use all code maintained by spark itself > 3. realize origin function fit to spark’s own feature, won't limited by > hive's code > 4. support running without hive metastore or with hive metastore > 5. support user impersonation by Multi-tenant authority separation, hive > authentication and DFS authentication > 6. add a new module `spark-jdbc`, with connection url > `jdbc:spark::/`, all `hive-jdbc` support we will all support > 7. 
support both `hive-jdbc` and `spark-jdbc` client for compatibility with > most clients > We have done all these works in our repo, now we plan merge our code into > the master step by step. > 1. **phase1** pr about build new module `spark-service` on folder > `sql/service` > 2. **phase2** pr thrift protocol and generated thrift protocol java code > 3. **phase3** pr with all `spark-service` module code and description about > design, also UT > 4. **phase4** pr about build new module `spark-jdbc` on folder `sql/jdbc` > 5. **phase5** pr with all `spark-jdbc` module code and UT > 6. **phase6** pr about support thriftserver Impersonation > 7. **phase7** pr about build spark's own beeline client `spark-beeline` > 8. **phase8** pr about spark's own Cli client `Spark SQL CLI` `spark-cli` > ### Why are the changes needed? > Build a totally new thrift server base on
[jira] [Created] (SPARK-30287) Add new module spark-service as thrift server module
angerszhu created SPARK-30287: - Summary: Add new module spark-service as thrift server module Key: SPARK-30287 URL: https://issues.apache.org/jira/browse/SPARK-30287 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu Add a new module for the thrift server -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30271) dynamic allocation won't release some executors in some cases
angerszhu created SPARK-30271: - Summary: dynamic allocation won't release some executors in some cases Key: SPARK-30271 URL: https://issues.apache.org/jira/browse/SPARK-30271 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.4.0 Reporter: angerszhu Case: max executors 5, min executors 0, idle time 5s. Stage-1 runs 10 tasks on 5 executors. When stage-1 finishes on all 5 executors, every executor is added to `removeTimes` on its task-end event. After 5s the release process starts; since stage-2 has 20 tasks, the executors are not removed (the existing executor count is below the target number), but they are still dropped from `removeTimes`. However, if tasks are never scheduled onto one of these executors (say executor-1 never gets another task to run), it is never put back into `removeTimes`, and once there are no more tasks that executor is never removed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
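The bookkeeping described above can be reduced to a toy model. This is not Spark's actual `ExecutorAllocationManager` code; `IdleTracker`, its fields, and the removal policy are simplified, hypothetical stand-ins used only to illustrate how an executor can fall out of `removeTimes` and never return:

```python
class IdleTracker:
    """Toy model of the removeTimes bookkeeping (hypothetical, simplified)."""

    def __init__(self, idle_timeout_s=5):
        self.idle_timeout_s = idle_timeout_s
        self.remove_times = {}   # executor id -> timestamp when it may be removed
        self.executors = set()   # currently allocated executors

    def on_task_end(self, executor_id, now):
        # Executor becomes idle: schedule it for removal after the idle timeout.
        self.remove_times[executor_id] = now + self.idle_timeout_s

    def on_task_start(self, executor_id):
        # Executor is busy again: cancel its pending removal.
        self.remove_times.pop(executor_id, None)

    def maybe_remove(self, now, target_num):
        # Reported behavior: an expired entry is dropped from remove_times even
        # when the executor is kept because the count is at or below the target.
        # If no task is ever scheduled on that executor again, nothing re-adds
        # it, so it is never released.
        for executor_id in [e for e, t in self.remove_times.items() if t <= now]:
            del self.remove_times[executor_id]
            if len(self.executors) > target_num:
                self.executors.discard(executor_id)
```

Walking through the reported scenario: five idle executors expire while stage-2's target is high, so none is removed but `remove_times` is emptied; an executor that never receives another task then survives even after the target drops to zero.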
[jira] [Commented] (SPARK-16180) Task hang on fetching blocks (cached RDD)
[ https://issues.apache.org/jira/browse/SPARK-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997310#comment-16997310 ] angerszhu commented on SPARK-16180: --- i meet this problem recently in spark-2.4 > Task hang on fetching blocks (cached RDD) > - > > Key: SPARK-16180 > URL: https://issues.apache.org/jira/browse/SPARK-16180 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.1 >Reporter: Davies Liu >Priority: Major > Labels: bulk-closed > > Here is the stackdump of executor: > {code} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:107) > org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102) > org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588) > org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585) > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585) > 
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570) > org.apache.spark.storage.BlockManager.get(BlockManager.scala:630) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44) > org.apache.spark.rdd.RDD.iterator(RDD.scala:268) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46) > org.apache.spark.scheduler.Task.run(Task.scala:96) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
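The stack dump above is parked in an unbounded `Await.result` underneath `BlockTransferService.fetchBlockSync`, i.e. a synchronous wrapper blocking on an asynchronous fetch. A minimal sketch of that pattern, with a hypothetical `timeout_s` knob showing how a bounded wait would turn the silent hang into a visible error (illustrative only, not Spark's API):

```python
from concurrent.futures import Future, TimeoutError

def fetch_block_sync(start_fetch, timeout_s=None):
    """Synchronously wait for an async block fetch (hypothetical sketch).

    start_fetch(fut) kicks off the fetch and is expected to complete `fut`
    later. With timeout_s=None this mirrors an unbounded Await.result: if the
    remote side never replies, the caller hangs forever, as in the stack dump.
    """
    fut = Future()
    start_fetch(fut)
    # A finite timeout raises TimeoutError instead of hanging the task thread.
    return fut.result(timeout=timeout_s)
```

A fetch that never completes then surfaces as a `TimeoutError` the task can retry or report, rather than a thread stuck in `LockSupport.park` forever.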
[jira] [Created] (SPARK-30237) Move `sql()` method from DataType to AbstractType
angerszhu created SPARK-30237: - Summary: Move `sql()` method from DataType to AbstractType Key: SPARK-30237 URL: https://issues.apache.org/jira/browse/SPARK-30237 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu Move `sql()` method from DataType to AbstractType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30223) queries in thrift server may read wrong SQL configs
[ https://issues.apache.org/jira/browse/SPARK-30223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994061#comment-16994061 ] angerszhu edited comment on SPARK-30223 at 12/12/19 1:49 AM: - [~cloud_fan] Add ``` SparkSession.setActiveSession(sqlContext.sparkSession) ``` in each Metadata Operation? Maybe we should sort out the use of `SQLConf.get()` like https://github.com/apache/spark/pull/26187 was (Author: angerszhuuu): [~cloud_fan] Add ``` SparkSession.setActiveSession(sqlContext.sparkSession) ``` in each Metadata Operation? > queries in thrift server may read wrong SQL configs > --- > > Key: SPARK-30223 > URL: https://issues.apache.org/jira/browse/SPARK-30223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The Spark thrift server creates many SparkSessions to serve requests, and the > thrift server serves requests using a single thread. One thread can only have > one active SparkSession, so SQLCong.get can't get the proper conf from the > session that runs the query. > Whenever we issue an action on a SparkSession, we should set this session as > active session, e.g. `SparkSession.sql`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30223) queries in thrift server may read wrong SQL configs
[ https://issues.apache.org/jira/browse/SPARK-30223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994061#comment-16994061 ] angerszhu edited comment on SPARK-30223 at 12/12/19 1:49 AM: - [~cloud_fan] [~yumwang] Add ``` SparkSession.setActiveSession(sqlContext.sparkSession) ``` in each Metadata Operation? Maybe we should sort out the use of `SQLConf.get()` like https://github.com/apache/spark/pull/26187 was (Author: angerszhuuu): [~cloud_fan] Add ``` SparkSession.setActiveSession(sqlContext.sparkSession) ``` in each Metadata Operation? Maybe we should sort out the use of `SQLConf.get()` like https://github.com/apache/spark/pull/26187 > queries in thrift server may read wrong SQL configs > --- > > Key: SPARK-30223 > URL: https://issues.apache.org/jira/browse/SPARK-30223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The Spark thrift server creates many SparkSessions to serve requests, and the > thrift server serves requests using a single thread. One thread can only have > one active SparkSession, so SQLCong.get can't get the proper conf from the > session that runs the query. > Whenever we issue an action on a SparkSession, we should set this session as > active session, e.g. `SparkSession.sql`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30223) queries in thrift server may read wrong SQL configs
[ https://issues.apache.org/jira/browse/SPARK-30223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994061#comment-16994061 ] angerszhu commented on SPARK-30223: --- [~cloud_fan] Add ``` SparkSession.setActiveSession(sqlContext.sparkSession) ``` in each Metadata Operation? > queries in thrift server may read wrong SQL configs > --- > > Key: SPARK-30223 > URL: https://issues.apache.org/jira/browse/SPARK-30223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The Spark thrift server creates many SparkSessions to serve requests, and the > thrift server serves requests using a single thread. One thread can only have > one active SparkSession, so SQLCong.get can't get the proper conf from the > session that runs the query. > Whenever we issue an action on a SparkSession, we should set this session as > active session, e.g. `SparkSession.sql`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
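The underlying problem is that `SQLConf.get` resolves configs through the thread's *active* session, while one thrift server thread serves many sessions. A few lines are enough to sketch it; `SparkSessionStub`, `set_active_session`, and `sql_conf_get` are hypothetical stand-ins for `SparkSession`, `SparkSession.setActiveSession`, and `SQLConf.get`:

```python
import threading

_active = threading.local()

class SparkSessionStub:
    """Hypothetical stand-in for SparkSession; conf is just a dict."""
    def __init__(self, conf):
        self.conf = conf

def set_active_session(session):
    # What setActiveSession does: bind a session to the current thread.
    _active.session = session

def sql_conf_get(key, default=None):
    # Sketch of SQLConf.get: it reads from the thread's active session,
    # not from the session the query logically belongs to.
    session = getattr(_active, "session", None)
    return session.conf.get(key, default) if session is not None else default
```

If the server runs session B's metadata operation on a thread whose active session is still A, the lookup silently returns A's value; calling `set_active_session(b)` first, as the comment suggests, is what makes the lookup land in the right session.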
[jira] [Commented] (SPARK-20135) spark thriftserver2: no job running but containers not release on yarn
[ https://issues.apache.org/jira/browse/SPARK-20135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992644#comment-16992644 ] angerszhu commented on SPARK-20135: --- I met the same problem in Spark 2.4.0. [~xwc3504] Do you have any ideas now? > spark thriftserver2: no job running but containers not release on yarn > -- > > Key: SPARK-20135 > URL: https://issues.apache.org/jira/browse/SPARK-20135 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: spark 2.0.1 with hadoop 2.6.0 >Reporter: bruce xu >Priority: Major > Attachments: 0329-1.png, 0329-2.png, 0329-3.png > > > I enabled the executor dynamic allocation feature; however, it sometimes > doesn't work. > I set the initial executor num to 50; after the job finished, the cores and > memory resources were not released. > From the Spark web UI, the active job/running task/stage count is 0, but the > executors page shows 1276 cores and 7288 active tasks. > From the YARN web UI, the thrift server job's running container count is 639, > with nothing released. > This may be a bug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29998) A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails
[ https://issues.apache.org/jira/browse/SPARK-29998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980128#comment-16980128 ] angerszhu commented on SPARK-29998: --- [~cloud_fan] No, it's not the same problem, but both are caused by a broken disk. In my case it happened when the task started and fetched the broadcast value; his case happened when shuffling data. His PR can't fix this, but it seems we can fix this problem in a similar way: make it retry and choose a working folder. > A corrupted hard disk causes the task to execute repeatedly on a machine > until the job fails > > > Key: SPARK-29998 > URL: https://issues.apache.org/jira/browse/SPARK-29998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > > Recently I met the following situation: > one NodeManager's disk was broken. When a task began to run, it fetched the JobConf > via broadcast; the executor's BlockManager failed to create a local file and threw an > IOException. > {code} > 19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: > "ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due > to Job aborted due to stage failure: Task 21 in st > age 343.0 failed 4 times, most recent failure: Lost task 21.3 in stage 343.0 > (TID 34968, hostname, executor 104): java.io.IOException: Failed to create > local dir in /disk > 11/yarn/local/usercache/username/appcache/application_1573542949548_2889852/blockmgr-a70777d8-5159-48e7-a47e-848df01a831e/3b. 
> at > org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70) > at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:129) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:605) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:228) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
[jira] [Commented] (SPARK-29998) A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails
[ https://issues.apache.org/jira/browse/SPARK-29998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980024#comment-16980024 ] angerszhu commented on SPARK-29998: --- [~yumwang][~dongjoon] [~cloud_fan] For this problem, I think we have two ways to fix it: # If this situation happens, fail immediately # Define a special exception and handle it like `FetchFailed` WDYT? > A corrupted hard disk causes the task to execute repeatedly on a machine > until the job fails > > > Key: SPARK-29998 > URL: https://issues.apache.org/jira/browse/SPARK-29998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major >
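The second option in the comment above (a special exception handled like `FetchFailed`) can be sketched in plain Java. The class name, exception type, and policy strings here are invented for illustration; this is not Spark's scheduler code:

```java
// Hypothetical sketch: wrap local-dir creation failures in a dedicated
// exception type so the scheduler can recognize them and, like FetchFailed,
// reschedule the task elsewhere instead of counting the failure against the
// per-task limit on the same broken executor.
import java.io.IOException;

public class FailureClassifier {
    // Invented exception type for the sketch.
    public static class LocalDirCreationFailedException extends IOException {
        public LocalDirCreationFailedException(String msg) { super(msg); }
    }

    // Invented policy names; in real Spark this decision would live in the
    // scheduler (e.g. TaskSetManager's failure handling).
    public static String classify(Throwable t) {
        if (t instanceof LocalDirCreationFailedException) {
            return "resubmit-on-other-executor"; // FetchFailed-style handling
        }
        return "count-task-failure"; // normal path: counts toward the limit
    }
}
```

The point of the design: `FetchFailed`-style handling resubmits work elsewhere rather than incrementing the per-task failure count, so one broken disk cannot burn through all retry attempts by itself.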
[jira] [Updated] (SPARK-29998) A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails
[ https://issues.apache.org/jira/browse/SPARK-29998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29998: -- Description: Recently I met the following situation: one NodeManager's disk was broken. When the task began to run, it fetched the JobConf via broadcast, and the executor's BlockManager failed to create a local file and threw an IOException. {code} 19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: "ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due to Job aborted due to stage failure: Task 21 in st age 343.0 failed 4 times, most recent failure: Lost task 21.3 in stage 343.0 (TID 34968, hostname, executor 104): java.io.IOException: Failed to create local dir in /disk 11/yarn/local/usercache/username/appcache/application_1573542949548_2889852/blockmgr-a70777d8-5159-48e7-a47e-848df01a831e/3b. at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70) at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:129) at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:605) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144) at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:228) at 
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} Because of how TaskSetManager.handleFailedTask() treats this kind of failure reason, the task keeps retrying on the same executor until `failedTime > maxTaskFailTime`; then the stage fails, and the whole job fails. was: Recently, I meet one situation: One NodeManager's disk is broken. when task begin to run, it will get jobConf by broadcast, executor's BlockManager failed to create file. and throw IOException. ``` 19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: "ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due
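The failure mode described in this report can be shown with a tiny simulation. All names here are invented for the sketch (this is not `TaskSetManager`'s real code): every retry of the task lands on the same executor with the broken disk, so the failure count for that one task reaches the limit and the whole job aborts.

```java
// Minimal simulation of the retry loop: a task that always fails on the
// broken-disk executor is retried up to the per-task failure limit (4 by
// default, matching "failed 4 times" in the log), then the stage -- and the
// job -- fail.
public class RetrySimulation {
    static final int MAX_TASK_FAILURES = 4; // spark.task.maxFailures default

    // Simulated attempt: on the broken executor, creating a local dir throws,
    // so the attempt fails; on a healthy executor it succeeds.
    static boolean attemptTask(boolean executorDiskBroken) {
        return !executorDiskBroken;
    }

    public static String runTask(boolean alwaysScheduledOnBrokenExecutor) {
        int failures = 0;
        while (failures < MAX_TASK_FAILURES) {
            if (attemptTask(alwaysScheduledOnBrokenExecutor)) return "task succeeded";
            failures++; // handleFailedTask-style bookkeeping per task index
        }
        return "job aborted: task failed " + failures + " times";
    }
}
```

With the default limit of 4 attempts, this reproduces the shape of the failure above: retries never leave the bad machine, so the count is exhausted quickly.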
[jira] [Updated] (SPARK-29998) A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails
[ https://issues.apache.org/jira/browse/SPARK-29998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29998: -- Description: Recently, I meet one situation: One NodeManager's disk is broken. when task begin to run, it will get jobConf by broadcast, executor's BlockManager failed to create file. and throw IOException. ``` 19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: "ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due to Job aborted due to stage failure: Task 21 in st age 343.0 failed 4 times, most recent failure: Lost task 21.3 in stage 343.0 (TID 34968, hostname, executor 104): java.io.IOException: Failed to create local dir in /disk 11/yarn/local/usercache/username/appcache/application_1573542949548_2889852/blockmgr-a70777d8-5159-48e7-a47e-848df01a831e/3b. at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70) at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:129) at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:605) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144) at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:228) at 
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` Because of how TaskSetManager.handleFailedTask() treats this kind of failure reason, the task keeps retrying on the same executor until `failedTime > maxTaskFailTime`; then the stage fails, and the whole job fails. > A corrupted hard disk causes the task to execute repeatedly on a machine > until the job fails > > > Key: SPARK-29998 > URL: https://issues.apache.org/jira/browse/SPARK-29998 > Project: Spark > Issue Type:
[jira] [Created] (SPARK-29998) A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails
angerszhu created SPARK-29998: - Summary: A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails Key: SPARK-29998 URL: https://issues.apache.org/jira/browse/SPARK-29998 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29957) Bump MiniKdc to 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-29957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29957: -- Description: MiniKdc versions lower than the one shipped with Hadoop 3.0 do not work well on JDK 11. New encryption types aes128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 (for Kerberos 5), enabled by default, were added in Java 11, while the version of MiniKdc below 3.0.0 used by Spark does not support these encryption types and does not work well when they are enabled, which results in authentication failure. was: ince MiniKdc version lower than hadoop-3.0 can't work well in jdk11. New encryption types of es128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 (for Kerberos 5) enabled by default were added in Java 11, while version of MiniKdc under 3.0.0 used by Spark does not support these encryption types and does not work well when these encryption types are enabled, which results in the authentication failure. > Bump MiniKdc to 3.2.0 > - > > Key: SPARK-29957 > URL: https://issues.apache.org/jira/browse/SPARK-29957 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > MiniKdc versions lower than the one shipped with Hadoop 3.0 do not work well on JDK 11. > New encryption types aes128-cts-hmac-sha256-128 and > aes256-cts-hmac-sha384-192 (for Kerberos 5), enabled by default, were added in > Java 11, while the version of MiniKdc below 3.0.0 used by Spark does not support > these encryption types and does not work well when they are > enabled, which results in authentication failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29957) Bump MiniKdc to 3.2.0
angerszhu created SPARK-29957: - Summary: Bump MiniKdc to 3.2.0 Key: SPARK-29957 URL: https://issues.apache.org/jira/browse/SPARK-29957 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.0.0 Reporter: angerszhu MiniKdc versions lower than the one shipped with Hadoop 3.0 do not work well on JDK 11. New encryption types aes128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 (for Kerberos 5), enabled by default, were added in Java 11, while the version of MiniKdc below 3.0.0 used by Spark does not support these encryption types and does not work well when they are enabled, which results in authentication failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
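The mismatch described above can be made concrete with a small sketch. The Java 11 set below comes from the report itself; the pre-3.0 MiniKdc set is an assumption for illustration (older SHA-1-based AES types), not an exact list from MiniKdc's source:

```java
import java.util.Set;

public class EnctypeMismatch {
    // The two encryption types Java 11 enables by default that are at issue.
    public static final Set<String> NEW_IN_JAVA_11 = Set.of(
        "aes128-cts-hmac-sha256-128",
        "aes256-cts-hmac-sha384-192");

    // Assumed pre-3.0 MiniKdc support for the sketch (illustrative only).
    public static final Set<String> OLD_MINIKDC = Set.of(
        "aes128-cts-hmac-sha1-96",
        "aes256-cts-hmac-sha1-96");

    // If the client may negotiate a type the KDC does not support,
    // authentication can fail -- the situation the ticket describes.
    public static boolean authFails(Set<String> clientDefaults, Set<String> kdcSupported) {
        return clientDefaults.stream().anyMatch(t -> !kdcSupported.contains(t));
    }
}
```

Bumping MiniKdc to a Hadoop 3.x version makes the KDC side support the new types, so the predicate above no longer fires.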
[jira] [Created] (SPARK-29874) Optimize Dataset.isEmpty()
angerszhu created SPARK-29874: - Summary: Optimize Dataset.isEmpty() Key: SPARK-29874 URL: https://issues.apache.org/jira/browse/SPARK-29874 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
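The ticket has no description, but the optimization commonly proposed for `Dataset.isEmpty()` is to look at no more than one row instead of counting everything. A hedged sketch of the idea on plain Java collections (not Spark's actual implementation):

```java
import java.util.List;

public class IsEmptySketch {
    // Slow idea: count every row, then compare with 0 -- scans the whole input.
    public static boolean isEmptyViaCount(List<Integer> rows) {
        return rows.stream().count() == 0;
    }

    // Optimized idea: ask for at most one row (the "limit(1)" trick) and check
    // whether anything came back -- stops as soon as the first row appears.
    public static boolean isEmptyViaLimitOne(List<Integer> rows) {
        return rows.stream().limit(1).findAny().isEmpty();
    }
}
```

In Spark terms the same trade-off applies: deciding emptiness from a one-row limit avoids materializing or counting the full dataset.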
[jira] [Commented] (SPARK-29800) Plan Exists 's subquery in PlanSubqueries
[ https://issues.apache.org/jira/browse/SPARK-29800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970040#comment-16970040 ] angerszhu commented on SPARK-29800: --- Will raise a PR soon. > Plan Exists 's subquery in PlanSubqueries > - > > Key: SPARK-29800 > URL: https://issues.apache.org/jira/browse/SPARK-29800 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29800) Plan Exists 's subquery in PlanSubqueries
angerszhu created SPARK-29800: - Summary: Plan Exists 's subquery in PlanSubqueries Key: SPARK-29800 URL: https://issues.apache.org/jira/browse/SPARK-29800 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
[ https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29769: -- Description: In the original master, we can't run SQL that uses `EXISTS/NOT EXISTS` in a join's ON condition: {code} create temporary view s1 as select * from values (1), (3), (5), (7), (9) as s1(id); create temporary view s2 as select * from values (1), (3), (4), (6), (9) as s2(id); create temporary view s3 as select * from values (3), (4), (6), (9) as s3(id); explain extended SELECT s1.id, s2.id as id2 FROM s1 LEFT OUTER JOIN s2 ON s1.id = s2.id AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6) we will get == Parsed Logical Plan == 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, (('s1.id = 's2.id) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- 'UnresolvedRelation `s1` +- 'UnresolvedRelation `s2` == Analyzed Logical Plan == org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] == Optimized Logical Plan == org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] == Physical Plan == org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] Time taken: 1.455 seconds, Fetched 1 row(s) {code} Since the analyzer does not resolve the join condition's subquery in *Analyzer.ResolveSubquery*, the table *s3* stays unresolved. After PR https://github.com/apache/spark/pull/25854/files, subqueries in the join condition are resolved and the query passes the analyzer. In the current master, if we run the SQL above, we get {code} == Parsed Logical Plan == 'Project ['s1.id, 's2.id AS id2#291] +- 'Join LeftOuter, (('s1.id = 's2.id) AND exists#290 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation [s3] :- 'UnresolvedRelation [s1] +- 'UnresolvedRelation [s2] == Analyzed Logical Plan == id: int, id2: int Project [id#244, id#250 AS id2#291] +- Join LeftOuter, ((id#244 = id#250) AND exists#290 []) : +- Project [id#256] : +- Filter (id#256 > 6) :+- SubqueryAlias `s3` : +- Project [value#253 AS id#256] : +- LocalRelation [value#253] :- SubqueryAlias `s1` : +- Project [value#241 AS id#244] : +- LocalRelation [value#241] +- SubqueryAlias `s2` +- Project [value#247 AS id#250] +- LocalRelation [value#247] == Optimized Logical Plan == Project [id#244, id#250 AS id2#291] +- Join LeftOuter, (exists#290 [] AND (id#244 = id#250)) : +- Project [value#253 AS id#256] : +- Filter (value#253 > 6) :+- LocalRelation [value#253] :- Project [value#241 AS id#244] : +- LocalRelation
[value#241] +- Project [value#247 AS id#250] +- LocalRelation [value#247] == Physical Plan == *(2) Project [id#244, id#250 AS id2#291] +- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 [] : +- Project [value#253 AS id#256] : +- Filter (value#253 > 6) :+- LocalRelation [value#253] :- *(2) Project [value#241 AS id#244] : +- *(2) LocalTableScan [value#241] +-
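To pin down what the example query computes, here is the same query re-implemented on plain Java collections. This illustrates the semantics only, not Spark's execution: the `EXISTS (SELECT * FROM s3 WHERE s3.id > 6)` subquery is uncorrelated, so it reduces to a single boolean guard, true here because s3 contains 9.

```java
import java.util.ArrayList;
import java.util.List;

public class ExistsInJoin {
    // Rows are returned as "id,id2" strings, with NULL for an unmatched right
    // side, mirroring: SELECT s1.id, s2.id AS id2 FROM s1 LEFT OUTER JOIN s2
    // ON s1.id = s2.id AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6)
    public static List<String> query() {
        List<Integer> s1 = List.of(1, 3, 5, 7, 9);
        List<Integer> s2 = List.of(1, 3, 4, 6, 9);
        List<Integer> s3 = List.of(3, 4, 6, 9);

        // Uncorrelated EXISTS: evaluate it once up front.
        boolean existsGuard = s3.stream().anyMatch(id -> id > 6); // true: s3 has 9

        List<String> rows = new ArrayList<>();
        for (int id : s1) {
            // LEFT OUTER JOIN: every s1 row survives; the ON condition
            // (both conjuncts) decides whether the s2 side matches or is NULL.
            boolean matched = existsGuard && s2.contains(id);
            rows.add(id + "," + (matched ? id : "NULL"));
        }
        return rows;
    }
}
```

Because the guard is true, the result is the ordinary left join of s1 with s2; had s3 contained no value above 6, every row would have a NULL right side.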
[jira] [Reopened] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
[ https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu reopened SPARK-29769: --- > Spark SQL cannot handle "exists/not exists" condition when using "JOIN" > --- > > Key: SPARK-29769 > URL: https://issues.apache.org/jira/browse/SPARK-29769 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
[ https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-29769. --- Resolution: Invalid > Spark SQL cannot handle "exists/not exists" condition when using "JOIN" > --- > > Key: SPARK-29769 > URL: https://issues.apache.org/jira/browse/SPARK-29769 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
angerszhu created SPARK-29769: - Summary: Spark SQL cannot handle "exists/not exists" condition when using "JOIN" Key: SPARK-29769 URL: https://issues.apache.org/jira/browse/SPARK-29769 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17398) Failed to query on external JSon Partitioned table
[ https://issues.apache.org/jira/browse/SPARK-17398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968033#comment-16968033 ] angerszhu commented on SPARK-17398: --- [~bianqi] Hi, I met this problem too. Do you know what the cause is? > Failed to query on external JSon Partitioned table > -- > > Key: SPARK-17398 > URL: https://issues.apache.org/jira/browse/SPARK-17398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: pin_zhang >Priority: Major > Fix For: 2.0.1 > > Attachments: screenshot-1.png > > > 1. Create External Json partitioned table > with SerDe in hive-hcatalog-core-1.2.1.jar, download from > https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/1.2.1 > 2. Query table meet exception, which works in spark1.5.2 > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: > Lost task > 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: > java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord > at > org.apache.hive.hcatalog.data.HCatRecordObjectInspector.getStructFieldData(HCatRecordObjectInspector.java:45) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:430) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > > 3. 
Test Code:
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object JsonBugs {
  def main(args: Array[String]): Unit = {
    val table = "test_json"
    val location = "file:///g:/home/test/json"
    val create = s"""CREATE EXTERNAL TABLE ${table}
      (id string, seq string)
      PARTITIONED BY (index int)
      ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
      LOCATION "${location}"
      """
    val add_part = s"""
      ALTER TABLE ${table} ADD
      PARTITION (index=1) LOCATION '${location}/index=1'
      """
    val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
    conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse")
    val ctx = new SparkContext(conf)
    val hctx = new HiveContext(ctx)
    val exist = hctx.tableNames().map { x => x.toLowerCase() }.contains(table)
    if (!exist) {
      hctx.sql(create)
      hctx.sql(add_part)
    } else {
      hctx.sql("show partitions " + table).show()
    }
    hctx.sql("select * from test_json").show()
  }
}
{code}
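A possible workaround for the `HCatRecord` cast failure above is to bypass the Hive serde entirely and read the same partitioned JSON layout with Spark's native JSON source. This is a sketch only, not part of the issue; it assumes a Spark 2.x+ `SparkSession` named `spark`, and the path comes from the quoted test code:

{code:scala}
// Workaround sketch (assumption, not from the issue): read the partitioned
// JSON directory directly with Spark's built-in JSON reader. With basePath
// set, partition discovery exposes the index=1 directory as an `index` column.
val df = spark.read
  .option("basePath", "file:///g:/home/test/json")
  .json("file:///g:/home/test/json/index=1")
df.show()
{code}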
[jira] [Created] (SPARK-29742) dev/lint-java can't check all code we will use
angerszhu created SPARK-29742: - Summary: dev/lint-java can't check all code we will use Key: SPARK-29742 URL: https://issues.apache.org/jira/browse/SPARK-29742 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: angerszhu `dev/lint-java` can't cover all the code we use
[jira] [Updated] (SPARK-29599) Support pagination for session table in JDBC/ODBC Tab
[ https://issues.apache.org/jira/browse/SPARK-29599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29599: -- Summary: Support pagination for session table in JDBC/ODBC Tab (was: Support pagination for session table in JDBC/ODBC Session page ) > Support pagination for session table in JDBC/ODBC Tab > -- > > Key: SPARK-29599 > URL: https://issues.apache.org/jira/browse/SPARK-29599 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Minor > > Support pagination for session table in JDBC/ODBC Session page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29599) Support pagination for session table in JDBC/ODBC Session page
[ https://issues.apache.org/jira/browse/SPARK-29599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959454#comment-16959454 ] angerszhu commented on SPARK-29599: --- work on this. > Support pagination for session table in JDBC/ODBC Session page > --- > > Key: SPARK-29599 > URL: https://issues.apache.org/jira/browse/SPARK-29599 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Minor > > Support pagination for session table in JDBC/ODBC Session page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29599) Support pagination for session table in JDBC/ODBC Session page
angerszhu created SPARK-29599: - Summary: Support pagination for session table in JDBC/ODBC Session page Key: SPARK-29599 URL: https://issues.apache.org/jira/browse/SPARK-29599 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu Support pagination for session table in JDBC/ODBC Session page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29540) Thrift in some cases can't parse string to date
[ https://issues.apache.org/jira/browse/SPARK-29540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956863#comment-16956863 ] angerszhu commented on SPARK-29540: --- Checking on this. > Thrift in some cases can't parse string to date > --- > > Key: SPARK-29540 > URL: https://issues.apache.org/jira/browse/SPARK-29540 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > I'm porting tests from PostgreSQL window.sql but anything related to casting > a string to datetime seems to fail on Thrift. For instance, the following > does not work:
{code:sql}
CREATE TABLE empsalary (
  depname string,
  empno integer,
  salary int,
  enroll_date date
) USING parquet;

INSERT INTO empsalary VALUES ('develop', 10, 5200, '2007-08-01');
{code}
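A hedged workaround sketch for the cast failure described above: instead of relying on implicit string-to-date coercion, cast the literal explicitly. This assumes a Spark 3.x `SparkSession` named `spark`; the table and values come from the quoted test:

{code:scala}
// Sketch only: make the string-to-date conversion explicit rather than
// depending on the session's implicit coercion rules, which is where the
// reported Thrift failure occurs.
spark.sql("""
  CREATE TABLE IF NOT EXISTS empsalary (
    depname string,
    empno integer,
    salary int,
    enroll_date date
  ) USING parquet
""")

// Explicit CAST('2007-08-01' AS DATE) avoids the implicit coercion path.
spark.sql("""
  INSERT INTO empsalary
  VALUES ('develop', 10, 5200, CAST('2007-08-01' AS DATE))
""")
{code}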
[jira] [Created] (SPARK-29530) SparkSession.sql() method parse process not under current sparksession's conf
angerszhu created SPARK-29530: - Summary: SparkSession.sql() method parse process not under current sparksession's conf Key: SPARK-29530 URL: https://issues.apache.org/jira/browse/SPARK-29530 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: SparkSession.sql() method parse process not under current sparksession's conf Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29492: -- Description: Add UT in HiveThriftBinaryServerSuite:
{code:scala}
test("jar in sync mode") {
  withCLIServiceClient { client =>
    val user = System.getProperty("user.name")
    val sessionHandle = client.openSession(user, "")
    val confOverlay = new java.util.HashMap[java.lang.String, java.lang.String]
    val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath
    Seq(s"ADD JAR $jarFile",
      "CREATE TABLE smallKV(key INT, val STRING)",
      s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE smallKV")
      .foreach(query => client.executeStatement(sessionHandle, query, confOverlay))
    client.executeStatement(sessionHandle,
      """CREATE TABLE addJar(key string)
        |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
      """.stripMargin, confOverlay)
    client.executeStatement(sessionHandle,
      "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1",
      confOverlay)
    val operationHandle = client.executeStatement(
      sessionHandle,
      "SELECT key FROM addJar",
      confOverlay)
    // Fetch result first time
    assertResult(1, "Fetching result first time from next row") {
      val rows_next = client.fetchResults(
        operationHandle,
        FetchOrientation.FETCH_NEXT,
        1000,
        FetchType.QUERY_OUTPUT)
      rows_next.numRows()
    }
  }
}
{code}
Running it produces a ClassNotFound error. was:HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. 
> SparkThriftServer can't support jar class as table serde class when > executestatement in sync mode > -- > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Add UT in HiveThriftBinaryServerSuit: > {code} > test("jar in sync mode") { > withCLIServiceClient { client => > val user = System.getProperty("user.name") > val sessionHandle = client.openSession(user, "") > val confOverlay = new java.util.HashMap[java.lang.String, > java.lang.String] > val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath > Seq(s"ADD JAR $jarFile", > "CREATE TABLE smallKV(key INT, val STRING)", > s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE > smallKV") > .foreach(query => client.executeStatement(sessionHandle, query, > confOverlay)) > client.executeStatement(sessionHandle, > """CREATE TABLE addJar(key string) > |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > """.stripMargin, confOverlay) > client.executeStatement(sessionHandle, > "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1", > confOverlay) > val operationHandle = client.executeStatement( > sessionHandle, > "SELECT key FROM addJar", > confOverlay) > // Fetch result first time > assertResult(1, "Fetching result first time from next row") { > val rows_next = client.fetchResults( > operationHandle, > FetchOrientation.FETCH_NEXT, > 1000, > FetchType.QUERY_OUTPUT) > rows_next.numRows() > } > } > } > {code} > Run it then got ClassNotFound error. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952804#comment-16952804 ] angerszhu edited comment on SPARK-29492 at 10/16/19 1:10 PM: - Will raise a PR soon. Connections via PyHive use sync mode. was (Author: angerszhuuu): raise a pr soon > SparkThriftServer can't support jar class as table serde class when > executestatement in sync mode > -- > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Add UT in HiveThriftBinaryServerSuite: > {code} > test("jar in sync mode") { > withCLIServiceClient { client => > val user = System.getProperty("user.name") > val sessionHandle = client.openSession(user, "") > val confOverlay = new java.util.HashMap[java.lang.String, > java.lang.String] > val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath > Seq(s"ADD JAR $jarFile", > "CREATE TABLE smallKV(key INT, val STRING)", > s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE > smallKV") > .foreach(query => client.executeStatement(sessionHandle, query, > confOverlay)) > client.executeStatement(sessionHandle, > """CREATE TABLE addJar(key string) > |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > """.stripMargin, confOverlay) > client.executeStatement(sessionHandle, > "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1", > confOverlay) > val operationHandle = client.executeStatement( > sessionHandle, > "SELECT key FROM addJar", > confOverlay) > // Fetch result first time > assertResult(1, "Fetching result first time from next row") { > val rows_next = client.fetchResults( > operationHandle, > FetchOrientation.FETCH_NEXT, > 1000, > FetchType.QUERY_OUTPUT) > rows_next.numRows() > } > } > } > {code} > Run it then got ClassNotFound error. 
[jira] [Updated] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29492: -- Summary: SparkThriftServer can't support jar class as table serde class when executestatement in sync mode (was: HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.) > SparkThriftServer can't support jar class as table serde class when > executestatement in sync mode > -- > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29492) HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952804#comment-16952804 ] angerszhu commented on SPARK-29492: --- Will raise a PR soon. > HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. > - > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.
[jira] [Updated] (SPARK-29492) HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29492: -- Description: HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. > HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. > - > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29492) HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.
angerszhu created SPARK-29492: - Summary: HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. Key: SPARK-29492 URL: https://issues.apache.org/jira/browse/SPARK-29492 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949886#comment-16949886 ] angerszhu commented on SPARK-29295: --- This problem seems to start from Hive 1.2. I tested in our environment on Hive 1.1, which doesn't have this problem. > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > > When we drop a partition of an external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true (the default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give a duplicate > result. > Here is reproduce code (you can add it to SQLQuerySuite in the hive module):
{code:scala}
test("spark gives duplicate result when dropping a partition of an external partitioned table" +
  " first and then overwriting it") {
  withTable("test") {
    withTempDir { f =>
      sql("create external table test(id int) partitioned by (name string) stored as " +
        s"parquet location '${f.getAbsolutePath}'")
      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
        sql("insert overwrite table test partition(name='n1') select 1")
        sql("ALTER TABLE test DROP PARTITION(name='n1')")
        sql("insert overwrite table test partition(name='n1') select 2")
        checkAnswer(sql("select id from test where name = 'n1' order by id"),
          Array(Row(1), Row(2)))
      }
      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) {
        sql("insert overwrite table test partition(name='n1') select 1")
        sql("ALTER TABLE test DROP PARTITION(name='n1')")
        sql("insert overwrite table test partition(name='n1') select 2")
        checkAnswer(sql("select id from test where name = 'n1' order by id"),
          Array(Row(2)))
      }
    }
  }
}
{code}
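Since dropping a partition of an external table removes only the metastore entry and leaves the data files in place, the duplicate rows reported above are the old files resurfacing alongside the new ones. A hedged cleanup sketch using the Hadoop `FileSystem` API follows; it assumes a `SparkSession` named `spark`, and the partition path is a hypothetical placeholder:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// After ALTER TABLE ... DROP PARTITION on an external table, the data files
// remain on disk. Deleting the partition directory explicitly prevents the
// next INSERT OVERWRITE from merging old and new files into a duplicate result.
val partitionDir = new Path("/path/to/table/name=n1") // hypothetical location
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
if (fs.exists(partitionDir)) {
  fs.delete(partitionDir, /* recursive = */ true)
}
{code}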
[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949050#comment-16949050 ] angerszhu commented on SPARK-29354: --- [~Elixir Kook] I downloaded spark-2.4.4-bin-hadoop2.7 from your link and I can find jline-2.14.6.jar in the jars/ folder. > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has direct dependency on jline, included in the root pom.xml > but binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar.
[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948410#comment-16948410 ] angerszhu commented on SPARK-29354: --- [~Elixir Kook] [~yumwang] jline is brought in by the hive-beeline module; you can find the dependency in hive-beeline's pom file. If you build like below: {code:java} ./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided -Phive-provided {code} you won't get a jline jar in the dist/jars/ folder. > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has direct dependency on jline, included in the root pom.xml > but binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar.
[jira] [Commented] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948399#comment-16948399 ] angerszhu commented on SPARK-29424: --- [~srowen] Since resource limits are in place, this bad behavior makes the program run very slowly without the user knowing why. Aborting early makes it easier for users to recognize where the problem is, especially for Spark Thrift Server. > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our user always submit bad SQL in query platform, Such as : > # write wrong join condition but submit that sql > # write wrong where condition > # etc.. > This case will make Spark scheduler to submit a lot of task. It will cause > spark run very slow and impact other user(spark thrift server) even run out > of memory because of too many object generated by a big num of tasks. > So I add a constraint when submit tasks and abort stage early when TaskSet > size num is bigger than the set limit. I wonder if the community will accept > this way. > cc [~srowen] [~dongjoon] [~yumwang]
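The constraint proposed above can be sketched as a guard at task-submission time. The config name, default, and surrounding scheduler calls below are hypothetical illustrations of the idea, not the actual patch:

{code:scala}
// Illustration only: abort a stage whose task count exceeds a configured cap,
// so a malformed query fails fast instead of starving the cluster.
// `spark.scheduler.maxTasksPerStage` is a hypothetical config name.
val maxTasks = conf.getInt("spark.scheduler.maxTasksPerStage", Int.MaxValue)
if (taskSet.tasks.length > maxTasks) {
  val msg = s"Stage ${taskSet.stageId} has ${taskSet.tasks.length} tasks, " +
    s"exceeding the limit of $maxTasks; aborting early"
  // Fail the task set before any tasks are launched.
  dagScheduler.taskSetFailed(taskSet, msg, exception = None)
}
{code}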
[jira] [Commented] (SPARK-29409) spark drop partition always throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948368#comment-16948368 ] angerszhu commented on SPARK-29409: --- Thanks, I will check this problem. > spark drop partition always throws Exception > > > Key: SPARK-29409 > URL: https://issues.apache.org/jira/browse/SPARK-29409 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: spark 2.4.0 on yarn 2.7.3 > spark-sql client mode > run hive version: 2.1.1 > hive builtin version 1.2.1 >Reporter: ant_nebula >Priority: Major > > The table is: > {code:java} > CREATE TABLE `test_spark.test_drop_partition`( > `platform` string, > `product` string, > `cnt` bigint) > PARTITIONED BY (dt string) > stored as orc;{code} > hive 2.1.1: > {code:java} > spark-sql -e "alter table test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > hive builtin: > {code:java} > spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf > spark.sql.hive.metastore.jars=builtin -e "alter table > test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > both would log Exception: > {code:java} > 19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current > connections: 1 > 19/10/09 18:21:27 INFO metastore: Connected to metastore. > 19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost > connection. Attempting to reconnect. 
> org.apache.thrift.transport.TTransportException: Cannot write to null > outputStream > at > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142) > at > org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178) > at > org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
[jira] [Commented] (SPARK-29288) Spark SQL add jar can't support HTTP path.
[ https://issues.apache.org/jira/browse/SPARK-29288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948364#comment-16948364 ] angerszhu commented on SPARK-29288: --- [~dongjoon] Sorry for the late reply; the Hive JIRA is https://issues.apache.org/jira/browse/HIVE-9664 > Spark SQL add jar can't support HTTP path. > --- > > Key: SPARK-29288 > URL: https://issues.apache.org/jira/browse/SPARK-29288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > SparkSQL > `ADD JAR` can't support url with http, livy schema , do we need to support it? > cc [~sro...@scient.com] > [~hyukjin.kwon][~dongjoon][~jerryshao][~juliuszsompolski] > Hive 2.3 support it, do we need to support it? > I can work on this.
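Until `ADD JAR` accepts HTTP URLs, one hedged client-side workaround is to download the jar to a local path first and then issue `ADD JAR` against that path. The URL and filename below are placeholders, and a `SparkSession` named `spark` is assumed:

{code:scala}
import java.io.File
import java.net.URL
import org.apache.commons.io.FileUtils

// Workaround sketch: fetch the remote jar locally, then ADD JAR the local
// copy. The URL is a placeholder, not a real artifact location.
val remote = new URL("http://repo.example.com/jars/my-serde.jar")
val local = new File("/tmp/my-serde.jar")
FileUtils.copyURLToFile(remote, local)
spark.sql(s"ADD JAR ${local.getAbsolutePath}")
{code}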
[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29424: -- Description: Our user always submit bad SQL in query platform, Such as : # write wrong join condition but submit that sql # write wrong where condition # etc.. This case will make Spark scheduler to submit a lot of task. It will cause spark run very slow and impact other user(spark thrift server) even run out of memory because of too many object generated by a big num of tasks. So I add a constraint when submit tasks and abort stage early when TaskSet size num is bigger then set limit . I wonder if the community will accept this way. cc [~srowen] [~dongjoon] [~yumwang] was: Our user always submit bad SQL in query platform, Such as : # write wrong join condition but submit that sql # write wrong where condition # etc.. This case will make Spark scheduler to submit a lot of task. It will cause spark run very slow and impact other user(spark thrift server) even run out of memory because of too many object generated by a big num of tasks. So I add a constraint when submit tasks and abort stage early when TaskSet size num is bigger then set limit . I wonder if the community will accept this way. cc [~srowen] > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our user always submit bad SQL in query platform, Such as : > # write wrong join condition but submit that sql > # write wrong where condition > # etc.. > This case will make Spark scheduler to submit a lot of task. It will cause > spark run very slow and impact other user(spark thrift server) even run out > of memory because of too many object generated by a big num of tasks. 
> So I add a constraint when submit tasks and abort stage early when TaskSet > size num is bigger then set limit . I wonder if the community will accept > this way. > cc [~srowen] [~dongjoon] [~yumwang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29424: -- Description: Our user always submit bad SQL in query platform, Such as : # write wrong join condition but submit that sql # write wrong where condition # etc.. This case will make Spark scheduler to submit a lot of task. It will cause spark run very slow and impact other user(spark thrift server) even run out of memory because of too many object generated by a big num of tasks. So I add a constraint when submit tasks and abort stage early when TaskSet size num is bigger then set limit . I wonder if the community will accept this way. cc [~srowen] was: Our user always submit bad SQL in query platform, Such as : # write wrong join condition but submit that sql # write wrong where condition # etc.. This case will make Spark scheduler to submit a lot of task. It will cause spark run very slow and impact other user(spark thrift server) even run out of memory because of too many object generated by a big num of tasks. So i add a constraint when submit tasks.I wonder if the community will accept it > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our user always submit bad SQL in query platform, Such as : > # write wrong join condition but submit that sql > # write wrong where condition > # etc.. > This case will make Spark scheduler to submit a lot of task. It will cause > spark run very slow and impact other user(spark thrift server) even run out > of memory because of too many object generated by a big num of tasks. > So I add a constraint when submit tasks and abort stage early when TaskSet > size num is bigger then set limit . I wonder if the community will accept > this way. 
> cc [~srowen] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29424: -- Description: Our user always submit bad SQL in query platform, Such as : # write wrong join condition but submit that sql # write wrong where condition # etc.. This case will make Spark scheduler to submit a lot of task. It will cause spark run very slow and impact other user(spark thrift server) even run out of memory because of too many object generated by a big num of tasks. So i add a constraint when submit tasks.I wonder if the community will accept it > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our user always submit bad SQL in query platform, Such as : > # write wrong join condition but submit that sql > # write wrong where condition > # etc.. > This case will make Spark scheduler to submit a lot of task. It will cause > spark run very slow and impact other user(spark thrift server) even run out > of memory because of too many object generated by a big num of tasks. > So i add a constraint when submit tasks.I wonder if the community will accept > it -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29424) Prevent Spark from submitting stages with too many tasks
angerszhu created SPARK-29424: - Summary: Prevent Spark from submitting stages with too many tasks Key: SPARK-29424 URL: https://issues.apache.org/jira/browse/SPARK-29424 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29354) Spark has a direct dependency on jline, but the 'without hadoop' binaries don't include a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948247#comment-16948247 ] angerszhu commented on SPARK-29354: --- I will check on this. > Spark has a direct dependency on jline, but the 'without hadoop' binaries > don't include a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has a direct dependency on jline, declared in the root pom.xml, > but the 'without hadoop' binaries don't include a jline jar file. > > Spark 2.2.x ships the jline jar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29409) spark drop partition always throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948145#comment-16948145 ] angerszhu commented on SPARK-29409: --- This looks like a mismatch between Spark's built-in Hive version and the Hive version used at runtime. > spark drop partition always throws Exception > > > Key: SPARK-29409 > URL: https://issues.apache.org/jira/browse/SPARK-29409 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: ant_nebula >Priority: Major > > The table is: > {code:java} > CREATE TABLE `test_spark.test_drop_partition`( > `platform` string, > `product` string, > `cnt` bigint) > PARTITIONED BY (dt string) > stored as orc;{code} > hive 2.1.1: > {code:java} > spark-sql -e "alter table test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > hive builtin: > {code:java} > spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf > spark.sql.hive.metastore.jars=builtin -e "alter table > test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > both would log Exception: > {code:java} > 19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current > connections: 1 > 19/10/09 18:21:27 INFO metastore: Connected to metastore. > 19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost > connection. Attempting to reconnect. 
> org.apache.thrift.transport.TTransportException: Cannot write to null > outputStream > at > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142) > at > org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178) > at > org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at >
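The comment above points at the gap between the Hive version Spark was built with and the metastore version it talks to at runtime. A configuration sketch (not a confirmed fix for this issue, and not runnable without a Spark deployment and a reachable metastore) of pointing Spark's Hive client at the same version as a 2.1.1 metastore, using the same `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` settings that already appear in the reproduction steps:

```scala
// Configuration sketch: align the Hive client version with the metastore
// server instead of running the builtin 1.2.1 client against Hive 2.1.1.
// The value "maven" tells Spark to download the matching Hive client jars.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "2.1.1")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
```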
[jira] [Comment Edited] (SPARK-29409) spark drop partition always throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947767#comment-16947767 ] angerszhu edited comment on SPARK-29409 at 10/9/19 2:54 PM: Can you share more details about how to reproduce this? was (Author: angerszhuuu): I will check on this. > spark drop partition always throws Exception > > > Key: SPARK-29409 > URL: https://issues.apache.org/jira/browse/SPARK-29409 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: ant_nebula >Priority: Major > > The table is: > {code:java} > CREATE TABLE `test_spark.test_drop_partition`( > `platform` string, > `product` string, > `cnt` bigint) > PARTITIONED BY (dt string) > stored as orc;{code} > hive 2.1.1: > {code:java} > spark-sql -e "alter table test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > hive builtin: > {code:java} > spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf > spark.sql.hive.metastore.jars=builtin -e "alter table > test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > both would log Exception: > {code:java} > 19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current > connections: 1 > 19/10/09 18:21:27 INFO metastore: Connected to metastore. > 19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost > connection. Attempting to reconnect. 
> org.apache.thrift.transport.TTransportException: Cannot write to null > outputStream > at > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142) > at > org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178) > at > org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at >
[jira] [Commented] (SPARK-29409) spark drop partition always throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947767#comment-16947767 ] angerszhu commented on SPARK-29409: --- I will check on this. > spark drop partition always throws Exception > > > Key: SPARK-29409 > URL: https://issues.apache.org/jira/browse/SPARK-29409 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: ant_nebula >Priority: Major > > The table is: > {code:java} > CREATE TABLE `test_spark.test_drop_partition`( > `platform` string, > `product` string, > `cnt` bigint) > PARTITIONED BY (dt string) > stored as orc;{code} > hive 2.1.1: > {code:java} > spark-sql -e "alter table test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > hive builtin: > {code:java} > spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf > spark.sql.hive.metastore.jars=builtin -e "alter table > test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > both would log Exception: > {code:java} > 19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current > connections: 1 > 19/10/09 18:21:27 INFO metastore: Connected to metastore. > 19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost > connection. Attempting to reconnect. 
> org.apache.thrift.transport.TTransportException: Cannot write to null > outputStream > at > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142) > at > org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178) > at > org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at >
[jira] [Issue Comment Deleted] (SPARK-29379) SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case'
[ https://issues.apache.org/jira/browse/SPARK-29379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29379: -- Comment: was deleted (was: We don't need to add a new expression class. If we just add this code to ShowFunctionsCommand, we would have to change a lot of function UTs:
{code:scala}
case class ShowFunctionsCommand(
    db: Option[String],
    pattern: Option[String],
    showUserFunctions: Boolean,
    showSystemFunctions: Boolean) extends RunnableCommand {

  override val output: Seq[Attribute] = {
    val schema = StructType(StructField("function", StringType, nullable = false) :: Nil)
    schema.toAttributes
  }

  override def run(sparkSession: SparkSession): Seq[Row] = {
    val dbName = db.getOrElse(sparkSession.sessionState.catalog.getCurrentDatabase)
    // If pattern is not specified, we use '*', which is used to
    // match any sequence of characters (including no characters).
    val functionNames =
      sparkSession.sessionState.catalog
        .listFunctions(dbName, pattern.getOrElse("*"))
        .collect {
          case (f, "USER") if showUserFunctions => f.unquotedString
          case (f, "SYSTEM") if showSystemFunctions => f.unquotedString
        }
    (functionNames ++ Seq("!=", "<>", "between", "case")).sorted.map(Row(_))
  }
}
{code}
) > SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' > > > Key: SPARK-29379 > URL: https://issues.apache.org/jira/browse/SPARK-29379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29379) SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case'
[ https://issues.apache.org/jira/browse/SPARK-29379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946526#comment-16946526 ] angerszhu commented on SPARK-29379: --- We don't need to add a new expression class. If we just add this code to ShowFunctionsCommand, we would have to change a lot of function UTs:
{code:scala}
case class ShowFunctionsCommand(
    db: Option[String],
    pattern: Option[String],
    showUserFunctions: Boolean,
    showSystemFunctions: Boolean) extends RunnableCommand {

  override val output: Seq[Attribute] = {
    val schema = StructType(StructField("function", StringType, nullable = false) :: Nil)
    schema.toAttributes
  }

  override def run(sparkSession: SparkSession): Seq[Row] = {
    val dbName = db.getOrElse(sparkSession.sessionState.catalog.getCurrentDatabase)
    // If pattern is not specified, we use '*', which is used to
    // match any sequence of characters (including no characters).
    val functionNames =
      sparkSession.sessionState.catalog
        .listFunctions(dbName, pattern.getOrElse("*"))
        .collect {
          case (f, "USER") if showUserFunctions => f.unquotedString
          case (f, "SYSTEM") if showSystemFunctions => f.unquotedString
        }
    (functionNames ++ Seq("!=", "<>", "between", "case")).sorted.map(Row(_))
  }
}
{code}
> SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' > > > Key: SPARK-29379 > URL: https://issues.apache.org/jira/browse/SPARK-29379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
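The essence of the change above is the last line of `run`: '!=', '<>', 'between', and 'case' are handled directly by the parser rather than registered in the function catalog, so they have to be appended before sorting. A minimal standalone sketch of that merge (the catalog names below are made-up examples, not a real catalog listing):

```scala
// Standalone sketch of the proposed SHOW FUNCTIONS fix: append the
// parser-handled operators to whatever the catalog returns, then sort.
object ShowFunctionsSketch {
  // Operators the parser handles directly; they never appear in the catalog.
  val parserOnlyFunctions: Seq[String] = Seq("!=", "<>", "between", "case")

  def listWithOperators(catalogFunctions: Seq[String]): Seq[String] =
    (catalogFunctions ++ parserOnlyFunctions).sorted
}
```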
[jira] [Created] (SPARK-29379) SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case'
angerszhu created SPARK-29379: - Summary: SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' Key: SPARK-29379 URL: https://issues.apache.org/jira/browse/SPARK-29379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29288) Spark SQL ADD JAR can't support HTTP paths.
[ https://issues.apache.org/jira/browse/SPARK-29288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941487#comment-16941487 ] angerszhu commented on SPARK-29288: --- [~dongjoon] Sorry for my mistake when reporting this issue. By the way, should we support Ivy paths for ADD JAR and ADD FILE? > Spark SQL ADD JAR can't support HTTP paths. > --- > > Key: SPARK-29288 > URL: https://issues.apache.org/jira/browse/SPARK-29288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > In Spark SQL, `ADD JAR` can't support URLs with the http or ivy scheme; do we need to support them? > cc [~sro...@scient.com] > [~hyukjin.kwon][~dongjoon][~jerryshao][~juliuszsompolski] > Hive 2.3 supports them, do we need to as well? > I can work on this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29288) Spark SQL ADD JAR can't support HTTP paths.
[ https://issues.apache.org/jira/browse/SPARK-29288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-29288. --- Resolution: Not A Problem > Spark SQL ADD JAR can't support HTTP paths. > --- > > Key: SPARK-29288 > URL: https://issues.apache.org/jira/browse/SPARK-29288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > In Spark SQL, `ADD JAR` can't support URLs with the http or ivy scheme; do we need to support them? > cc [~sro...@scient.com] > [~hyukjin.kwon][~dongjoon][~jerryshao][~juliuszsompolski] > Hive 2.3 supports them, do we need to as well? > I can work on this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29308) dev/deps/spark-deps-hadoop-3.2 orc jar is incorrect
[ https://issues.apache.org/jira/browse/SPARK-29308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29308: -- Description: In the hadoop-3.2 profile, orc.classifier is empty: https://github.com/apache/spark/blob/d841b33ba3a9b0504597dbccd4b0d11fa810abf3/pom.xml#L2924 So these entries are incorrect: https://github.com/apache/spark/blob/101839054276bfd52fdc29a98ffbf8e5c0383426/dev/deps/spark-deps-hadoop-3.2#L181-L182 was: In the hadoop-3.2 profile, orc.classifier is empty: https://github.com/apache/spark/blob/d841b33ba3a9b0504597dbccd4b0d11fa810abf3/pom.xml#L2924 So these entries are incorrect. > dev/deps/spark-deps-hadoop-3.2 orc jar is incorrect > > > Key: SPARK-29308 > URL: https://issues.apache.org/jira/browse/SPARK-29308 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > In the hadoop-3.2 profile, orc.classifier is empty: > https://github.com/apache/spark/blob/d841b33ba3a9b0504597dbccd4b0d11fa810abf3/pom.xml#L2924 > So these entries are incorrect: > https://github.com/apache/spark/blob/101839054276bfd52fdc29a98ffbf8e5c0383426/dev/deps/spark-deps-hadoop-3.2#L181-L182 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
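For reference, a sketch of the property in question: under the hadoop-3.2 profile the orc classifier is defined as empty, which is why classified orc entries in dev/deps/spark-deps-hadoop-3.2 look wrong. The profile structure below is abbreviated from the linked pom.xml lines, not copied verbatim:

```xml
<!-- Abbreviated sketch of the hadoop-3.2 profile property: with an empty
     orc.classifier, the orc artifacts resolve without a classifier, so the
     deps file should list plain orc jars. -->
<profile>
  <id>hadoop-3.2</id>
  <properties>
    <orc.classifier></orc.classifier>
  </properties>
</profile>
```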