[jira] [Commented] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095124#comment-17095124 ]

angerszhu commented on SPARK-31602:
-----------------------------------

cc [~cloud_fan]

> memory leak of JobConf
> ----------------------
>
>                 Key: SPARK-31602
>                 URL: https://issues.apache.org/jira/browse/SPARK-31602
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: angerszhu
>            Priority: Major
>         Attachments: image-2020-04-29-14-34-39-496.png, image-2020-04-29-14-35-55-986.png
>
> !image-2020-04-29-14-34-39-496.png!
> !image-2020-04-29-14-35-55-986.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31602:
------------------------------
    Description: 
!image-2020-04-29-14-34-39-496.png!
!image-2020-04-29-14-35-55-986.png!

  was:
!image-2020-04-29-14-34-39-496.png!
Screen Shot 2020-04-29 at 2.08.28 PM
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31602:
------------------------------
    Description: 
!image-2020-04-29-14-34-39-496.png!
Screen Shot 2020-04-29 at 2.08.28 PM

  was:
!image-2020-04-29-14-30-46-213.png!
!image-2020-04-29-14-30-55-964.png!
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31602:
------------------------------
    Attachment: image-2020-04-29-14-35-55-986.png
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31602:
------------------------------
    Attachment: image-2020-04-29-14-34-39-496.png
[jira] [Commented] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095122#comment-17095122 ]

angerszhu commented on SPARK-31602:
-----------------------------------

In HadoopRDD, if you don't set spark.hadoop.cloneConf=true, each new JobConf is put into the cached metadata map and is never removed. Maybe we should add a clear method?

{code:java}
// Returns a JobConf that will be used on slaves to obtain input splits for Hadoop reads.
protected def getJobConf(): JobConf = {
  val conf: Configuration = broadcastedConf.value.value
  if (shouldCloneJobConf) {
    // Hadoop Configuration objects are not thread-safe, which may lead to various problems if
    // one job modifies a configuration while another reads it (SPARK-2546). This problem occurs
    // somewhat rarely because most jobs treat the configuration as though it's immutable. One
    // solution, implemented here, is to clone the Configuration object. Unfortunately, this
    // clone can be very expensive. To avoid unexpected performance regressions for workloads and
    // Hadoop versions that do not suffer from these thread-safety issues, this cloning is
    // disabled by default.
    HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
      logDebug("Cloning Hadoop Configuration")
      val newJobConf = new JobConf(conf)
      if (!conf.isInstanceOf[JobConf]) {
        initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
      }
      newJobConf
    }
  } else {
    if (conf.isInstanceOf[JobConf]) {
      logDebug("Re-using user-broadcasted JobConf")
      conf.asInstanceOf[JobConf]
    } else {
      Option(HadoopRDD.getCachedMetadata(jobConfCacheKey))
        .map { conf =>
          logDebug("Re-using cached JobConf")
          conf.asInstanceOf[JobConf]
        }
        .getOrElse {
          // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in
          // the local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
          // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary
          // objects. Synchronize to prevent ConcurrentModificationException (SPARK-1097,
          // HADOOP-10456).
          HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
            logDebug("Creating new JobConf and caching it for later re-use")
            val newJobConf = new JobConf(conf)
            initLocalJobConfFuncOpt.foreach(f => f(newJobConf))
            HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
            newJobConf
          }
        }
    }
  }
}
{code}

There is no removal path for this cached job metadata:

{code:java}
/**
 * The three methods below are helpers for accessing the local map, a property of the SparkEnv of
 * the local process.
 */
def getCachedMetadata(key: String): Any = SparkEnv.get.hadoopJobMetadata.get(key)

private def putCachedMetadata(key: String, value: Any): Unit =
  SparkEnv.get.hadoopJobMetadata.put(key, value)
{code}

For SQL over Hive data, each partition generates one JobConf, which is heavy.
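One possible direction for the "clear method" suggested above is to bound the metadata map with LRU eviction instead of letting it grow forever. The sketch below is illustrative only, not Spark's actual implementation; the class name and the bound are hypothetical:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (not Spark code): an LRU-bounded metadata cache, so
// cached JobConf-like values are evicted instead of accumulating for the
// lifetime of the process.
public class BoundedMetadataCache {
    private final Map<String, Object> cache;

    public BoundedMetadataCache(final int maxEntries) {
        this.cache = Collections.synchronizedMap(
            new LinkedHashMap<String, Object>(16, 0.75f, /* accessOrder= */ true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                    // Evict the least-recently-used entry once the bound is exceeded.
                    return size() > maxEntries;
                }
            });
    }

    public Object get(String key) { return cache.get(key); }

    public void put(String key, Object value) { cache.put(key, value); }

    public int size() { return cache.size(); }

    // The explicit "clear method" the comment above asks for.
    public void clear() { cache.clear(); }
}
```

With a bound of 2, inserting a third key evicts the least-recently-used one, so old JobConfs would no longer be retained indefinitely.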
[jira] [Created] (SPARK-31602) memory leak of JobConf
angerszhu created SPARK-31602:
---------------------------------

             Summary: memory leak of JobConf
                 Key: SPARK-31602
                 URL: https://issues.apache.org/jira/browse/SPARK-31602
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: angerszhu

!image-2020-04-29-14-30-46-213.png!
!image-2020-04-29-14-30-55-964.png!
[jira] [Issue Comment Deleted] (SPARK-31334) Use agg column in Having clause behaves differently with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31334:
------------------------------
    Comment: was deleted

(was: cc [~cloud_fan] [~yumwang] )

> Use agg column in Having clause behaves differently with column type
> ---------------------------------------------------------------------
>
>                 Key: SPARK-31334
>                 URL: https://issues.apache.org/jira/browse/SPARK-31334
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: angerszhu
>            Priority: Major
>
> {code:java}
> test("") {
>   Seq(
>     (1, 3), (2, 3), (3, 6), (4, 7), (5, 9), (6, 9)
>   ).toDF("a", "b").createOrReplaceTempView("testData")
>   val x = sql(
>     """
>       | SELECT b, sum(a) as a
>       | FROM testData
>       | GROUP BY b
>       | HAVING sum(a) > 3
>     """.stripMargin)
>   x.explain()
>   x.show()
> }
>
> [info] - *** FAILED *** (508 milliseconds)
> [info] org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. Attribute(s) with the same name appear in the operation: a. Please check if the right attribute(s) are used.;;
> [info] Project [b#181, a#184]
> [info] +- Filter (sum(a#184)#188 > cast(3 as double))
> [info]    +- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]
> [info]       +- SubqueryAlias `testdata`
> [info]          +- Project [_1#177 AS a#180, _2#178 AS b#181]
> [info]             +- LocalRelation [_1#177, _2#178]
> {code}
> {code:java}
> test("") {
>   Seq(
>     ("1", "3"), ("2", "3"), ("3", "6"), ("4", "7"), ("5", "9"), ("6", "9")
>   ).toDF("a", "b").createOrReplaceTempView("testData")
>   val x = sql(
>     """
>       | SELECT b, sum(a) as a
>       | FROM testData
>       | GROUP BY b
>       | HAVING sum(a) > 3
>     """.stripMargin)
>   x.explain()
>   x.show()
> }
>
> == Physical Plan ==
> *(2) Project [b#181, a#184L]
> +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 as bigint))#197L > 3))
>    +- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))])
>       +- Exchange hashpartitioning(b#181, 5)
>          +- *(1) HashAggregate(keys=[b#181], functions=[partial_sum(cast(a#180 as bigint))])
>             +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181]
>                +- LocalTableScan [_1#177, _2#178]
> {code}
> I spent a lot of time but could not find which analyzer rule causes the difference. When the column type is double, it fails.
[jira] [Commented] (SPARK-31334) Use agg column in Having clause behaves differently with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074381#comment-17074381 ]

angerszhu commented on SPARK-31334:
-----------------------------------

I have found the reason. When the logical plan

{code:java}
'Filter ('sum('a) > 3)
+- Aggregate [b#181], [b#181, sum(a#180) AS a#184L]
   +- SubqueryAlias `testdata`
      +- Project [_1#177 AS a#180, _2#178 AS b#181]
         +- LocalRelation [_1#177, _2#178]
{code}

comes into ResolveAggregateFunctions, `a` is of String type, so the aggregate expression is still unresolved and ResolveAggregateFunctions makes no change to the plan above. The `sum(a)` in the Filter condition is then resolved later by ResolveReferences, and its `a` is resolved against the aggregate's output column `a` rather than the input column; that is where the error happens.
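If the reading above is right, one possible workaround at the query level (an untested sketch, not a verified fix) is to avoid reusing the input column name as the aggregate alias, so the reference inside HAVING cannot collide with the aggregate's output attribute:

```sql
-- Untested workaround sketch: alias the aggregate to a name (sum_a) that does
-- not shadow the input column "a", and cast explicitly so both branches
-- resolve to the same type.
SELECT b, SUM(CAST(a AS DOUBLE)) AS sum_a
FROM testData
GROUP BY b
HAVING SUM(CAST(a AS DOUBLE)) > 3
```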
[jira] [Commented] (SPARK-31334) Use agg column in Having clause behaves differently with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074296#comment-17074296 ]

angerszhu commented on SPARK-31334:
-----------------------------------

cc [~cloud_fan] [~yumwang]
[jira] [Updated] (SPARK-31334) Use agg column in Having clause behaves differently with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31334:
------------------------------
    Description: 
{code:java}
test("") {
  Seq(
    (1, 3), (2, 3), (3, 6), (4, 7), (5, 9), (6, 9)
  ).toDF("a", "b").createOrReplaceTempView("testData")
  val x = sql(
    """
      | SELECT b, sum(a) as a
      | FROM testData
      | GROUP BY b
      | HAVING sum(a) > 3
    """.stripMargin)
  x.explain()
  x.show()
}

[info] - *** FAILED *** (508 milliseconds)
[info] org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. Attribute(s) with the same name appear in the operation: a. Please check if the right attribute(s) are used.;;
[info] Project [b#181, a#184]
[info] +- Filter (sum(a#184)#188 > cast(3 as double))
[info]    +- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]
[info]       +- SubqueryAlias `testdata`
[info]          +- Project [_1#177 AS a#180, _2#178 AS b#181]
[info]             +- LocalRelation [_1#177, _2#178]
{code}

{code:java}
test("") {
  Seq(
    ("1", "3"), ("2", "3"), ("3", "6"), ("4", "7"), ("5", "9"), ("6", "9")
  ).toDF("a", "b").createOrReplaceTempView("testData")
  val x = sql(
    """
      | SELECT b, sum(a) as a
      | FROM testData
      | GROUP BY b
      | HAVING sum(a) > 3
    """.stripMargin)
  x.explain()
  x.show()
}

== Physical Plan ==
*(2) Project [b#181, a#184L]
+- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 as bigint))#197L > 3))
   +- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))])
      +- Exchange hashpartitioning(b#181, 5)
         +- *(1) HashAggregate(keys=[b#181], functions=[partial_sum(cast(a#180 as bigint))])
            +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181]
               +- LocalTableScan [_1#177, _2#178]
{code}

I spent a lot of time but could not find which analyzer rule causes the difference. When the column type is double, it fails.
[jira] [Created] (SPARK-31334) Use agg column in Having clause behaves differently with column type
angerszhu created SPARK-31334:
---------------------------------

             Summary: Use agg column in Having clause behaves differently with column type
                 Key: SPARK-31334
                 URL: https://issues.apache.org/jira/browse/SPARK-31334
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0, 3.0.0
            Reporter: angerszhu
[jira] [Updated] (SPARK-31298) validate CTAS table path in SPARK-19724 seems to conflict and External table also needs to check non-empty
[ https://issues.apache.org/jira/browse/SPARK-31298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31298:
------------------------------
    Summary: validate CTAS table path in SPARK-19724 seems to conflict and External table also needs to check non-empty  (was: validate CTAS table path in SPARK-19724 seems to conflict)

> validate CTAS table path in SPARK-19724 seems to conflict and External table also needs to check non-empty
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31298
>                 URL: https://issues.apache.org/jira/browse/SPARK-31298
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: angerszhu
>            Priority: Major
>
> In SessionCatalog.validateTableLocation():
> {code:java}
> val tableLocation =
>   new Path(table.storage.locationUri.getOrElse(defaultTablePath(table.identifier)))
> {code}
> But in CreateDataSourceTableAsSelect, the table location uses defaultTablePath:
> {code:java}
> assert(table.schema.isEmpty)
> sparkSession.sessionState.catalog.validateTableLocation(table)
> val tableLocation = if (table.tableType == CatalogTableType.MANAGED) {
>   Some(sessionState.catalog.defaultTablePath(table.identifier))
> } else {
>   table.storage.locationUri
> }
> {code}
[jira] [Updated] (SPARK-31298) validate CTAS table path in SPARK-19724 seems to conflict
[ https://issues.apache.org/jira/browse/SPARK-31298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31298:
------------------------------
    Summary: validate CTAS table path in SPARK-19724 seems to conflict  (was: validate External path in SPARK-19724 seems to conflict)
[jira] [Updated] (SPARK-31298) validate External path in SPARK-19724 seems to conflict
[ https://issues.apache.org/jira/browse/SPARK-31298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31298:
------------------------------
    Description: 
In SessionCatalog.validateTableLocation():

{code:java}
val tableLocation =
  new Path(table.storage.locationUri.getOrElse(defaultTablePath(table.identifier)))
{code}

But in CreateDataSourceTableAsSelect, the table location uses defaultTablePath:

{code:java}
assert(table.schema.isEmpty)
sparkSession.sessionState.catalog.validateTableLocation(table)
val tableLocation = if (table.tableType == CatalogTableType.MANAGED) {
  Some(sessionState.catalog.defaultTablePath(table.identifier))
} else {
  table.storage.locationUri
}
{code}
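The conflict in the two snippets above can be shown with a toy model (an illustrative sketch with hypothetical names, not Spark code): for a MANAGED table whose catalog entry carries a locationUri, validation checks one path while CTAS writes to another.

```java
import java.util.Optional;

// Illustrative sketch (not Spark code): the path that gets *validated* vs.
// the path CTAS actually *uses* can differ for a MANAGED table whose catalog
// entry carries a locationUri.
public class CtasPathSketch {
    // Stand-in for SessionCatalog.defaultTablePath (hypothetical).
    static String defaultTablePath(String table) {
        return "/warehouse/" + table;
    }

    // Mirrors validateTableLocation(): locationUri, falling back to the default.
    static String validatedPath(Optional<String> locationUri, String table) {
        return locationUri.orElse(defaultTablePath(table));
    }

    // Mirrors CreateDataSourceTableAsSelect: MANAGED tables always use the default path.
    static String ctasPath(boolean managed, Optional<String> locationUri, String table) {
        return managed ? defaultTablePath(table)
                       : locationUri.orElse(defaultTablePath(table));
    }
}
```

With locationUri = /custom/t on a MANAGED table, validatedPath returns /custom/t while ctasPath returns /warehouse/t, i.e. the checked path and the written path disagree.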
[jira] [Created] (SPARK-31298) validate External path in SPARK-19724 seems conflict
angerszhu created SPARK-31298:
---------------------------------

             Summary: validate External path in SPARK-19724 seems to conflict
                 Key: SPARK-31298
                 URL: https://issues.apache.org/jira/browse/SPARK-31298
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: angerszhu
[jira] [Commented] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval
[ https://issues.apache.org/jira/browse/SPARK-31268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069939#comment-17069939 ]

angerszhu commented on SPARK-31268:
-----------------------------------

[https://github.com/apache/spark/pull/28034]

> TaskEnd event with zero Executor Metrics when task duration less than poll interval
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-31268
>                 URL: https://issues.apache.org/jira/browse/SPARK-31268
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: angerszhu
>            Priority: Major
>         Attachments: screenshot-1.png
>
> TaskEnd event with zero Executor Metrics when task duration less than poll interval
[jira] [Commented] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval
[ https://issues.apache.org/jira/browse/SPARK-31268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067661#comment-17067661 ]

angerszhu commented on SPARK-31268:
-----------------------------------

I will raise a PR soon.
[jira] [Commented] (SPARK-31270) Expose executor memory metrics at the task detail, in the Stages tab
[ https://issues.apache.org/jira/browse/SPARK-31270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067657#comment-17067657 ]

angerszhu commented on SPARK-31270:
-----------------------------------

I will raise a PR soon.

> Expose executor memory metrics at the task detail, in the Stages tab
> --------------------------------------------------------------------
>
>                 Key: SPARK-31270
>                 URL: https://issues.apache.org/jira/browse/SPARK-31270
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: angerszhu
>            Priority: Major
[jira] [Created] (SPARK-31270) Expose executor memory metrics at the task detail, in the Stages tab
angerszhu created SPARK-31270:
---------------------------------

             Summary: Expose executor memory metrics at the task detail, in the Stages tab
                 Key: SPARK-31270
                 URL: https://issues.apache.org/jira/browse/SPARK-31270
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: angerszhu
[jira] [Updated] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval
[ https://issues.apache.org/jira/browse/SPARK-31268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31268:
------------------------------
    Attachment: screenshot-1.png
[jira] [Created] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval
angerszhu created SPARK-31268:
---------------------------------

             Summary: TaskEnd event with zero Executor Metrics when task duration less than poll interval
                 Key: SPARK-31268
                 URL: https://issues.apache.org/jira/browse/SPARK-31268
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: angerszhu
[jira] [Updated] (SPARK-31268) TaskEnd event with zero Executor Metrics when task duration less than poll interval
[ https://issues.apache.org/jira/browse/SPARK-31268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-31268:
------------------------------
    Description: The TaskEnd event carries zero Executor Metrics when the task duration is less than the poll interval.
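The symptom described above can be sketched abstractly (hypothetical names, not Spark's implementation): if executor metrics are only sampled once per poll interval, a task that finishes sooner is never sampled, so its TaskEnd metrics read zero unless an extra sample is taken at task end.

```java
// Illustrative sketch (not Spark code) of why short tasks report zero metrics:
// a task is only covered by the periodic poller if it lives at least one poll
// interval, or if an explicit sample is taken when the task ends.
public class MetricsPollSketch {
    static long peakAtTaskEnd(long taskDurationMs, long pollIntervalMs,
                              long observedPeak, boolean pollOnTaskEnd) {
        // The periodic poller samples the task only if it survives one full interval.
        boolean sampled = taskDurationMs >= pollIntervalMs || pollOnTaskEnd;
        // Never sampled => the TaskEnd event carries zero metrics.
        return sampled ? observedPeak : 0L;
    }
}
```

A 100 ms task under a 1000 ms poll interval reports 0 without a task-end poll, and its real peak with one; a 5000 ms task is covered either way.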
[jira] [Commented] (SPARK-26341) Expose executor memory metrics at the stage level, in the Stages tab
[ https://issues.apache.org/jira/browse/SPARK-26341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066572#comment-17066572 ] angerszhu commented on SPARK-26341: --- I have done this in our own version and will raise a PR in the next few days. > Expose executor memory metrics at the stage level, in the Stages tab > > > Key: SPARK-26341 > URL: https://issues.apache.org/jira/browse/SPARK-26341 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 2.4.0 >Reporter: Edward Lu >Priority: Major > > Sub-task SPARK-23431 will add stage level executor memory metrics (peak > values for each stage, and peak values for each executor for the stage). This > information should also be exposed in the web UI, so that users can see > which stages are memory intensive.
[jira] [Created] (SPARK-31226) SizeBasedCoalesce logic error
angerszhu created SPARK-31226: - Summary: SizeBasedCoalesce logic error Key: SPARK-31226 URL: https://issues.apache.org/jira/browse/SPARK-31226 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu
[jira] [Updated] (SPARK-31226) SizeBasedCoalesce logic error
[ https://issues.apache.org/jira/browse/SPARK-31226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31226: -- Description: In Spark's unit tests, SizeBasedCoalesce's logic is wrong > SizeBasedCoalesce logic error > - > > Key: SPARK-31226 > URL: https://issues.apache.org/jira/browse/SPARK-31226 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Minor > > In Spark's unit tests, > SizeBasedCoalesce's logic is wrong
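For context: size-based coalescing groups consecutive input partitions until a cumulative size target would be exceeded, and the usual ways such logic goes wrong are boundary bugs, e.g. dropping the trailing group or mishandling a single partition larger than the target. A hypothetical sketch of the intended grouping (this is not the `SizeBasedCoalesce` helper from the Spark UT; names and the greedy strategy are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

public class SizeBasedCoalesceSketch {
    // Greedily groups consecutive partition sizes so each group's total
    // stays at or under maxGroupSize. A partition larger than the limit
    // gets a group of its own rather than being dropped.
    public static List<List<Long>> coalesceBySize(long[] sizes, long maxGroupSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentTotal = 0L;
        for (long size : sizes) {
            if (!current.isEmpty() && currentTotal + size > maxGroupSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentTotal = 0L;
            }
            current.add(size);
            currentTotal += size;
        }
        if (!current.isEmpty()) {
            groups.add(current); // keep the trailing group -- a classic off-by-one bug is to lose it
        }
        return groups;
    }
}
```

For example, sizes {40, 40, 40, 100, 10} with a 100-byte target should yield four groups: {40, 40}, {40}, {100}, {10}.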
[jira] [Commented] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code
[ https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063909#comment-17063909 ] angerszhu commented on SPARK-27097: --- [~irashid] To be honest, I ran into this problem recently. [~dbtsai] I have a question. We run a self-developed thrift server program that uses Spark as the compute engine, with the JVM options below: {code:java} -Xmx64g -Djava.library.path=/home/hadoop/hadoop/lib/native -Djavax.security.auth.useSubjectCredsOnly=false -Dcom.sun.management.jmxremote.port=9021 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:MaxPermSize=1024m -XX:PermSize=256m -XX:MaxDirectMemorySize=8192m -XX:-TraceClassUnloading -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC -XX:+PrintTenuringDistribution -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 -Xnoclassgc -XX:+PrintGCDetails -XX:+PrintGCDateStamps {code} With these options, Platform.BYTE_ARRAY_OFFSET is 24; when we start a normal Spark thrift server, the value is 16. This mismatch caused strange data corruption. After a few days of investigation, I traced the problem to Spark *codegen*, and this PR fixes our problem, but I can't find evidence for why Platform.BYTE_ARRAY_OFFSET is 24 under the options above, since I tested locally that with -XX:+UseCompressedOops (pointer compression on) the offset is 16, and with -XX:-UseCompressedOops (pointer compression off) it is 24.
It is easy to understand why those two offsets differ, but I don't know why the options above yield 24, since I am no expert on the JVM internals involved. Can you give me some advice or pointers on how to understand and find the root cause? > Avoid embedding platform-dependent offsets literally in whole-stage generated > code > -- > > Key: SPARK-27097 > URL: https://issues.apache.org/jira/browse/SPARK-27097 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.3, 2.2.3, 2.3.4, 2.4.0 >Reporter: Xiao Li >Assignee: Kris Mok >Priority: Critical > Labels: correctness > Fix For: 2.4.1 > > > Avoid embedding platform-dependent offsets literally in whole-stage generated > code. > Spark SQL performs whole-stage code generation to speed up query execution. > There are two steps to it: > Java source code is generated from the physical query plan on the driver. A > single version of the source code is generated from a query plan, and sent to > all executors. > It's compiled to bytecode on the driver to catch compilation errors before > sending to executors, but currently only the generated source code gets sent > to the executors. The bytecode compilation is for fail-fast only. > Executors receive the generated source code and compile to bytecode, then the > query runs like a hand-written Java program. > In this model, there's an implicit assumption about the driver and executors > being run on similar platforms. Some code paths accidentally embedded > platform-dependent object layout information into the generated code, such as: > {code:java} > Platform.putLong(buffer, /* offset */ 24, /* value */ 1); > {code} > This code expects a field to be at offset +24 of the buffer object, and sets > a value to that field. > But whole-stage code generation generally uses platform-dependent information > from the driver.
If the object layout is significantly different on the > driver and executors, the generated code can be reading/writing to wrong > offsets on the executors, causing all kinds of data corruption. > One code pattern that leads to such problem is the use of Platform.XXX > constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET. > Bad: > {code:java} > val baseOffset = Platform.BYTE_ARRAY_OFFSET > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into > the generated code. > {code} > Good: > {code:java} > val baseOffset = "Platform.BYTE_ARRAY_OFFSET" > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > This will generate the offset symbolically -- Platform.putLong(buffer, >
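The Bad/Good contrast quoted above can be made concrete with the string templates themselves: the bad template bakes the driver's numeric offset into the generated source, while the good one emits the symbolic constant so each executor resolves its own value when it compiles the code. A simplified sketch (here `DRIVER_BYTE_ARRAY_OFFSET` is a stand-in for the driver-side `Platform.BYTE_ARRAY_OFFSET`; 16 is the typical value with compressed oops, as discussed in the comment above):

```java
public class CodegenOffsetSketch {
    // Stand-in for the driver's Platform.BYTE_ARRAY_OFFSET: 16 with
    // -XX:+UseCompressedOops, 24 without, on typical 64-bit HotSpot.
    static final long DRIVER_BYTE_ARRAY_OFFSET = 16L;

    // Bad: the driver's numeric value is frozen into the generated source,
    // which an executor with a different object layout will still run.
    public static String badTemplate(String buffer, String value) {
        return "Platform.putLong(" + buffer + ", " + DRIVER_BYTE_ARRAY_OFFSET
                + ", " + value + ");";
    }

    // Good: the offset is emitted symbolically and resolved on each executor.
    public static String goodTemplate(String buffer, String value) {
        return "Platform.putLong(" + buffer + ", Platform.BYTE_ARRAY_OFFSET, "
                + value + ");";
    }
}
```

With mismatched layouts, code produced by `badTemplate` reads and writes at the wrong offset on the executor; code from `goodTemplate` is layout-portable.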
[jira] [Commented] (SPARK-30707) Lead/Lag window function throws AnalysisException without ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-30707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055642#comment-17055642 ] angerszhu commented on SPARK-30707: --- Added a PR: [https://github.com/apache/spark/pull/27861] > Lead/Lag window function throws AnalysisException without ORDER BY clause > - > > Key: SPARK-30707 > URL: https://issues.apache.org/jira/browse/SPARK-30707 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > Lead/Lag window function throws AnalysisException without ORDER BY clause: > {code:java} > SELECT lead(ten, four + 1) OVER (PARTITION BY four), ten, four > FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s > org.apache.spark.sql.AnalysisException > Window function lead(ten#x, (four#x + 1), null) requires window to be > ordered, please add ORDER BY clause. For example SELECT lead(ten#x, (four#x + > 1), null)(value_expr) OVER (PARTITION BY window_partition ORDER BY > window_ordering) from table; > {code} > > Maybe we need to fix this issue.
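The AnalysisException message itself points at the workaround: give the window an ORDER BY clause. A hedged rewrite of the failing query above (ordering by {{ten}} is one reasonable choice for illustration; the issue itself is about whether Spark should require an ORDER BY here at all):

```sql
SELECT lead(ten, four + 1) OVER (PARTITION BY four ORDER BY ten), ten, four
FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten) s;
```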
[jira] [Commented] (SPARK-30707) Lead/Lag window function throws AnalysisException without ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-30707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1709#comment-1709 ] angerszhu commented on SPARK-30707: --- Our production environment hits this problem too when Hive SQL runs on the Spark engine. I am working on this and will raise a PR soon > Lead/Lag window function throws AnalysisException without ORDER BY clause > - > > Key: SPARK-30707 > URL: https://issues.apache.org/jira/browse/SPARK-30707 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > Lead/Lag window function throws AnalysisException without ORDER BY clause: > {code:java} > SELECT lead(ten, four + 1) OVER (PARTITION BY four), ten, four > FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s > org.apache.spark.sql.AnalysisException > Window function lead(ten#x, (four#x + 1), null) requires window to be > ordered, please add ORDER BY clause. For example SELECT lead(ten#x, (four#x + > 1), null)(value_expr) OVER (PARTITION BY window_partition ORDER BY > window_ordering) from table; > {code} > > Maybe we need to fix this issue.
[jira] [Created] (SPARK-30694) If an exception occurs while fetching blocks via ExternalBlockClient, fail early when the External Shuffle Service is not alive
angerszhu created SPARK-30694: - Summary: If an exception occurs while fetching blocks via ExternalBlockClient, fail early when the External Shuffle Service is not alive Key: SPARK-30694 URL: https://issues.apache.org/jira/browse/SPARK-30694 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: angerszhu
[jira] [Created] (SPARK-30538) A not very elegant way to control output of small files
angerszhu created SPARK-30538: - Summary: A not very elegant way to control output of small files Key: SPARK-30538 URL: https://issues.apache.org/jira/browse/SPARK-30538 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu
[jira] [Created] (SPARK-30435) Update the Spark SQL guide's Supported Hive Features section
angerszhu created SPARK-30435: - Summary: Update the Spark SQL guide's Supported Hive Features section Key: SPARK-30435 URL: https://issues.apache.org/jira/browse/SPARK-30435 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 3.0.0 Reporter: angerszhu
[jira] [Updated] (SPARK-29800) Rewrite non-correlated EXISTS subquery to use ScalarSubquery to optimize perf
[ https://issues.apache.org/jira/browse/SPARK-29800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29800: -- Summary: Rewrite non-correlated EXISTS subquery to use ScalarSubquery to optimize perf (was: Rewrite non-correlated subquery to use ScalarSubquery to optimize perf) > Rewrite non-correlated EXISTS subquery to use ScalarSubquery to optimize perf > - > > Key: SPARK-29800 > URL: https://issues.apache.org/jira/browse/SPARK-29800 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major >
[jira] [Updated] (SPARK-29800) Rewrite non-correlated subquery to use ScalarSubquery to optimize perf
[ https://issues.apache.org/jira/browse/SPARK-29800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29800: -- Summary: Rewrite non-correlated subquery to use ScalarSubquery to optimize perf (was: Plan Exists 's subquery in PlanSubqueries) > Rewrite non-correlated subquery to use ScalarSubquery to optimize perf > -- > > Key: SPARK-29800 > URL: https://issues.apache.org/jira/browse/SPARK-29800 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major >
[jira] [Updated] (SPARK-29018) Build Spark thrift server on its own code based on protocol v11
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: h2. Background With the development of Spark and Hive, in the current sql/hive-thriftserver module we need to do a lot of work to resolve code conflicts between the different built-in Hive versions. It is annoying, unending work under the current approach, and these issues have limited our ability and convenience to develop new features for Spark's thrift server. We propose to implement a new thrift server and JDBC driver based on Hive's latest v11 TCLIService.thrift protocol. The new thrift server will have the features below: # Build a new module spark-service as Spark's thrift server # No need for as much reflection and inherited code as the `hive-thriftserver` module # Support all functions the current `sql/hive-thriftserver` supports # Use code maintained entirely by Spark itself, with no dependency on Hive # Support the original functions in Spark's own way, not limited by Hive's code # Support running with or without a Hive metastore # Support user impersonation via multi-tenant split Hive authentication and DFS authentication # Support session hooks with Spark's own code # Add a new JDBC driver spark-jdbc, with Spark's own connection url “jdbc:spark::/” # Support both hive-jdbc and spark-jdbc clients, so we can support most clients and BI platforms h2. How to start? We can start this new thrift server with the shell script *sbin/start-spark-thriftserver.sh* and stop it with *sbin/stop-spark-thriftserver.sh*. We don't need HiveConf's configurations to determine the characteristics of the Spark thrift server; we have implemented all needed configuration in Spark itself in `org.apache.spark.sql.service.internal.ServiceConf`, and hive-site.xml is only used to connect to the Hive metastore. We can write all the conf we need in *conf/spark-defaults.conf* or pass it in the startup command via *--conf* h2. How to connect through jdbc?
Now we support both hive-jdbc and spark-jdbc; users can choose whichever they like h3. spark-jdbc # Use `SparkDriver` as the JDBC driver class # Connection url `jdbc:spark://:,:/dbName;sess_var_list?conf_list#var_list`, mostly the same as Hive's but with Spark's special url prefix `jdbc:spark` # For proxying, SparkDriver users should set the proxy conf `spark.sql.thriftserver.proxy.user=username` h3. hive-jdbc # Use `HiveDriver` as the JDBC driver class # Connection string jdbc:hive2://:,:/dbName;sess_var_list?conf_list#var_list as before # For proxying, HiveDriver users should set the proxy conf hive.server2.proxy.user=username; the current server supports both configs h2. How is it done today, and what are the limits of current practice? h3. Current practice We have completed the two modules `spark-service` & `spark-jdbc`; they run well, we have ported the original UTs to them, and the UTs pass. For impersonation, we have written the code and tested it in our kerberized environment; it works well and awaits review. Now we will raise PRs to the apache/spark master branch step by step. h3. Here are some known changes: # No Hive code is used in the `spark-service` and `spark-jdbc` modules # In the current service, the default rc-file suffix `.hiverc` is replaced by `.sparkrc` # When using SparkDriver as the JDBC driver class, the url should be jdbc:spark://:,:/dbName;sess_var_list?conf_list#var_list # When using SparkDriver as the JDBC driver class, the proxy conf should be `spark.sql.thriftserver.proxy.user=proxy_user_name` # Support `hiveconf` and `hivevar` session conf through a hive-jdbc connection h2. What are the risks? This is a totally new module and won't change other modules' code except to support impersonation. Apart from impersonation, we have added many UTs adapted (to fit the grammar without Hive) from the original UTs, and all of them pass. For impersonation, I have tested it in our kerberized environment, but it still needs detailed review since it changes a lot. h2. How long will it take?
We have done all this work in our own repo; now we plan to merge our code into master step by step. # Phase1: PR to build the new module *spark-service* in folder *sql/service* # Phase2: PR with the thrift protocol and the generated thrift protocol Java code # Phase3: PR with all *spark-service* module code and a description of the design, plus Unit Tests # Phase4: PR to build the new module *spark-jdbc* in folder *sql/jdbc* # Phase5: PR with all *spark-jdbc* module code and Unit Tests # Phase6: PR to support thrift server impersonation # Phase7: PR to build Spark's own beeline client *spark-beeline* # Phase8: PR with Spark's own CLI client code to support the *Spark SQL CLI* module named *spark-cli* h3. Appendix A. Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. Compared to the current `sql/hive-thriftserver`, the corresponding API changes are as below: # Add a new
[jira] [Updated] (SPARK-29018) Build Spark thrift server on its own code based on protocol v11
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: SPIP:Build Spark thrift server based on thrift protocol v11 h2. Background With the development of Spark and Hive,in current sql/hive-thriftserver module, we need to do a lot of work to solve code conflicts for different built-in hive versions. It's an annoying and unending work in current ways. And these issues have limited our ability and convenience to develop new features for Spark’s thrift server. We suppose to implement a new thrift server and JDBC driver based on Hive’s latest v11 TCLService.thrift thrift protocol. Finally, the new thrift server have below feature: # Build new module spark-service as spark’s thrift server # Don't need as much reflection and inherited code as `hive-thriftser` modules # Support all functions current `sql/hive-thriftserver` support # Use all code maintained by spark itself, won’t depend on Hive # Support origin functions use spark’s own way, won't limited by Hive's code # Support running without hive metastore or with hive metastore # Support user impersonation by Multi-tenant splited hive authentication and DFS authentication # Support session hook for with spark’s own code # Add a new jdbc driver spark-jdbc, with spark’s own connection url “jdbc:spark::/” # Support both hive-jdbc and spark-jdbc client, then we can support most clients and BI platform h2. How to start? We can start this new thrift server by shell *sbin/start-spark-thriftserver.sh* and stop it by *sbin/stop-spark-thriftserver.sh*. Don’t need HiveConf ’s configurations to determine the characteristics of the current spark thrift server service, we have implemented all need configuration by spark itself in `org.apache.spark.sql.service.internal.ServiceConf`, hive-site.xml only used to connect to hive metastore. 
We can write all we needed conf in *conf/spark-default.conf* or in startup command *--conf* h2. How to connect through jdbc? Now we support both hive-jdbc and spark-jdbc, user can choose which one he likes h3. spark-jdbc # use `SparkDriver` as jdbc driver class # Connection url `jdbc:spark://:,:/dbName;sess_var_list?conf_list#var_list` most samse as hive but with spark’s special url prefix `jdbc:spark` # For proxy, use SparkDriver should set proxy conf `spark.sql.thriftserver.proxy.user=username` h3. hive-jdbc # use `HiveDriver` as jdbc driver class # connection str jdbc:hive2://:,:/dbName;sess_var_list?conf_list#var_list as origin # For proxy, use HiveDriver should set proxy conf hive.server2.proxy.user=username, current server support both config h2. How is it done today, and what are the limits of current practice? h3. Current practice We have completed two modules `spark-service` & `spark-jdbc` now, it can run well and we have add origin UT to it these two module and it can pass the UT, for impersonation, we have write the code and test it in our kerberized environment, it can work well and wait for review. Now we will raise pr to apace/spark master branch step by step. h3. Here are some known changes: # Don’t use any hive code in `spark-service` `spark-jdbc` module # In current service, default rcfile suffix `.hiverc` was replaced by `.sparkrc` # When use SparkDriver as jdbc driver class, url should use jdbc:spark://:,:/dbName;sess_var_list?conf_list#var_list # When use SparkDriver as jdbc driver class, proxy conf should be `spark.sql.thriftserver.proxy.user=proxy_user_name` # Support `hiveconf` `hivevar` session conf through hive-jdbc connection # h2. What are the risks? Totally new module, won’t change other module’s code except for supporting impersonation. Except impersonation, we have added a lot of UT changed (fit grammar without hive) from origin UT, and all pass it. 
For impersonation I have test it in our kerberized environment but still need detail review since change a lot. h2. How long will it take? We have done all these works in our own repo, now we plan merge our code into the master step by step. # Phase1 pr about build new module *spark-service* on folder *sql/service* # Phase2 pr thrift protocol and generated thrift protocol java code # Phase3 pr with all *spark-service* module code and description about design, also Unnit Test # Phase4 pr about build new module *spark-jdbc* on folder *sql/jdbc* # Phase5 pr with all *spark-jdbc* module code and Unit Tests # Phase6 pr about support thriftserver Impersonation # Phase7 pr about build spark's own beeline client *spark-beeline* # Phase8 pr about spark's own CLI client code to support *Spark SQL CLI* module named *spark-cli* h3. Appendix A. Proposed API Changes. Optional section defining APIs changes, if any. Backward and forward compatibility must be taken into account. Compared to
[jira] [Updated] (SPARK-29018) Build Spark thrift server on its own code based on protocol v11
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: With the development of Spark and Hive,in current sql/hive-thriftserver module, we need to do a lot of work to solve code conflicts for different built-in hive versions. It's an annoying and unending work in current ways. And these issues have limited our ability and convenience to develop new features for Spark’s thrift server. We suppose to implement a new thrift server and JDBC driver based on Hive’s latest v11 TCLService.thrift thrift protocol. Finally, the new thrift server have below feature: # Build new module spark-service as spark’s thrift server # Don't need as much reflection and inherited code as `hive-thriftser` modules # Support all functions current `sql/hive-thriftserver` support # Use all code maintained by spark itself, won’t depend on Hive # Support origin functions use spark’s own way, won't limited by Hive's code # Support running without hive metastore or with hive metastore # Support user impersonation by Multi-tenant splited hive authentication and DFS authentication # Support session hook for with spark’s own code # Add a new jdbc driver spark-jdbc, with spark’s own connection url “jdbc:spark::/” # Support both hive-jdbc and spark-jdbc client, then we can support most clients and BI platform was: With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . 
Finally, the new thrift server have below feature: # support all functions current `hive-thriftserver` support # use all code maintained by spark itself # realize origin function fit to spark’s own feature, won't limited by hive's code # support running without hive metastore or with hive metastore # support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication # add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support # support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. # *phase1* pr about build new module `spark-service` on folder `sql/service` 2. *phase2* pr thrift protocol and generated thrift protocol java code 3. *phase3* pr with all `spark-service` module code and description about design, also UT 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. *phase5* pr with all `spark-jdbc` module code and UT 6. *phase6* pr about support thriftserver Impersonation 7. *phase7* pr about build spark's own beeline client `spark-beeline` 8. *phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` Google Doc : [https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#heading=h.im79wwkzycsr] > Build spark thrift server on it's own code based on protocol v11 > > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > With the development of Spark and Hive,in current sql/hive-thriftserver > module, we need to do a lot of work to solve code conflicts for different > built-in hive versions. It's an annoying and unending work in current ways. 
> And these issues have limited our ability and convenience to develop new > features for Spark’s thrift server. > We suppose to implement a new thrift server and JDBC driver based on > Hive’s latest v11 TCLService.thrift thrift protocol. Finally, the new thrift > server have below feature: > # Build new module spark-service as spark’s thrift server > # Don't need as much reflection and inherited code as `hive-thriftser` > modules > # Support all functions current `sql/hive-thriftserver` support > # Use all code maintained by spark itself, won’t depend on Hive > # Support origin functions use spark’s own way, won't limited by Hive's code > # Support running without hive metastore or with hive metastore > # Support user impersonation by Multi-tenant splited hive authentication and > DFS authentication > # Support session hook for with spark’s own code > # Add a new jdbc driver spark-jdbc, with spark’s own connection url >
[jira] [Updated] (SPARK-29018) Build Spark thrift server on its own code based on protocol v11
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Summary: Build spark thrift server on it's own code based on protocol v11 (was: Spark ThriftServer change to it's own API) > Build spark thrift server on it's own code based on protocol v11 > > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > With the development of Spark and Hive,in current `sql/hive-thriftserver`, we > need to do a lot of work to solve code conflicts between different hive > versions. It's an annoying and unending work in current ways. And these > issues are troubling us when we develop new features for the > SparkThriftServer2. We suppose to implement a new thrift server based on > latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's > own code to get rid of hive code . > Finally, the new thrift server have below feature: > # support all functions current `hive-thriftserver` support > # use all code maintained by spark itself > # realize origin function fit to spark’s own feature, won't limited by > hive's code > # support running without hive metastore or with hive metastore > # support user impersonation by Multi-tenant authority separation, hive > authentication and DFS authentication > # add a new module `spark-jdbc`, with connection url > `jdbc:spark::/`, all `hive-jdbc` support we will all support > # support both `hive-jdbc` and `spark-jdbc` client for compatibility with > most clients > We have done all these works in our repo, now we plan merge our code into the > master step by step. > # *phase1* pr about build new module `spark-service` on folder `sql/service` > 2. *phase2* pr thrift protocol and generated thrift protocol java code > 3. 
*phase3* pr with all `spark-service` module code and description about > design, also UT > 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` > 5. *phase5* pr with all `spark-jdbc` module code and UT > 6. *phase6* pr about support thriftserver Impersonation > 7. *phase7* pr about build spark's own beeline client `spark-beeline` > 8. *phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` > > > Google Doc : > [https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#heading=h.im79wwkzycsr]
[jira] [Created] (SPARK-30293) HiveThriftServer2 will remove the wrong statement
angerszhu created SPARK-30293: - Summary: HiveThriftServer2 will remove the wrong statement Key: SPARK-30293 URL: https://issues.apache.org/jira/browse/SPARK-30293 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu HiveThriftServer2 will remove the wrong statement
[jira] [Updated] (SPARK-29018) Spark ThriftServer change to its own API
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: With the development of Spark and Hive, in the current `sql/hive-thriftserver` module we have to do a lot of work to resolve code conflicts between different Hive versions. It is annoying, unending work under the current approach, and these issues trouble us whenever we develop new features for the Spark Thrift Server. We propose to implement a new thrift server based on the latest v11 `TCLService.thrift` protocol, implementing all APIs in Spark's own code to get rid of the Hive code. The new thrift server will have the following features: # support all functions the current `hive-thriftserver` supports # use code maintained entirely by Spark itself # adapt the original functionality to Spark's own features, no longer limited by Hive's code # support running with or without a Hive metastore # support user impersonation through multi-tenant authority separation, Hive authentication, and DFS authentication # add a new module `spark-jdbc` with connection url `jdbc:spark::/`; everything `hive-jdbc` supports, we will support as well # support both `hive-jdbc` and `spark-jdbc` clients for compatibility with most clients We have done all this work in our repo, and now we plan to merge our code into master step by step. # *phase1* PR to build the new module `spark-service` under `sql/service` 2. *phase2* PR with the thrift protocol and the generated thrift protocol Java code 3. *phase3* PR with all `spark-service` module code, a description of the design, and UTs 4. *phase4* PR to build the new module `spark-jdbc` under `sql/jdbc` 5. *phase5* PR with all `spark-jdbc` module code and UTs 6. *phase6* PR to support thrift server impersonation 7. *phase7* PR to build Spark's own beeline client `spark-beeline` 8. 
*phase8* PR to build Spark's own CLI client (`Spark SQL CLI`) `spark-cli` Google Doc: [https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#heading=h.im79wwkzycsr] was: With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: # support all functions current `hive-thriftserver` support # use all code maintained by spark itself # realize origin function fit to spark’s own feature, won't limited by hive's code # support running without hive metastore or with hive metastore # support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication # add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support # support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. *phase1* pr about build new module `spark-service` on folder `sql/service` 2. *phase2* pr thrift protocol and generated thrift protocol java code 3. *phase3* pr with all `spark-service` module code and description about design, also UT 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. *phase5* pr with all `spark-jdbc` module code and UT 6. *phase6* pr about support thriftserver Impersonation 7. *phase7* pr about build spark's own beeline client `spark-beeline` 8. 
*phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` > Spark ThriftServer change to it's own API > - > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > With the development of Spark and Hive,in current `sql/hive-thriftserver`, we > need to do a lot of work to solve code conflicts between different hive > versions. It's an annoying and unending work in current ways. And these > issues are troubling us when we develop new features for the > SparkThriftServer2. We suppose to implement a new thrift server based on > latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's > own code to get rid of hive code . > Finally, the new thrift server have below feature: > # support all functions current `hive-thriftserver` support > # use all code maintained by spark itself > # realize origin function fit to spark’s own feature, won't
[jira] [Updated] (SPARK-30287) Add new module spark-service as thrift server module
[ https://issues.apache.org/jira/browse/SPARK-30287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-30287: -- Description: Add a new module of `sql/service` as spark thrift server module (was: Add a new module of thriftserver) > Add new module spark-service as thrift server module > > > Key: SPARK-30287 > URL: https://issues.apache.org/jira/browse/SPARK-30287 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > Add a new module of `sql/service` as spark thrift server module -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29018) Spark ThriftServer change to its own API
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: # support all functions current `hive-thriftserver` support # use all code maintained by spark itself # realize origin function fit to spark’s own feature, won't limited by hive's code # support running without hive metastore or with hive metastore # support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication # add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support # support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. *phase1* pr about build new module `spark-service` on folder `sql/service` 2. *phase2* pr thrift protocol and generated thrift protocol java code 3. *phase3* pr with all `spark-service` module code and description about design, also UT 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. *phase5* pr with all `spark-jdbc` module code and UT 6. *phase6* pr about support thriftserver Impersonation 7. *phase7* pr about build spark's own beeline client `spark-beeline` 8. 
*phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` was: With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: #. support all functions current `hive-thriftserver` support #. use all code maintained by spark itself 3. realize origin function fit to spark’s own feature, won't limited by hive's code 4. support running without hive metastore or with hive metastore 5. support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication 6. add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support 7. support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. *phase1* pr about build new module `spark-service` on folder `sql/service` 2. *phase2* pr thrift protocol and generated thrift protocol java code 3. *phase3* pr with all `spark-service` module code and description about design, also UT 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. *phase5* pr with all `spark-jdbc` module code and UT 6. *phase6* pr about support thriftserver Impersonation 7. *phase7* pr about build spark's own beeline client `spark-beeline` 8. 
*phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` > Spark ThriftServer change to it's own API > - > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > With the development of Spark and Hive,in current `sql/hive-thriftserver`, we > need to do a lot of work to solve code conflicts between different hive > versions. It's an annoying and unending work in current ways. And these > issues are troubling us when we develop new features for the > SparkThriftServer2. We suppose to implement a new thrift server based on > latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's > own code to get rid of hive code . > Finally, the new thrift server have below feature: > # support all functions current `hive-thriftserver` support > # use all code maintained by spark itself > # realize origin function fit to spark’s own feature, won't limited by hive's > code > # support running without hive metastore or with hive metastore > # support user
[jira] [Updated] (SPARK-29018) Spark ThriftServer change to its own API
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: #. support all functions current `hive-thriftserver` support #. use all code maintained by spark itself 3. realize origin function fit to spark’s own feature, won't limited by hive's code 4. support running without hive metastore or with hive metastore 5. support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication 6. add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support 7. support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. *phase1* pr about build new module `spark-service` on folder `sql/service` 2. *phase2* pr thrift protocol and generated thrift protocol java code 3. *phase3* pr with all `spark-service` module code and description about design, also UT 4. *phase4* pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. *phase5* pr with all `spark-jdbc` module code and UT 6. *phase6* pr about support thriftserver Impersonation 7. *phase7* pr about build spark's own beeline client `spark-beeline` 8. *phase8* pr about spark's own Cli client `Spark SQL CLI` `spark-cli` was: ### What changes were proposed in this pull request? 
With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: 1. support all functions current `hive-thriftserver` support 2. use all code maintained by spark itself 3. realize origin function fit to spark’s own feature, won't limited by hive's code 4. support running without hive metastore or with hive metastore 5. support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication 6. add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support 7. support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. **phase1** pr about build new module `spark-service` on folder `sql/service` 2. **phase2** pr thrift protocol and generated thrift protocol java code 3. **phase3** pr with all `spark-service` module code and description about design, also UT 4. **phase4** pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. **phase5** pr with all `spark-jdbc` module code and UT 6. **phase6** pr about support thriftserver Impersonation 7. **phase7** pr about build spark's own beeline client `spark-beeline` 8. **phase8** pr about spark's own Cli client `Spark SQL CLI` `spark-cli` ### Why are the changes needed? Build a totally new thrift server base on spark's own code and feature.Don't rely on hive code anymore ### Does this PR introduce any user-facing change? ### How was this patch tested? 
Not need UT now > Spark ThriftServer change to it's own API > - > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > With the development of Spark and Hive,in current `sql/hive-thriftserver`, we > need to do a lot of work to solve code conflicts between different hive > versions. It's an annoying and unending work in current ways. And these > issues are troubling us when we develop new features for the > SparkThriftServer2. We suppose to implement a new thrift server based on > latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's > own code to get rid of hive code . > Finally, the
[jira] [Updated] (SPARK-29018) Spark ThriftServer change to its own API
[ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29018: -- Description: ### What changes were proposed in this pull request? With the development of Spark and Hive,in current `sql/hive-thriftserver`, we need to do a lot of work to solve code conflicts between different hive versions. It's an annoying and unending work in current ways. And these issues are troubling us when we develop new features for the SparkThriftServer2. We suppose to implement a new thrift server based on latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's own code to get rid of hive code . Finally, the new thrift server have below feature: 1. support all functions current `hive-thriftserver` support 2. use all code maintained by spark itself 3. realize origin function fit to spark’s own feature, won't limited by hive's code 4. support running without hive metastore or with hive metastore 5. support user impersonation by Multi-tenant authority separation, hive authentication and DFS authentication 6. add a new module `spark-jdbc`, with connection url `jdbc:spark::/`, all `hive-jdbc` support we will all support 7. support both `hive-jdbc` and `spark-jdbc` client for compatibility with most clients We have done all these works in our repo, now we plan merge our code into the master step by step. 1. **phase1** pr about build new module `spark-service` on folder `sql/service` 2. **phase2** pr thrift protocol and generated thrift protocol java code 3. **phase3** pr with all `spark-service` module code and description about design, also UT 4. **phase4** pr about build new module `spark-jdbc` on folder `sql/jdbc` 5. **phase5** pr with all `spark-jdbc` module code and UT 6. **phase6** pr about support thriftserver Impersonation 7. **phase7** pr about build spark's own beeline client `spark-beeline` 8. 
**phase8** pr about spark's own Cli client `Spark SQL CLI` `spark-cli` ### Why are the changes needed? Build a totally new thrift server base on spark's own code and feature.Don't rely on hive code anymore ### Does this PR introduce any user-facing change? ### How was this patch tested? Not need UT now was: Current SparkThriftServer rely on HiveServer2 too much, when Hive version changed, we should change a lot to fit for Hive code change. We would best just use Hive's thrift interface to implement it 's own API for SparkThriftServer. And remove unused code logical [for Spark Thrift Server]. > Spark ThriftServer change to it's own API > - > > Key: SPARK-29018 > URL: https://issues.apache.org/jira/browse/SPARK-29018 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > ### What changes were proposed in this pull request? > With the development of Spark and Hive,in current `sql/hive-thriftserver`, we > need to do a lot of work to solve code conflicts between different hive > versions. It's an annoying and unending work in current ways. And these > issues are troubling us when we develop new features for the > SparkThriftServer2. We suppose to implement a new thrift server based on > latest v11 `TCLService.thrift` thrift protocol. Implement all API in spark's > own code to get rid of hive code . > Finally, the new thrift server have below feature: > 1. support all functions current `hive-thriftserver` support > 2. use all code maintained by spark itself > 3. realize origin function fit to spark’s own feature, won't limited by > hive's code > 4. support running without hive metastore or with hive metastore > 5. support user impersonation by Multi-tenant authority separation, hive > authentication and DFS authentication > 6. add a new module `spark-jdbc`, with connection url > `jdbc:spark::/`, all `hive-jdbc` support we will all support > 7. 
support both `hive-jdbc` and `spark-jdbc` client for compatibility with > most clients > We have done all these works in our repo, now we plan merge our code into > the master step by step. > 1. **phase1** pr about build new module `spark-service` on folder > `sql/service` > 2. **phase2** pr thrift protocol and generated thrift protocol java code > 3. **phase3** pr with all `spark-service` module code and description about > design, also UT > 4. **phase4** pr about build new module `spark-jdbc` on folder `sql/jdbc` > 5. **phase5** pr with all `spark-jdbc` module code and UT > 6. **phase6** pr about support thriftserver Impersonation > 7. **phase7** pr about build spark's own beeline client `spark-beeline` > 8. **phase8** pr about spark's own Cli client `Spark SQL CLI` `spark-cli` > ### Why are the changes needed? > Build a totally new thrift server base on
[jira] [Created] (SPARK-30287) Add new module spark-service as thrift server module
angerszhu created SPARK-30287: - Summary: Add new module spark-service as thrift server module Key: SPARK-30287 URL: https://issues.apache.org/jira/browse/SPARK-30287 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu Add a new module for the thrift server -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30271) dynamic allocation won't release some executors in some cases
angerszhu created SPARK-30271: - Summary: dynamic allocation won't release some executors in some cases Key: SPARK-30271 URL: https://issues.apache.org/jira/browse/SPARK-30271 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.4.0 Reporter: angerszhu Case: max executors 5, min executors 0, idle time 5s. Stage-1 runs 10 tasks on 5 executors. When stage-1 finishes on all 5 executors, every executor is added to `removeTimes` on its task-end event. After 5s the release process starts; since stage-2 has 20 tasks, the executors are not removed (the existing executor count is below the target number), but they are still dropped from `removeTimes`. However, if tasks are never scheduled onto one of these executors (say executor-1 never gets another task to run), it is never put back into `removeTimes`, and once there are no more tasks that executor is never removed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
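The bookkeeping described above can be reduced to a toy model. This is not Spark's actual `ExecutorAllocationManager` code; `IdleTracker`, its fields, and the removal policy are simplified, hypothetical stand-ins used only to illustrate how an executor can fall out of `removeTimes` and never return:

```python
class IdleTracker:
    """Toy model of the removeTimes bookkeeping (hypothetical, simplified)."""

    def __init__(self, idle_timeout_s=5):
        self.idle_timeout_s = idle_timeout_s
        self.remove_times = {}   # executor id -> timestamp when it may be removed
        self.executors = set()   # currently allocated executors

    def on_task_end(self, executor_id, now):
        # Executor becomes idle: schedule it for removal after the idle timeout.
        self.remove_times[executor_id] = now + self.idle_timeout_s

    def on_task_start(self, executor_id):
        # Executor is busy again: cancel its pending removal.
        self.remove_times.pop(executor_id, None)

    def maybe_remove(self, now, target_num):
        # Reported behavior: an expired entry is dropped from remove_times even
        # when the executor is kept because the count is at or below the target.
        # If no task is ever scheduled on that executor again, nothing re-adds
        # it, so it is never released.
        for executor_id in [e for e, t in self.remove_times.items() if t <= now]:
            del self.remove_times[executor_id]
            if len(self.executors) > target_num:
                self.executors.discard(executor_id)
```

Walking through the reported scenario: five idle executors expire while stage-2's target is high, so none is removed but `remove_times` is emptied; an executor that never receives another task then survives even after the target drops to zero.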
[jira] [Commented] (SPARK-16180) Task hang on fetching blocks (cached RDD)
[ https://issues.apache.org/jira/browse/SPARK-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997310#comment-16997310 ] angerszhu commented on SPARK-16180: --- i meet this problem recently in spark-2.4 > Task hang on fetching blocks (cached RDD) > - > > Key: SPARK-16180 > URL: https://issues.apache.org/jira/browse/SPARK-16180 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.1 >Reporter: Davies Liu >Priority: Major > Labels: bulk-closed > > Here is the stackdump of executor: > {code} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:107) > org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102) > org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588) > org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585) > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585) > 
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570) > org.apache.spark.storage.BlockManager.get(BlockManager.scala:630) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44) > org.apache.spark.rdd.RDD.iterator(RDD.scala:268) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46) > org.apache.spark.scheduler.Task.run(Task.scala:96) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
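The stack dump above is parked in an unbounded `Await.result` underneath `BlockTransferService.fetchBlockSync`, i.e. a synchronous wrapper blocking on an asynchronous fetch. A minimal sketch of that pattern, with a hypothetical `timeout_s` knob showing how a bounded wait would turn the silent hang into a visible error (illustrative only, not Spark's API):

```python
from concurrent.futures import Future, TimeoutError

def fetch_block_sync(start_fetch, timeout_s=None):
    """Synchronously wait for an async block fetch (hypothetical sketch).

    start_fetch(fut) kicks off the fetch and is expected to complete `fut`
    later. With timeout_s=None this mirrors an unbounded Await.result: if the
    remote side never replies, the caller hangs forever, as in the stack dump.
    """
    fut = Future()
    start_fetch(fut)
    # A finite timeout raises TimeoutError instead of hanging the task thread.
    return fut.result(timeout=timeout_s)
```

A fetch that never completes then surfaces as a `TimeoutError` the task can retry or report, rather than a thread stuck in `LockSupport.park` forever.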
[jira] [Created] (SPARK-30237) Move `sql()` method from DataType to AbstractType
angerszhu created SPARK-30237: - Summary: Move `sql()` method from DataType to AbstractType Key: SPARK-30237 URL: https://issues.apache.org/jira/browse/SPARK-30237 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu Move `sql()` method from DataType to AbstractType -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30223) queries in thrift server may read wrong SQL configs
[ https://issues.apache.org/jira/browse/SPARK-30223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994061#comment-16994061 ] angerszhu edited comment on SPARK-30223 at 12/12/19 1:49 AM: - [~cloud_fan] Add ``` SparkSession.setActiveSession(sqlContext.sparkSession) ``` in each Metadata Operation? Maybe we should sort out the use of `SQLConf.get()` like https://github.com/apache/spark/pull/26187 was (Author: angerszhuuu): [~cloud_fan] Add ``` SparkSession.setActiveSession(sqlContext.sparkSession) ``` in each Metadata Operation? > queries in thrift server may read wrong SQL configs > --- > > Key: SPARK-30223 > URL: https://issues.apache.org/jira/browse/SPARK-30223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The Spark thrift server creates many SparkSessions to serve requests, and the > thrift server serves requests using a single thread. One thread can only have > one active SparkSession, so SQLCong.get can't get the proper conf from the > session that runs the query. > Whenever we issue an action on a SparkSession, we should set this session as > active session, e.g. `SparkSession.sql`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30223) queries in thrift server may read wrong SQL configs
[ https://issues.apache.org/jira/browse/SPARK-30223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994061#comment-16994061 ] angerszhu edited comment on SPARK-30223 at 12/12/19 1:49 AM: - [~cloud_fan] [~yumwang] Add ``` SparkSession.setActiveSession(sqlContext.sparkSession) ``` in each Metadata Operation? Maybe we should sort out the use of `SQLConf.get()` like https://github.com/apache/spark/pull/26187 was (Author: angerszhuuu): [~cloud_fan] Add ``` SparkSession.setActiveSession(sqlContext.sparkSession) ``` in each Metadata Operation? Maybe we should sort out the use of `SQLConf.get()` like https://github.com/apache/spark/pull/26187 > queries in thrift server may read wrong SQL configs > --- > > Key: SPARK-30223 > URL: https://issues.apache.org/jira/browse/SPARK-30223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The Spark thrift server creates many SparkSessions to serve requests, and the > thrift server serves requests using a single thread. One thread can only have > one active SparkSession, so SQLCong.get can't get the proper conf from the > session that runs the query. > Whenever we issue an action on a SparkSession, we should set this session as > active session, e.g. `SparkSession.sql`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30223) queries in thrift server may read wrong SQL configs
[ https://issues.apache.org/jira/browse/SPARK-30223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994061#comment-16994061 ] angerszhu commented on SPARK-30223: --- [~cloud_fan] Add ``` SparkSession.setActiveSession(sqlContext.sparkSession) ``` in each Metadata Operation? > queries in thrift server may read wrong SQL configs > --- > > Key: SPARK-30223 > URL: https://issues.apache.org/jira/browse/SPARK-30223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The Spark thrift server creates many SparkSessions to serve requests, and the > thrift server serves requests using a single thread. One thread can only have > one active SparkSession, so SQLCong.get can't get the proper conf from the > session that runs the query. > Whenever we issue an action on a SparkSession, we should set this session as > active session, e.g. `SparkSession.sql`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
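The underlying problem is that `SQLConf.get` resolves configs through the thread's *active* session, while one thrift server thread serves many sessions. A few lines are enough to sketch it; `SparkSessionStub`, `set_active_session`, and `sql_conf_get` are hypothetical stand-ins for `SparkSession`, `SparkSession.setActiveSession`, and `SQLConf.get`:

```python
import threading

_active = threading.local()

class SparkSessionStub:
    """Hypothetical stand-in for SparkSession; conf is just a dict."""
    def __init__(self, conf):
        self.conf = conf

def set_active_session(session):
    # What setActiveSession does: bind a session to the current thread.
    _active.session = session

def sql_conf_get(key, default=None):
    # Sketch of SQLConf.get: it reads from the thread's active session,
    # not from the session the query logically belongs to.
    session = getattr(_active, "session", None)
    return session.conf.get(key, default) if session is not None else default
```

If the server runs session B's metadata operation on a thread whose active session is still A, the lookup silently returns A's value; calling `set_active_session(b)` first, as the comment suggests, is what makes the lookup land in the right session.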
[jira] [Commented] (SPARK-20135) spark thriftserver2: no job running but containers not release on yarn
[ https://issues.apache.org/jira/browse/SPARK-20135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992644#comment-16992644 ] angerszhu commented on SPARK-20135: --- I met the same problem in Spark 2.4.0. [~xwc3504] Do you have any ideas now? > spark thriftserver2: no job running but containers not release on yarn > -- > > Key: SPARK-20135 > URL: https://issues.apache.org/jira/browse/SPARK-20135 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: spark 2.0.1 with hadoop 2.6.0 >Reporter: bruce xu >Priority: Major > Attachments: 0329-1.png, 0329-2.png, 0329-3.png > > > I enabled the executor dynamic allocation feature; however, it sometimes > doesn't work. > I set the initial executor num to 50; after the job finished, the cores and > memory resources were not released. > From the Spark web UI, the active job/running task/stage count is 0, but the > executors page shows 1276 cores and 7288 active tasks. > From the YARN web UI, the thrift server job's running container count is 639, > with nothing released. > This may be a bug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29998) A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails
[ https://issues.apache.org/jira/browse/SPARK-29998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980128#comment-16980128 ] angerszhu commented on SPARK-29998: --- [~cloud_fan] No, it's not the same problem, but both are caused by a broken disk. In my case it happened when the task started and fetched the broadcast value; his case happened when shuffling data. His PR can't fix this, but it seems we can fix this problem in a similar way: make it retry and choose a working folder. > A corrupted hard disk causes the task to execute repeatedly on a machine > until the job fails > > > Key: SPARK-29998 > URL: https://issues.apache.org/jira/browse/SPARK-29998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > > Recently I met the following situation: > one NodeManager's disk was broken. When a task began to run, it fetched the JobConf > via broadcast; the executor's BlockManager failed to create a local file and threw an > IOException. > {code} > 19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: > "ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due > to Job aborted due to stage failure: Task 21 in st > age 343.0 failed 4 times, most recent failure: Lost task 21.3 in stage 343.0 > (TID 34968, hostname, executor 104): java.io.IOException: Failed to create > local dir in /disk > 11/yarn/local/usercache/username/appcache/application_1573542949548_2889852/blockmgr-a70777d8-5159-48e7-a47e-848df01a831e/3b. 
> at > org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70) > at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:129) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:605) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:228) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
[jira] [Commented] (SPARK-29998) A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails
[ https://issues.apache.org/jira/browse/SPARK-29998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980024#comment-16980024 ] angerszhu commented on SPARK-29998: --- [~yumwang][~dongjoon] [~cloud_fan] For this problem, I think we have two ways to fix it: # If this situation happens, fail immediately # Define a special exception and handle it like `FetchFailed` WDYT? > A corrupted hard disk causes the task to execute repeatedly on a machine > until the job fails > > > Key: SPARK-29998 > URL: https://issues.apache.org/jira/browse/SPARK-29998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major >
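The second option in the comment above (a special exception handled like `FetchFailed`) can be sketched in plain Java. The class name, exception type, and policy strings here are invented for illustration; this is not Spark's scheduler code:

```java
// Hypothetical sketch: wrap local-dir creation failures in a dedicated
// exception type so the scheduler can recognize them and, like FetchFailed,
// reschedule the task elsewhere instead of counting the failure against the
// per-task limit on the same broken executor.
import java.io.IOException;

public class FailureClassifier {
    // Invented exception type for the sketch.
    public static class LocalDirCreationFailedException extends IOException {
        public LocalDirCreationFailedException(String msg) { super(msg); }
    }

    // Invented policy names; in real Spark this decision would live in the
    // scheduler (e.g. TaskSetManager's failure handling).
    public static String classify(Throwable t) {
        if (t instanceof LocalDirCreationFailedException) {
            return "resubmit-on-other-executor"; // FetchFailed-style handling
        }
        return "count-task-failure"; // normal path: counts toward the limit
    }
}
```

The point of the design: `FetchFailed`-style handling resubmits work elsewhere rather than incrementing the per-task failure count, so one broken disk cannot burn through all retry attempts by itself.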
[jira] [Updated] (SPARK-29998) A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails
[ https://issues.apache.org/jira/browse/SPARK-29998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29998: -- Description: Recently I met the following situation: one NodeManager's disk was broken. When the task began to run, it fetched the JobConf via broadcast, and the executor's BlockManager failed to create a local file and threw an IOException. {code} 19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: "ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due to Job aborted due to stage failure: Task 21 in st age 343.0 failed 4 times, most recent failure: Lost task 21.3 in stage 343.0 (TID 34968, hostname, executor 104): java.io.IOException: Failed to create local dir in /disk 11/yarn/local/usercache/username/appcache/application_1573542949548_2889852/blockmgr-a70777d8-5159-48e7-a47e-848df01a831e/3b. at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70) at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:129) at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:605) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144) at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:228) at 
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} Because of how TaskSetManager.handleFailedTask() treats this kind of failure reason, the task keeps retrying on the same executor until `failedTime > maxTaskFailTime`; then the stage fails, and the whole job fails. was: Recently, I meet one situation: One NodeManager's disk is broken. when task begin to run, it will get jobConf by broadcast, executor's BlockManager failed to create file. and throw IOException. ``` 19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: "ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due
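The failure mode described in this report can be shown with a tiny simulation. All names here are invented for the sketch (this is not `TaskSetManager`'s real code): every retry of the task lands on the same executor with the broken disk, so the failure count for that one task reaches the limit and the whole job aborts.

```java
// Minimal simulation of the retry loop: a task that always fails on the
// broken-disk executor is retried up to the per-task failure limit (4 by
// default, matching "failed 4 times" in the log), then the stage -- and the
// job -- fail.
public class RetrySimulation {
    static final int MAX_TASK_FAILURES = 4; // spark.task.maxFailures default

    // Simulated attempt: on the broken executor, creating a local dir throws,
    // so the attempt fails; on a healthy executor it succeeds.
    static boolean attemptTask(boolean executorDiskBroken) {
        return !executorDiskBroken;
    }

    public static String runTask(boolean alwaysScheduledOnBrokenExecutor) {
        int failures = 0;
        while (failures < MAX_TASK_FAILURES) {
            if (attemptTask(alwaysScheduledOnBrokenExecutor)) return "task succeeded";
            failures++; // handleFailedTask-style bookkeeping per task index
        }
        return "job aborted: task failed " + failures + " times";
    }
}
```

With the default limit of 4 attempts, this reproduces the shape of the failure above: retries never leave the bad machine, so the count is exhausted quickly.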
[jira] [Updated] (SPARK-29998) A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails
[ https://issues.apache.org/jira/browse/SPARK-29998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29998: -- Description: Recently, I meet one situation: One NodeManager's disk is broken. when task begin to run, it will get jobConf by broadcast, executor's BlockManager failed to create file. and throw IOException. ``` 19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: "ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due to Job aborted due to stage failure: Task 21 in st age 343.0 failed 4 times, most recent failure: Lost task 21.3 in stage 343.0 (TID 34968, hostname, executor 104): java.io.IOException: Failed to create local dir in /disk 11/yarn/local/usercache/username/appcache/application_1573542949548_2889852/blockmgr-a70777d8-5159-48e7-a47e-848df01a831e/3b. at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70) at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:129) at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:605) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144) at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:228) at 
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` Because of how TaskSetManager.handleFailedTask() treats this kind of failure reason, the task keeps retrying on the same executor until `failedTime > maxTaskFailTime`; then the stage fails, and the whole job fails. > A corrupted hard disk causes the task to execute repeatedly on a machine > until the job fails > > > Key: SPARK-29998 > URL: https://issues.apache.org/jira/browse/SPARK-29998 > Project: Spark > Issue Type:
[jira] [Created] (SPARK-29998) A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails
angerszhu created SPARK-29998: - Summary: A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails Key: SPARK-29998 URL: https://issues.apache.org/jira/browse/SPARK-29998 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29957) Bump MiniKdc to 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-29957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29957: -- Description: MiniKdc versions lower than the one shipped with Hadoop 3.0 do not work well on JDK 11. New encryption types aes128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 (for Kerberos 5), enabled by default, were added in Java 11, while the version of MiniKdc below 3.0.0 used by Spark does not support these encryption types and does not work well when they are enabled, which results in authentication failure. was: ince MiniKdc version lower than hadoop-3.0 can't work well in jdk11. New encryption types of es128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 (for Kerberos 5) enabled by default were added in Java 11, while version of MiniKdc under 3.0.0 used by Spark does not support these encryption types and does not work well when these encryption types are enabled, which results in the authentication failure. > Bump MiniKdc to 3.2.0 > - > > Key: SPARK-29957 > URL: https://issues.apache.org/jira/browse/SPARK-29957 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > MiniKdc versions lower than the one shipped with Hadoop 3.0 do not work well on JDK 11. > New encryption types aes128-cts-hmac-sha256-128 and > aes256-cts-hmac-sha384-192 (for Kerberos 5), enabled by default, were added in > Java 11, while the version of MiniKdc below 3.0.0 used by Spark does not support > these encryption types and does not work well when they are > enabled, which results in authentication failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29957) Bump MiniKdc to 3.2.0
angerszhu created SPARK-29957: - Summary: Bump MiniKdc to 3.2.0 Key: SPARK-29957 URL: https://issues.apache.org/jira/browse/SPARK-29957 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.0.0 Reporter: angerszhu MiniKdc versions lower than the one shipped with Hadoop 3.0 do not work well on JDK 11. New encryption types aes128-cts-hmac-sha256-128 and aes256-cts-hmac-sha384-192 (for Kerberos 5), enabled by default, were added in Java 11, while the version of MiniKdc below 3.0.0 used by Spark does not support these encryption types and does not work well when they are enabled, which results in authentication failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
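The mismatch described above can be made concrete with a small sketch. The Java 11 set below comes from the report itself; the pre-3.0 MiniKdc set is an assumption for illustration (older SHA-1-based AES types), not an exact list from MiniKdc's source:

```java
import java.util.Set;

public class EnctypeMismatch {
    // The two encryption types Java 11 enables by default that are at issue.
    public static final Set<String> NEW_IN_JAVA_11 = Set.of(
        "aes128-cts-hmac-sha256-128",
        "aes256-cts-hmac-sha384-192");

    // Assumed pre-3.0 MiniKdc support for the sketch (illustrative only).
    public static final Set<String> OLD_MINIKDC = Set.of(
        "aes128-cts-hmac-sha1-96",
        "aes256-cts-hmac-sha1-96");

    // If the client may negotiate a type the KDC does not support,
    // authentication can fail -- the situation the ticket describes.
    public static boolean authFails(Set<String> clientDefaults, Set<String> kdcSupported) {
        return clientDefaults.stream().anyMatch(t -> !kdcSupported.contains(t));
    }
}
```

Bumping MiniKdc to a Hadoop 3.x version makes the KDC side support the new types, so the predicate above no longer fires.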
[jira] [Created] (SPARK-29874) Optimize Dataset.isEmpty()
angerszhu created SPARK-29874: - Summary: Optimize Dataset.isEmpty() Key: SPARK-29874 URL: https://issues.apache.org/jira/browse/SPARK-29874 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
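The ticket has no description, but the optimization commonly proposed for `Dataset.isEmpty()` is to look at no more than one row instead of counting everything. A hedged sketch of the idea on plain Java collections (not Spark's actual implementation):

```java
import java.util.List;

public class IsEmptySketch {
    // Slow idea: count every row, then compare with 0 -- scans the whole input.
    public static boolean isEmptyViaCount(List<Integer> rows) {
        return rows.stream().count() == 0;
    }

    // Optimized idea: ask for at most one row (the "limit(1)" trick) and check
    // whether anything came back -- stops as soon as the first row appears.
    public static boolean isEmptyViaLimitOne(List<Integer> rows) {
        return rows.stream().limit(1).findAny().isEmpty();
    }
}
```

In Spark terms the same trade-off applies: deciding emptiness from a one-row limit avoids materializing or counting the full dataset.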
[jira] [Commented] (SPARK-29800) Plan Exists 's subquery in PlanSubqueries
[ https://issues.apache.org/jira/browse/SPARK-29800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970040#comment-16970040 ] angerszhu commented on SPARK-29800: --- Will raise a PR soon. > Plan Exists 's subquery in PlanSubqueries > - > > Key: SPARK-29800 > URL: https://issues.apache.org/jira/browse/SPARK-29800 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29800) Plan Exists 's subquery in PlanSubqueries
angerszhu created SPARK-29800: - Summary: Plan Exists 's subquery in PlanSubqueries Key: SPARK-29800 URL: https://issues.apache.org/jira/browse/SPARK-29800 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
[ https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29769: -- Description: In the original master, we can't run SQL that uses `EXISTS/NOT EXISTS` in a join's ON condition: {code} create temporary view s1 as select * from values (1), (3), (5), (7), (9) as s1(id); create temporary view s2 as select * from values (1), (3), (4), (6), (9) as s2(id); create temporary view s3 as select * from values (3), (4), (6), (9) as s3(id); explain extended SELECT s1.id, s2.id as id2 FROM s1 LEFT OUTER JOIN s2 ON s1.id = s2.id AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6) we will get == Parsed Logical Plan == 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, (('s1.id = 's2.id) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- 'UnresolvedRelation `s1` +- 'UnresolvedRelation `s2` == Analyzed Logical Plan == org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] == Optimized Logical Plan == org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] == Physical Plan == org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27; 'Project ['s1.id, 's2.id AS id2#4] +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation `s3` :- SubqueryAlias `s1` : +- Project [id#0] : +- SubqueryAlias `s1` :+- LocalRelation [id#0] +- SubqueryAlias `s2` +- Project [id#1] +- SubqueryAlias `s2` +- LocalRelation [id#1] Time taken: 1.455 seconds, Fetched 1 row(s) {code} Since the analyzer does not resolve the join condition's subquery in *Analyzer.ResolveSubquery*, the table *s3* stays unresolved. After PR https://github.com/apache/spark/pull/25854/files, subqueries in the join condition are resolved and the query passes the analyzer. In the current master, if we run the SQL above, we get {code} == Parsed Logical Plan == 'Project ['s1.id, 's2.id AS id2#291] +- 'Join LeftOuter, (('s1.id = 's2.id) AND exists#290 []) : +- 'Project [*] : +- 'Filter ('s3.id > 6) :+- 'UnresolvedRelation [s3] :- 'UnresolvedRelation [s1] +- 'UnresolvedRelation [s2] == Analyzed Logical Plan == id: int, id2: int Project [id#244, id#250 AS id2#291] +- Join LeftOuter, ((id#244 = id#250) AND exists#290 []) : +- Project [id#256] : +- Filter (id#256 > 6) :+- SubqueryAlias `s3` : +- Project [value#253 AS id#256] : +- LocalRelation [value#253] :- SubqueryAlias `s1` : +- Project [value#241 AS id#244] : +- LocalRelation [value#241] +- SubqueryAlias `s2` +- Project [value#247 AS id#250] +- LocalRelation [value#247] == Optimized Logical Plan == Project [id#244, id#250 AS id2#291] +- Join LeftOuter, (exists#290 [] AND (id#244 = id#250)) : +- Project [value#253 AS id#256] : +- Filter (value#253 > 6) :+- LocalRelation [value#253] :- Project [value#241 AS id#244] : +- LocalRelation
[value#241] +- Project [value#247 AS id#250] +- LocalRelation [value#247] == Physical Plan == *(2) Project [id#244, id#250 AS id2#291] +- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 [] : +- Project [value#253 AS id#256] : +- Filter (value#253 > 6) :+- LocalRelation [value#253] :- *(2) Project [value#241 AS id#244] : +- *(2) LocalTableScan [value#241] +-
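To pin down what the example query computes, here is the same query re-implemented on plain Java collections. This illustrates the semantics only, not Spark's execution: the `EXISTS (SELECT * FROM s3 WHERE s3.id > 6)` subquery is uncorrelated, so it reduces to a single boolean guard, true here because s3 contains 9.

```java
import java.util.ArrayList;
import java.util.List;

public class ExistsInJoin {
    // Rows are returned as "id,id2" strings, with NULL for an unmatched right
    // side, mirroring: SELECT s1.id, s2.id AS id2 FROM s1 LEFT OUTER JOIN s2
    // ON s1.id = s2.id AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6)
    public static List<String> query() {
        List<Integer> s1 = List.of(1, 3, 5, 7, 9);
        List<Integer> s2 = List.of(1, 3, 4, 6, 9);
        List<Integer> s3 = List.of(3, 4, 6, 9);

        // Uncorrelated EXISTS: evaluate it once up front.
        boolean existsGuard = s3.stream().anyMatch(id -> id > 6); // true: s3 has 9

        List<String> rows = new ArrayList<>();
        for (int id : s1) {
            // LEFT OUTER JOIN: every s1 row survives; the ON condition
            // (both conjuncts) decides whether the s2 side matches or is NULL.
            boolean matched = existsGuard && s2.contains(id);
            rows.add(id + "," + (matched ? id : "NULL"));
        }
        return rows;
    }
}
```

Because the guard is true, the result is the ordinary left join of s1 with s2; had s3 contained no value above 6, every row would have a NULL right side.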
[jira] [Reopened] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
[ https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu reopened SPARK-29769: --- > Spark SQL cannot handle "exists/not exists" condition when using "JOIN" > --- > > Key: SPARK-29769 > URL: https://issues.apache.org/jira/browse/SPARK-29769 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
[ https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-29769. --- Resolution: Invalid > Spark SQL cannot handle "exists/not exists" condition when using "JOIN" > --- > > Key: SPARK-29769 > URL: https://issues.apache.org/jira/browse/SPARK-29769 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
angerszhu created SPARK-29769: - Summary: Spark SQL cannot handle "exists/not exists" condition when using "JOIN" Key: SPARK-29769 URL: https://issues.apache.org/jira/browse/SPARK-29769 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17398) Failed to query on external JSon Partitioned table
[ https://issues.apache.org/jira/browse/SPARK-17398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968033#comment-16968033 ] angerszhu commented on SPARK-17398: --- [~bianqi] Hi, I met this problem too. Do you know what the cause is? > Failed to query on external JSon Partitioned table > -- > > Key: SPARK-17398 > URL: https://issues.apache.org/jira/browse/SPARK-17398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: pin_zhang >Priority: Major > Fix For: 2.0.1 > > Attachments: screenshot-1.png > > > 1. Create External Json partitioned table > with SerDe in hive-hcatalog-core-1.2.1.jar, download from > https://mvnrepository.com/artifact/org.apache.hive.hcatalog/hive-hcatalog-core/1.2.1 > 2. Query table meet exception, which works in spark1.5.2 > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: > Lost task > 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: > java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord > at > org.apache.hive.hcatalog.data.HCatRecordObjectInspector.getStructFieldData(HCatRecordObjectInspector.java:45) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:430) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > > 3. 
Test Code:
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object JsonBugs {
  def main(args: Array[String]): Unit = {
    val table = "test_json"
    val location = "file:///g:/home/test/json"
    val create = s"""CREATE EXTERNAL TABLE ${table}
      (id string, seq string)
      PARTITIONED BY (index int)
      ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
      LOCATION "${location}"
      """
    val add_part = s"""
      ALTER TABLE ${table} ADD
      PARTITION (index=1) LOCATION '${location}/index=1'
      """
    val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
    conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse")
    val ctx = new SparkContext(conf)
    val hctx = new HiveContext(ctx)
    val exist = hctx.tableNames().map { x => x.toLowerCase() }.contains(table)
    if (!exist) {
      hctx.sql(create)
      hctx.sql(add_part)
    } else {
      hctx.sql("show partitions " + table).show()
    }
    hctx.sql("select * from test_json").show()
  }
}
{code}
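A possible workaround for the `HCatRecord` cast failure above is to bypass the Hive serde entirely and read the same partitioned JSON layout with Spark's native JSON source. This is a sketch only, not part of the issue; it assumes a Spark 2.x+ `SparkSession` named `spark`, and the path comes from the quoted test code:

{code:scala}
// Workaround sketch (assumption, not from the issue): read the partitioned
// JSON directory directly with Spark's built-in JSON reader. With basePath
// set, partition discovery exposes the index=1 directory as an `index` column.
val df = spark.read
  .option("basePath", "file:///g:/home/test/json")
  .json("file:///g:/home/test/json/index=1")
df.show()
{code}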
[jira] [Created] (SPARK-29742) dev/lint-java can't check all code we will use
angerszhu created SPARK-29742: - Summary: dev/lint-java can't check all code we will use Key: SPARK-29742 URL: https://issues.apache.org/jira/browse/SPARK-29742 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: angerszhu `dev/lint-java` can't cover all the code we use
[jira] [Updated] (SPARK-29599) Support pagination for session table in JDBC/ODBC Tab
[ https://issues.apache.org/jira/browse/SPARK-29599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29599: -- Summary: Support pagination for session table in JDBC/ODBC Tab (was: Support pagination for session table in JDBC/ODBC Session page ) > Support pagination for session table in JDBC/ODBC Tab > -- > > Key: SPARK-29599 > URL: https://issues.apache.org/jira/browse/SPARK-29599 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Minor > > Support pagination for session table in JDBC/ODBC Session page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29599) Support pagination for session table in JDBC/ODBC Session page
[ https://issues.apache.org/jira/browse/SPARK-29599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959454#comment-16959454 ] angerszhu commented on SPARK-29599: --- work on this. > Support pagination for session table in JDBC/ODBC Session page > --- > > Key: SPARK-29599 > URL: https://issues.apache.org/jira/browse/SPARK-29599 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Minor > > Support pagination for session table in JDBC/ODBC Session page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29599) Support pagination for session table in JDBC/ODBC Session page
angerszhu created SPARK-29599: - Summary: Support pagination for session table in JDBC/ODBC Session page Key: SPARK-29599 URL: https://issues.apache.org/jira/browse/SPARK-29599 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu Support pagination for session table in JDBC/ODBC Session page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29540) Thrift in some cases can't parse string to date
[ https://issues.apache.org/jira/browse/SPARK-29540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956863#comment-16956863 ] angerszhu commented on SPARK-29540: --- Checking on this. > Thrift in some cases can't parse string to date > --- > > Key: SPARK-29540 > URL: https://issues.apache.org/jira/browse/SPARK-29540 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > I'm porting tests from PostgreSQL window.sql but anything related to casting > a string to datetime seems to fail on Thrift. For instance, the following > does not work:
{code:sql}
CREATE TABLE empsalary (
  depname string,
  empno integer,
  salary int,
  enroll_date date
) USING parquet;

INSERT INTO empsalary VALUES ('develop', 10, 5200, '2007-08-01');
{code}
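A hedged workaround sketch for the cast failure described above: instead of relying on implicit string-to-date coercion, cast the literal explicitly. This assumes a Spark 3.x `SparkSession` named `spark`; the table and values come from the quoted test:

{code:scala}
// Sketch only: make the string-to-date conversion explicit rather than
// depending on the session's implicit coercion rules, which is where the
// reported Thrift failure occurs.
spark.sql("""
  CREATE TABLE IF NOT EXISTS empsalary (
    depname string,
    empno integer,
    salary int,
    enroll_date date
  ) USING parquet
""")

// Explicit CAST('2007-08-01' AS DATE) avoids the implicit coercion path.
spark.sql("""
  INSERT INTO empsalary
  VALUES ('develop', 10, 5200, CAST('2007-08-01' AS DATE))
""")
{code}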
[jira] [Created] (SPARK-29530) SparkSession.sql() method parse process not under current sparksession's conf
angerszhu created SPARK-29530: - Summary: SparkSession.sql() method parse process not under current sparksession's conf Key: SPARK-29530 URL: https://issues.apache.org/jira/browse/SPARK-29530 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: SparkSession.sql() method parse process not under current sparksession's conf Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29492: -- Description: Add UT in HiveThriftBinaryServerSuite:
{code:scala}
test("jar in sync mode") {
  withCLIServiceClient { client =>
    val user = System.getProperty("user.name")
    val sessionHandle = client.openSession(user, "")
    val confOverlay = new java.util.HashMap[java.lang.String, java.lang.String]
    val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath
    Seq(s"ADD JAR $jarFile",
      "CREATE TABLE smallKV(key INT, val STRING)",
      s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE smallKV")
      .foreach(query => client.executeStatement(sessionHandle, query, confOverlay))
    client.executeStatement(sessionHandle,
      """CREATE TABLE addJar(key string)
        |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
      """.stripMargin, confOverlay)
    client.executeStatement(sessionHandle,
      "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1",
      confOverlay)
    val operationHandle = client.executeStatement(
      sessionHandle,
      "SELECT key FROM addJar",
      confOverlay)
    // Fetch result first time
    assertResult(1, "Fetching result first time from next row") {
      val rows_next = client.fetchResults(
        operationHandle,
        FetchOrientation.FETCH_NEXT,
        1000,
        FetchType.QUERY_OUTPUT)
      rows_next.numRows()
    }
  }
}
{code}
Running it produces a ClassNotFound error. was:HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. 
> SparkThriftServer can't support jar class as table serde class when > executestatement in sync mode > -- > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Add UT in HiveThriftBinaryServerSuit: > {code} > test("jar in sync mode") { > withCLIServiceClient { client => > val user = System.getProperty("user.name") > val sessionHandle = client.openSession(user, "") > val confOverlay = new java.util.HashMap[java.lang.String, > java.lang.String] > val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath > Seq(s"ADD JAR $jarFile", > "CREATE TABLE smallKV(key INT, val STRING)", > s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE > smallKV") > .foreach(query => client.executeStatement(sessionHandle, query, > confOverlay)) > client.executeStatement(sessionHandle, > """CREATE TABLE addJar(key string) > |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > """.stripMargin, confOverlay) > client.executeStatement(sessionHandle, > "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1", > confOverlay) > val operationHandle = client.executeStatement( > sessionHandle, > "SELECT key FROM addJar", > confOverlay) > // Fetch result first time > assertResult(1, "Fetching result first time from next row") { > val rows_next = client.fetchResults( > operationHandle, > FetchOrientation.FETCH_NEXT, > 1000, > FetchType.QUERY_OUTPUT) > rows_next.numRows() > } > } > } > {code} > Run it then got ClassNotFound error. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952804#comment-16952804 ] angerszhu edited comment on SPARK-29492 at 10/16/19 1:10 PM: - Will raise a PR soon. Connections via PyHive use sync mode. was (Author: angerszhuuu): raise a pr soon > SparkThriftServer can't support jar class as table serde class when > executestatement in sync mode > -- > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Add UT in HiveThriftBinaryServerSuite: > {code} > test("jar in sync mode") { > withCLIServiceClient { client => > val user = System.getProperty("user.name") > val sessionHandle = client.openSession(user, "") > val confOverlay = new java.util.HashMap[java.lang.String, > java.lang.String] > val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath > Seq(s"ADD JAR $jarFile", > "CREATE TABLE smallKV(key INT, val STRING)", > s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE > smallKV") > .foreach(query => client.executeStatement(sessionHandle, query, > confOverlay)) > client.executeStatement(sessionHandle, > """CREATE TABLE addJar(key string) > |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > """.stripMargin, confOverlay) > client.executeStatement(sessionHandle, > "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1", > confOverlay) > val operationHandle = client.executeStatement( > sessionHandle, > "SELECT key FROM addJar", > confOverlay) > // Fetch result first time > assertResult(1, "Fetching result first time from next row") { > val rows_next = client.fetchResults( > operationHandle, > FetchOrientation.FETCH_NEXT, > 1000, > FetchType.QUERY_OUTPUT) > rows_next.numRows() > } > } > } > {code} > Run it then got ClassNotFound error. 
[jira] [Updated] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29492: -- Summary: SparkThriftServer can't support jar class as table serde class when executestatement in sync mode (was: HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.) > SparkThriftServer can't support jar class as table serde class when > executestatement in sync mode > -- > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29492) HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952804#comment-16952804 ] angerszhu commented on SPARK-29492: --- Will raise a PR soon. > HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. > - > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.
[jira] [Updated] (SPARK-29492) HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29492: -- Description: HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. > HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. > - > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29492) HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table.
angerszhu created SPARK-29492: - Summary: HiveThriftBinaryServerSuite#withCLIServiceClient didn't delete created table. Key: SPARK-29492 URL: https://issues.apache.org/jira/browse/SPARK-29492 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949886#comment-16949886 ] angerszhu commented on SPARK-29295: --- This problem seems to start from Hive 1.2. I tested in our environment on Hive 1.1, which doesn't have this problem. > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > > When we drop a partition of an external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true (the default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give a duplicate > result. > Here is reproduce code (you can add it to SQLQuerySuite in the hive module):
{code:scala}
test("spark gives duplicate result when dropping a partition of an external partitioned table" +
  " first and then overwriting it") {
  withTable("test") {
    withTempDir { f =>
      sql("create external table test(id int) partitioned by (name string) stored as " +
        s"parquet location '${f.getAbsolutePath}'")
      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
        sql("insert overwrite table test partition(name='n1') select 1")
        sql("ALTER TABLE test DROP PARTITION(name='n1')")
        sql("insert overwrite table test partition(name='n1') select 2")
        checkAnswer(sql("select id from test where name = 'n1' order by id"),
          Array(Row(1), Row(2)))
      }
      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) {
        sql("insert overwrite table test partition(name='n1') select 1")
        sql("ALTER TABLE test DROP PARTITION(name='n1')")
        sql("insert overwrite table test partition(name='n1') select 2")
        checkAnswer(sql("select id from test where name = 'n1' order by id"),
          Array(Row(2)))
      }
    }
  }
}
{code}
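Since dropping a partition of an external table removes only the metastore entry and leaves the data files in place, the duplicate rows reported above are the old files resurfacing alongside the new ones. A hedged cleanup sketch using the Hadoop `FileSystem` API follows; it assumes a `SparkSession` named `spark`, and the partition path is a hypothetical placeholder:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// After ALTER TABLE ... DROP PARTITION on an external table, the data files
// remain on disk. Deleting the partition directory explicitly prevents the
// next INSERT OVERWRITE from merging old and new files into a duplicate result.
val partitionDir = new Path("/path/to/table/name=n1") // hypothetical location
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
if (fs.exists(partitionDir)) {
  fs.delete(partitionDir, /* recursive = */ true)
}
{code}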
[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949050#comment-16949050 ] angerszhu commented on SPARK-29354: --- [~Elixir Kook] I downloaded spark-2.4.4-bin-hadoop2.7 from your link and I can find jline-2.14.6.jar in the jars/ folder. > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has direct dependency on jline, included in the root pom.xml > but binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar.
[jira] [Commented] (SPARK-29354) Spark has direct dependency on jline, but binaries for 'without hadoop' don't have a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948410#comment-16948410 ] angerszhu commented on SPARK-29354: --- [~Elixir Kook] [~yumwang] jline is brought in by the hive-beeline module; you can find the dependency in hive-beeline's pom file. If you build like below: {code:java} ./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn -Phadoop-provided -Phive-provided {code} you won't get a jline jar in the dist/jars/ folder. > Spark has direct dependency on jline, but binaries for 'without hadoop' > don't have a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has direct dependency on jline, included in the root pom.xml > but binaries for 'without hadoop' don't have a jline jar file. > > spark 2.2.x has the jline jar.
[jira] [Commented] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948399#comment-16948399 ] angerszhu commented on SPARK-29424: --- [~srowen] Since resource limits are in place, this bad behavior makes the program run very slowly without the user knowing why. Aborting early makes it easier for users to recognize where the problem is, especially for Spark Thrift Server. > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our user always submit bad SQL in query platform, Such as : > # write wrong join condition but submit that sql > # write wrong where condition > # etc.. > This case will make Spark scheduler to submit a lot of task. It will cause > spark run very slow and impact other user(spark thrift server) even run out > of memory because of too many object generated by a big num of tasks. > So I add a constraint when submit tasks and abort stage early when TaskSet > size num is bigger than the set limit. I wonder if the community will accept > this way. > cc [~srowen] [~dongjoon] [~yumwang]
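The constraint proposed above can be sketched as a guard at task-submission time. The config name, default, and surrounding scheduler calls below are hypothetical illustrations of the idea, not the actual patch:

{code:scala}
// Illustration only: abort a stage whose task count exceeds a configured cap,
// so a malformed query fails fast instead of starving the cluster.
// `spark.scheduler.maxTasksPerStage` is a hypothetical config name.
val maxTasks = conf.getInt("spark.scheduler.maxTasksPerStage", Int.MaxValue)
if (taskSet.tasks.length > maxTasks) {
  val msg = s"Stage ${taskSet.stageId} has ${taskSet.tasks.length} tasks, " +
    s"exceeding the limit of $maxTasks; aborting early"
  // Fail the task set before any tasks are launched.
  dagScheduler.taskSetFailed(taskSet, msg, exception = None)
}
{code}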
[jira] [Commented] (SPARK-29409) spark drop partition always throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948368#comment-16948368 ] angerszhu commented on SPARK-29409: --- Thanks, I will check this problem. > spark drop partition always throws Exception > > > Key: SPARK-29409 > URL: https://issues.apache.org/jira/browse/SPARK-29409 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: spark 2.4.0 on yarn 2.7.3 > spark-sql client mode > run hive version: 2.1.1 > hive builtin version 1.2.1 >Reporter: ant_nebula >Priority: Major > > The table is: > {code:java} > CREATE TABLE `test_spark.test_drop_partition`( > `platform` string, > `product` string, > `cnt` bigint) > PARTITIONED BY (dt string) > stored as orc;{code} > hive 2.1.1: > {code:java} > spark-sql -e "alter table test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > hive builtin: > {code:java} > spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf > spark.sql.hive.metastore.jars=builtin -e "alter table > test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > both would log Exception: > {code:java} > 19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current > connections: 1 > 19/10/09 18:21:27 INFO metastore: Connected to metastore. > 19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost > connection. Attempting to reconnect. 
> org.apache.thrift.transport.TTransportException: Cannot write to null > outputStream > at > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142) > at > org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178) > at > org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
[jira] [Commented] (SPARK-29288) Spark SQL add jar can't support HTTP path.
[ https://issues.apache.org/jira/browse/SPARK-29288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948364#comment-16948364 ] angerszhu commented on SPARK-29288: --- [~dongjoon] Sorry for the late reply; the Hive JIRA is https://issues.apache.org/jira/browse/HIVE-9664 > Spark SQL add jar can't support HTTP path. > --- > > Key: SPARK-29288 > URL: https://issues.apache.org/jira/browse/SPARK-29288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > SparkSQL > `ADD JAR` can't support url with http, livy schema , do we need to support it? > cc [~sro...@scient.com] > [~hyukjin.kwon][~dongjoon][~jerryshao][~juliuszsompolski] > Hive 2.3 support it, do we need to support it? > I can work on this.
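Until `ADD JAR` accepts HTTP URLs, one hedged client-side workaround is to download the jar to a local path first and then issue `ADD JAR` against that path. The URL and filename below are placeholders, and a `SparkSession` named `spark` is assumed:

{code:scala}
import java.io.File
import java.net.URL
import org.apache.commons.io.FileUtils

// Workaround sketch: fetch the remote jar locally, then ADD JAR the local
// copy. The URL is a placeholder, not a real artifact location.
val remote = new URL("http://repo.example.com/jars/my-serde.jar")
val local = new File("/tmp/my-serde.jar")
FileUtils.copyURLToFile(remote, local)
spark.sql(s"ADD JAR ${local.getAbsolutePath}")
{code}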
[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29424: -- Description: Our user always submit bad SQL in query platform, Such as : # write wrong join condition but submit that sql # write wrong where condition # etc.. This case will make Spark scheduler to submit a lot of task. It will cause spark run very slow and impact other user(spark thrift server) even run out of memory because of too many object generated by a big num of tasks. So I add a constraint when submit tasks and abort stage early when TaskSet size num is bigger then set limit . I wonder if the community will accept this way. cc [~srowen] [~dongjoon] [~yumwang] was: Our user always submit bad SQL in query platform, Such as : # write wrong join condition but submit that sql # write wrong where condition # etc.. This case will make Spark scheduler to submit a lot of task. It will cause spark run very slow and impact other user(spark thrift server) even run out of memory because of too many object generated by a big num of tasks. So I add a constraint when submit tasks and abort stage early when TaskSet size num is bigger then set limit . I wonder if the community will accept this way. cc [~srowen] > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our user always submit bad SQL in query platform, Such as : > # write wrong join condition but submit that sql > # write wrong where condition > # etc.. > This case will make Spark scheduler to submit a lot of task. It will cause > spark run very slow and impact other user(spark thrift server) even run out > of memory because of too many object generated by a big num of tasks. 
> So I add a constraint when submit tasks and abort stage early when TaskSet > size num is bigger then set limit . I wonder if the community will accept > this way. > cc [~srowen] [~dongjoon] [~yumwang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29424: -- Description: Our user always submit bad SQL in query platform, Such as : # write wrong join condition but submit that sql # write wrong where condition # etc.. This case will make Spark scheduler to submit a lot of task. It will cause spark run very slow and impact other user(spark thrift server) even run out of memory because of too many object generated by a big num of tasks. So I add a constraint when submit tasks and abort stage early when TaskSet size num is bigger then set limit . I wonder if the community will accept this way. cc [~srowen] was: Our user always submit bad SQL in query platform, Such as : # write wrong join condition but submit that sql # write wrong where condition # etc.. This case will make Spark scheduler to submit a lot of task. It will cause spark run very slow and impact other user(spark thrift server) even run out of memory because of too many object generated by a big num of tasks. So i add a constraint when submit tasks.I wonder if the community will accept it > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our user always submit bad SQL in query platform, Such as : > # write wrong join condition but submit that sql > # write wrong where condition > # etc.. > This case will make Spark scheduler to submit a lot of task. It will cause > spark run very slow and impact other user(spark thrift server) even run out > of memory because of too many object generated by a big num of tasks. > So I add a constraint when submit tasks and abort stage early when TaskSet > size num is bigger then set limit . I wonder if the community will accept > this way. 
> cc [~srowen] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29424) Prevent Spark to committing stage of too much Task
[ https://issues.apache.org/jira/browse/SPARK-29424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29424: -- Description: Our user always submit bad SQL in query platform, Such as : # write wrong join condition but submit that sql # write wrong where condition # etc.. This case will make Spark scheduler to submit a lot of task. It will cause spark run very slow and impact other user(spark thrift server) even run out of memory because of too many object generated by a big num of tasks. So i add a constraint when submit tasks.I wonder if the community will accept it > Prevent Spark to committing stage of too much Task > -- > > Key: SPARK-29424 > URL: https://issues.apache.org/jira/browse/SPARK-29424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > Our user always submit bad SQL in query platform, Such as : > # write wrong join condition but submit that sql > # write wrong where condition > # etc.. > This case will make Spark scheduler to submit a lot of task. It will cause > spark run very slow and impact other user(spark thrift server) even run out > of memory because of too many object generated by a big num of tasks. > So i add a constraint when submit tasks.I wonder if the community will accept > it -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29424) Prevent Spark from submitting stages with too many tasks
angerszhu created SPARK-29424: - Summary: Prevent Spark from submitting stages with too many tasks Key: SPARK-29424 URL: https://issues.apache.org/jira/browse/SPARK-29424 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29354) Spark has a direct dependency on jline, but the 'without hadoop' binaries don't include a jline jar file.
[ https://issues.apache.org/jira/browse/SPARK-29354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948247#comment-16948247 ] angerszhu commented on SPARK-29354: --- I will check on this. > Spark has a direct dependency on jline, but the 'without hadoop' binaries > don't include a jline jar file. > - > > Key: SPARK-29354 > URL: https://issues.apache.org/jira/browse/SPARK-29354 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4 > Environment: From spark 2.3.x, spark 2.4.x >Reporter: Sungpeo Kook >Priority: Minor > > Spark has a direct dependency on jline, declared in the root pom.xml, > but the 'without hadoop' binaries don't include a jline jar file. > > Spark 2.2.x ships the jline jar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29409) spark drop partition always throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948145#comment-16948145 ] angerszhu commented on SPARK-29409: --- This looks like a mismatch between Spark's built-in Hive version and the Hive version used at runtime. > spark drop partition always throws Exception > > > Key: SPARK-29409 > URL: https://issues.apache.org/jira/browse/SPARK-29409 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: ant_nebula >Priority: Major > > The table is: > {code:java} > CREATE TABLE `test_spark.test_drop_partition`( > `platform` string, > `product` string, > `cnt` bigint) > PARTITIONED BY (dt string) > stored as orc;{code} > hive 2.1.1: > {code:java} > spark-sql -e "alter table test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > hive builtin: > {code:java} > spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf > spark.sql.hive.metastore.jars=builtin -e "alter table > test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > both would log Exception: > {code:java} > 19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current > connections: 1 > 19/10/09 18:21:27 INFO metastore: Connected to metastore. > 19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost > connection. Attempting to reconnect. 
> org.apache.thrift.transport.TTransportException: Cannot write to null > outputStream > at > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142) > at > org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178) > at > org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at >
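The comment above points at the gap between the Hive version Spark was built with and the metastore version it talks to at runtime. A configuration sketch (not a confirmed fix for this issue, and not runnable without a Spark deployment and a reachable metastore) of pointing Spark's Hive client at the same version as a 2.1.1 metastore, using the same `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` settings that already appear in the reproduction steps:

```scala
// Configuration sketch: align the Hive client version with the metastore
// server instead of running the builtin 1.2.1 client against Hive 2.1.1.
// The value "maven" tells Spark to download the matching Hive client jars.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "2.1.1")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
```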
[jira] [Comment Edited] (SPARK-29409) spark drop partition always throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947767#comment-16947767 ] angerszhu edited comment on SPARK-29409 at 10/9/19 2:54 PM: Can you share more details about how to reproduce this? was (Author: angerszhuuu): I will check on this. > spark drop partition always throws Exception > > > Key: SPARK-29409 > URL: https://issues.apache.org/jira/browse/SPARK-29409 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: ant_nebula >Priority: Major > > The table is: > {code:java} > CREATE TABLE `test_spark.test_drop_partition`( > `platform` string, > `product` string, > `cnt` bigint) > PARTITIONED BY (dt string) > stored as orc;{code} > hive 2.1.1: > {code:java} > spark-sql -e "alter table test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > hive builtin: > {code:java} > spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf > spark.sql.hive.metastore.jars=builtin -e "alter table > test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > both would log Exception: > {code:java} > 19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current > connections: 1 > 19/10/09 18:21:27 INFO metastore: Connected to metastore. > 19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost > connection. Attempting to reconnect. 
> org.apache.thrift.transport.TTransportException: Cannot write to null > outputStream > at > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142) > at > org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178) > at > org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at >
[jira] [Commented] (SPARK-29409) spark drop partition always throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947767#comment-16947767 ] angerszhu commented on SPARK-29409: --- I will check on this. > spark drop partition always throws Exception > > > Key: SPARK-29409 > URL: https://issues.apache.org/jira/browse/SPARK-29409 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: ant_nebula >Priority: Major > > The table is: > {code:java} > CREATE TABLE `test_spark.test_drop_partition`( > `platform` string, > `product` string, > `cnt` bigint) > PARTITIONED BY (dt string) > stored as orc;{code} > hive 2.1.1: > {code:java} > spark-sql -e "alter table test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > hive builtin: > {code:java} > spark-sql --conf spark.sql.hive.metastore.version=1.2.1 --conf > spark.sql.hive.metastore.jars=builtin -e "alter table > test_spark.test_drop_partition drop if exists > partition(dt='2019-10-08')"{code} > both would log Exception: > {code:java} > 19/10/09 18:21:27 INFO metastore: Opened a connection to metastore, current > connections: 1 > 19/10/09 18:21:27 INFO metastore: Connected to metastore. > 19/10/09 18:21:27 WARN RetryingMetaStoreClient: MetaStoreClient lost > connection. Attempting to reconnect. 
> org.apache.thrift.transport.TTransportException: Cannot write to null > outputStream > at > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:142) > at > org.apache.thrift.protocol.TBinaryProtocol.writeI32(TBinaryProtocol.java:178) > at > org.apache.thrift.protocol.TBinaryProtocol.writeMessageBegin(TBinaryProtocol.java:106) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:70) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_partitions_ps_with_auth(ThriftHiveMetastore.java:2433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_ps_with_auth(ThriftHiveMetastore.java:2420) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsWithAuthInfo(HiveMetaStoreClient.java:1199) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2265) > at com.sun.proxy.$Proxy30.listPartitionsWithAuthInfo(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2333) > at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2359) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:560) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1$$anonfun$16.apply(HiveClientImpl.scala:555) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply$mcV$sp(HiveClientImpl.scala:555) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$dropPartitions$1.apply(HiveClientImpl.scala:550) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at >
[jira] [Issue Comment Deleted] (SPARK-29379) SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case'
[ https://issues.apache.org/jira/browse/SPARK-29379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29379: -- Comment: was deleted (was: We don't need to add a new expression class. If we just add this code to ShowFunctionsCommand, we would have to change a lot of function UTs:
{code:scala}
case class ShowFunctionsCommand(
    db: Option[String],
    pattern: Option[String],
    showUserFunctions: Boolean,
    showSystemFunctions: Boolean) extends RunnableCommand {

  override val output: Seq[Attribute] = {
    val schema = StructType(StructField("function", StringType, nullable = false) :: Nil)
    schema.toAttributes
  }

  override def run(sparkSession: SparkSession): Seq[Row] = {
    val dbName = db.getOrElse(sparkSession.sessionState.catalog.getCurrentDatabase)
    // If pattern is not specified, we use '*', which is used to
    // match any sequence of characters (including no characters).
    val functionNames =
      sparkSession.sessionState.catalog
        .listFunctions(dbName, pattern.getOrElse("*"))
        .collect {
          case (f, "USER") if showUserFunctions => f.unquotedString
          case (f, "SYSTEM") if showSystemFunctions => f.unquotedString
        }
    (functionNames ++ Seq("!=", "<>", "between", "case")).sorted.map(Row(_))
  }
}
{code}
) > SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' > > > Key: SPARK-29379 > URL: https://issues.apache.org/jira/browse/SPARK-29379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29379) SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case'
[ https://issues.apache.org/jira/browse/SPARK-29379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946526#comment-16946526 ] angerszhu commented on SPARK-29379: --- We don't need to add a new expression class. If we just add this code to ShowFunctionsCommand, we would have to change a lot of function UTs:
{code:scala}
case class ShowFunctionsCommand(
    db: Option[String],
    pattern: Option[String],
    showUserFunctions: Boolean,
    showSystemFunctions: Boolean) extends RunnableCommand {

  override val output: Seq[Attribute] = {
    val schema = StructType(StructField("function", StringType, nullable = false) :: Nil)
    schema.toAttributes
  }

  override def run(sparkSession: SparkSession): Seq[Row] = {
    val dbName = db.getOrElse(sparkSession.sessionState.catalog.getCurrentDatabase)
    // If pattern is not specified, we use '*', which is used to
    // match any sequence of characters (including no characters).
    val functionNames =
      sparkSession.sessionState.catalog
        .listFunctions(dbName, pattern.getOrElse("*"))
        .collect {
          case (f, "USER") if showUserFunctions => f.unquotedString
          case (f, "SYSTEM") if showSystemFunctions => f.unquotedString
        }
    (functionNames ++ Seq("!=", "<>", "between", "case")).sorted.map(Row(_))
  }
}
{code}
> SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' > > > Key: SPARK-29379 > URL: https://issues.apache.org/jira/browse/SPARK-29379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
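The essence of the change above is the last line of `run`: '!=', '<>', 'between', and 'case' are handled directly by the parser rather than registered in the function catalog, so they have to be appended before sorting. A minimal standalone sketch of that merge (the catalog names below are made-up examples, not a real catalog listing):

```scala
// Standalone sketch of the proposed SHOW FUNCTIONS fix: append the
// parser-handled operators to whatever the catalog returns, then sort.
object ShowFunctionsSketch {
  // Operators the parser handles directly; they never appear in the catalog.
  val parserOnlyFunctions: Seq[String] = Seq("!=", "<>", "between", "case")

  def listWithOperators(catalogFunctions: Seq[String]): Seq[String] =
    (catalogFunctions ++ parserOnlyFunctions).sorted
}
```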
[jira] [Created] (SPARK-29379) SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case'
angerszhu created SPARK-29379: - Summary: SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' Key: SPARK-29379 URL: https://issues.apache.org/jira/browse/SPARK-29379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu SHOW FUNCTIONS doesn't show '!=', '<>', 'between', 'case' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29288) Spark SQL ADD JAR can't support HTTP paths.
[ https://issues.apache.org/jira/browse/SPARK-29288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941487#comment-16941487 ] angerszhu commented on SPARK-29288: --- [~dongjoon] Sorry for my mistake when reporting this issue. By the way, should we support Ivy paths for ADD JAR and ADD FILE? > Spark SQL ADD JAR can't support HTTP paths. > --- > > Key: SPARK-29288 > URL: https://issues.apache.org/jira/browse/SPARK-29288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > In Spark SQL, `ADD JAR` can't support URLs with the http or ivy scheme; do we need to support them? > cc [~sro...@scient.com] > [~hyukjin.kwon][~dongjoon][~jerryshao][~juliuszsompolski] > Hive 2.3 supports them, do we need to as well? > I can work on this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29288) Spark SQL ADD JAR can't support HTTP paths.
[ https://issues.apache.org/jira/browse/SPARK-29288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-29288. --- Resolution: Not A Problem > Spark SQL ADD JAR can't support HTTP paths. > --- > > Key: SPARK-29288 > URL: https://issues.apache.org/jira/browse/SPARK-29288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > In Spark SQL, `ADD JAR` can't support URLs with the http or ivy scheme; do we need to support them? > cc [~sro...@scient.com] > [~hyukjin.kwon][~dongjoon][~jerryshao][~juliuszsompolski] > Hive 2.3 supports them, do we need to as well? > I can work on this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29308) dev/deps/spark-deps-hadoop-3.2 orc jar is incorrect
[ https://issues.apache.org/jira/browse/SPARK-29308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29308: -- Description: In the hadoop-3.2 profile, orc.classifier is empty: https://github.com/apache/spark/blob/d841b33ba3a9b0504597dbccd4b0d11fa810abf3/pom.xml#L2924 So these entries are incorrect: https://github.com/apache/spark/blob/101839054276bfd52fdc29a98ffbf8e5c0383426/dev/deps/spark-deps-hadoop-3.2#L181-L182 was: In the hadoop-3.2 profile, orc.classifier is empty: https://github.com/apache/spark/blob/d841b33ba3a9b0504597dbccd4b0d11fa810abf3/pom.xml#L2924 So these entries are incorrect. > dev/deps/spark-deps-hadoop-3.2 orc jar is incorrect > > > Key: SPARK-29308 > URL: https://issues.apache.org/jira/browse/SPARK-29308 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > In the hadoop-3.2 profile, orc.classifier is empty: > https://github.com/apache/spark/blob/d841b33ba3a9b0504597dbccd4b0d11fa810abf3/pom.xml#L2924 > So these entries are incorrect: > https://github.com/apache/spark/blob/101839054276bfd52fdc29a98ffbf8e5c0383426/dev/deps/spark-deps-hadoop-3.2#L181-L182 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
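For reference, a sketch of the property in question: under the hadoop-3.2 profile the orc classifier is defined as empty, which is why classified orc entries in dev/deps/spark-deps-hadoop-3.2 look wrong. The profile structure below is abbreviated from the linked pom.xml lines, not copied verbatim:

```xml
<!-- Abbreviated sketch of the hadoop-3.2 profile property: with an empty
     orc.classifier, the orc artifacts resolve without a classifier, so the
     deps file should list plain orc jars. -->
<profile>
  <id>hadoop-3.2</id>
  <properties>
    <orc.classifier></orc.classifier>
  </properties>
</profile>
```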