[jira] [Commented] (SPARK-28424) Support typed interval expression

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096158#comment-17096158
 ] 

Apache Spark commented on SPARK-28424:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/28418

> Support typed interval expression
> -
>
> Key: SPARK-28424
> URL: https://issues.apache.org/jira/browse/SPARK-28424
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Example:
> {code:sql}
> INTERVAL '1 day 2:03:04'
> {code}
> https://www.postgresql.org/docs/11/datatype-datetime.html
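For a quick sanity check of the new literal from Scala, something like the following should work (a sketch, assuming a running SparkSession named `spark` on a build that includes SPARK-28424):

{code:scala}
// Sketch: exercising the PostgreSQL-style typed interval literal via Spark SQL.
// Assumes `spark` is an existing SparkSession on a build with this change.
spark.sql("SELECT INTERVAL '1 day 2:03:04' AS i").show(false)
// Expected to show an interval of 1 day, 2 hours, 3 minutes and 4 seconds.
{code}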



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2020-04-29 Thread liuzhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuzhang resolved SPARK-31576.
--
Resolution: Fixed

override def quoteIdentifier(colName: String): String = s"$colName"
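Put together, the dialect that worked for the reporter looks roughly like the sketch below, reconstructed from the issue description and the resolution comment above (the object name and registration call are the reporter's own; treat it as illustrative, not an official Spark dialect):

{code:scala}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Reconstructed sketch: a Hive JDBC dialect whose quoteIdentifier returns the
// bare column name (no backticks), which is what this issue was resolved with.
object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2") || url.contains("hive2")

  override def quoteIdentifier(colName: String): String = s"$colName"
}

// Register the dialect before reading through the Hive JDBC driver.
JdbcDialects.registerDialect(HiveDialect)
{code}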

> Unable to return Hive data into Spark via Hive JDBC driver Caused by:  
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED
> 
>
> Key: SPARK-31576
> URL: https://issues.apache.org/jira/browse/SPARK-31576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.3.1
> Environment: hdp 3.0,hadoop 3.1.1,spark 2.3.1
>Reporter: liuzhang
>Priority: Major
>
> I'm trying to fetch data back into Spark SQL using a JDBC connection to Hive.
> Unfortunately, whenever I try to query any of the columns I get the following
> error:
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while 
> compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 
> Invalid table alias or column reference 'test.aname': (possible column names 
> are: aname, score, banji)
>  at 
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)
> 1) On Hive, create a simple table named "test" with three columns (aname,
> score, banji), all of type String.
> 2) The important code:
> object HiveDialect extends JdbcDialect {
>   override def canHandle(url: String): Boolean =
>     url.startsWith("jdbc:hive2") || url.contains("hive2")
>   override def quoteIdentifier(colName: String): String = s"`$colName`"
> }
> ---
> object callOffRun {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
>     JdbcDialects.registerDialect(HiveDialect)
>     val props = new Properties()
>     props.put("driver", "org.apache.hive.jdbc.HiveDriver")
>     props.put("user", "username")
>     props.put("password", "password")
>     props.put("fetchsize", "20")
>     val table = spark.read.jdbc("jdbc:hive2://:1", "test", props)
>     table.show()
>   }
> }
> 3) After running it via spark-submit, it fails with the error:
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while 
> compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 
> Invalid table alias or column reference 'test.aname': (possible column names 
> are: aname, score, banji)
> 4) table.count() returns a result.
> 5) I tried several ways to print the result; they all reported the same error.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2020-04-29 Thread liuzhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuzhang closed SPARK-31576.


> Unable to return Hive data into Spark via Hive JDBC driver Caused by:  
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED
> 
>
> Key: SPARK-31576
> URL: https://issues.apache.org/jira/browse/SPARK-31576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.3.1
> Environment: hdp 3.0,hadoop 3.1.1,spark 2.3.1
>Reporter: liuzhang
>Priority: Major
>
> I'm trying to fetch data back into Spark SQL using a JDBC connection to Hive.
> Unfortunately, whenever I try to query any of the columns I get the following
> error:
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while 
> compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 
> Invalid table alias or column reference 'test.aname': (possible column names 
> are: aname, score, banji)
>  at 
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)
> 1) On Hive, create a simple table named "test" with three columns (aname,
> score, banji), all of type String.
> 2) The important code:
> object HiveDialect extends JdbcDialect {
>   override def canHandle(url: String): Boolean =
>     url.startsWith("jdbc:hive2") || url.contains("hive2")
>   override def quoteIdentifier(colName: String): String = s"`$colName`"
> }
> ---
> object callOffRun {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
>     JdbcDialects.registerDialect(HiveDialect)
>     val props = new Properties()
>     props.put("driver", "org.apache.hive.jdbc.HiveDriver")
>     props.put("user", "username")
>     props.put("password", "password")
>     props.put("fetchsize", "20")
>     val table = spark.read.jdbc("jdbc:hive2://:1", "test", props)
>     table.show()
>   }
> }
> 3) After running it via spark-submit, it fails with the error:
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while 
> compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 
> Invalid table alias or column reference 'test.aname': (possible column names 
> are: aname, score, banji)
> 4) table.count() returns a result.
> 5) I tried several ways to print the result; they all reported the same error.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31614) Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2020-04-29 Thread liuzhang (Jira)
liuzhang created SPARK-31614:


 Summary: Unable to write  data into hive table using Spark via 
Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error 
while compiling statement: FAILED
 Key: SPARK-31614
 URL: https://issues.apache.org/jira/browse/SPARK-31614
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Spark Submit
Affects Versions: 2.3.1
 Environment: HDP3.0,spark 2.3.1,hadoop 3.1.1
Reporter: liuzhang


I'm trying to write data into a Hive table using a JDBC connection to Hive. 
Unfortunately, whenever I try to write the data I get the following error:

org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: 
FAILED: ParseException line 1:36 cannot recognize input near '.' 'aname' 'TEXT' 
in column type
 at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:255)
 at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:241)

1) On Hive, create a simple table named "test" with three columns (aname, 
score, banji), all of type String.

2) The important code:

object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2") || url.contains("hive2")
  override def quoteIdentifier(colName: String): String = s"$colName"
}

---

object callOffRun {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    JdbcDialects.registerDialect(HiveDialect)
    val props = new Properties()
    props.put("driver", "org.apache.hive.jdbc.HiveDriver")
    props.put("user", "username")
    props.put("password", "password")
    props.put("fetchsize", "20")

    val table = spark.read.jdbc("jdbc:hive2://:1", "test", props)

    table.write.jdbc("jdbc:hive2://:1", "resulttable", props)
  }
}

3) After running it via spark-submit, the error occurs when the table is written.

4) table.count() returns a result.

5) I tried several ways to write the data into the table; they all reported the same error.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20732) Copy cache data when node is being shut down

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20732:


Assignee: Apache Spark  (was: Prakhar Jain)

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31601) Fix spark.kubernetes.executor.podNamePrefix to work

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31601.
---
Fix Version/s: 3.0.0
   2.4.6
 Assignee: Dongjoon Hyun
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/28401

> Fix spark.kubernetes.executor.podNamePrefix to work
> ---
>
> Key: SPARK-31601
> URL: https://issues.apache.org/jira/browse/SPARK-31601
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20732) Copy cache data when node is being shut down

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096138#comment-17096138
 ] 

Apache Spark commented on SPARK-20732:
--

User 'prakharjain09' has created a pull request for this issue:
https://github.com/apache/spark/pull/28370

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Prakhar Jain
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20732) Copy cache data when node is being shut down

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20732:


Assignee: Prakhar Jain  (was: Apache Spark)

> Copy cache data when node is being shut down
> 
>
> Key: SPARK-20732
> URL: https://issues.apache.org/jira/browse/SPARK-20732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Prakhar Jain
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31613) How can i run spark3.0.0.preview2 version spark on Spark 2.4 CDH cluster?

2020-04-29 Thread Vutukuri Sathvik (Jira)
Vutukuri Sathvik created SPARK-31613:


 Summary: How can i run spark3.0.0.preview2 version spark on Spark 
2.4 CDH cluster?
 Key: SPARK-31613
 URL: https://issues.apache.org/jira/browse/SPARK-31613
 Project: Spark
  Issue Type: Question
  Components: Spark Submit
Affects Versions: 3.0.0
Reporter: Vutukuri Sathvik


I am trying to run spark-submit with a fat jar that bundles the Spark 3.0.0 
dependencies on a Spark 2.4 CDH cluster, but the Spark 2.4 version is still the 
one that runs.

How can I bypass Spark 2.4 and use Spark 3.0.0 when doing spark-submit?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31612) SQL Reference clean up

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096124#comment-17096124
 ] 

Apache Spark commented on SPARK-31612:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28417

> SQL Reference clean up
> --
>
> Key: SPARK-31612
> URL: https://issues.apache.org/jira/browse/SPARK-31612
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> SQL Reference clean up



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31612) SQL Reference clean up

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096122#comment-17096122
 ] 

Apache Spark commented on SPARK-31612:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28417

> SQL Reference clean up
> --
>
> Key: SPARK-31612
> URL: https://issues.apache.org/jira/browse/SPARK-31612
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> SQL Reference clean up



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31612) SQL Reference clean up

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31612:


Assignee: (was: Apache Spark)

> SQL Reference clean up
> --
>
> Key: SPARK-31612
> URL: https://issues.apache.org/jira/browse/SPARK-31612
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> SQL Reference clean up



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31612) SQL Reference clean up

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31612:


Assignee: Apache Spark

> SQL Reference clean up
> --
>
> Key: SPARK-31612
> URL: https://issues.apache.org/jira/browse/SPARK-31612
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> SQL Reference clean up



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31557) Legacy parser incorrectly interprets pre-Gregorian dates

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096116#comment-17096116
 ] 

Apache Spark commented on SPARK-31557:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28408

> Legacy parser incorrectly interprets pre-Gregorian dates
> 
>
> Key: SPARK-31557
> URL: https://issues.apache.org/jira/browse/SPARK-31557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.0.0
>
>
> With CSV:
> {noformat}
> scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", 
> "1800-01-01").map(x => s"$x,$x")
> seq: Seq[String] = List(0002-01-01,0002-01-01, 1000-01-01,1000-01-01, 
> 1500-01-01,1500-01-01, 1800-01-01,1800-01-01)
> scala> val ds = seq.toDF("value").as[String]
> ds: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> spark.read.schema("expected STRING, actual DATE").csv(ds).show
> +--+--+
> |  expected|actual|
> +--+--+
> |0002-01-01|0001-12-30|
> |1000-01-01|1000-01-06|
> |1500-01-01|1500-01-10|
> |1800-01-01|1800-01-01|
> +--+--+
> scala> 
> {noformat}
> Similarly, with JSON:
> {noformat}
> scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", 
> "1800-01-01").map { x =>
>   s"""{"expected": "$x", "actual": "$x"}"""
> }
>  |  | seq: Seq[String] = List({"expected": "0002-01-01", "actual": 
> "0002-01-01"}, {"expected": "1000-01-01", "actual": "1000-01-01"}, 
> {"expected": "1500-01-01", "actual": "1500-01-01"}, {"expected": 
> "1800-01-01", "actual": "1800-01-01"})
> scala> 
> scala> val ds = seq.toDF("value").as[String]
> ds: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> spark.read.schema("expected STRING, actual DATE").json(ds).show
> +--+--+
> |  expected|actual|
> +--+--+
> |0002-01-01|0001-12-30|
> |1000-01-01|1000-01-06|
> |1500-01-01|1500-01-10|
> |1800-01-01|1800-01-01|
> +--+--+
> scala> 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31612) SQL Reference clean up

2020-04-29 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31612:
--

 Summary: SQL Reference clean up
 Key: SPARK-31612
 URL: https://issues.apache.org/jira/browse/SPARK-31612
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


SQL Reference clean up



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31127) Add abstract Selector

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31127:


Assignee: (was: Apache Spark)

> Add abstract Selector
> -
>
> Key: SPARK-31127
> URL: https://issues.apache.org/jira/browse/SPARK-31127
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add abstract Selector. Put the common code between ChisqSelector and 
> FValueSelector to Selector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31127) Add abstract Selector

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31127:


Assignee: Apache Spark

> Add abstract Selector
> -
>
> Key: SPARK-31127
> URL: https://issues.apache.org/jira/browse/SPARK-31127
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Add abstract Selector. Put the common code between ChisqSelector and 
> FValueSelector to Selector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31372) Display expression schema for double checkout alias

2020-04-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31372.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28194
[https://github.com/apache/spark/pull/28194]

> Display expression schema for double checkout alias
> ---
>
> Key: SPARK-31372
> URL: https://issues.apache.org/jira/browse/SPARK-31372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0
>
>
> Although SPARK-30184 implemented a helper method for aliasing functions,
> developers often forget to use this improvement.
> We need to add stronger guarantees so that the aliases output by built-in
> functions are correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31372) Display expression schema for double checkout alias

2020-04-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31372:
---

Assignee: jiaan.geng

> Display expression schema for double checkout alias
> ---
>
> Key: SPARK-31372
> URL: https://issues.apache.org/jira/browse/SPARK-31372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Although SPARK-30184 implemented a helper method for aliasing functions,
> developers often forget to use this improvement.
> We need to add stronger guarantees so that the aliases output by built-in
> functions are correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31611:


Assignee: (was: Apache Spark)

> Register NettyMemoryMetrics into Node Manager's metrics system
> --
>
> Key: SPARK-31611
> URL: https://issues.apache.org/jira/browse/SPARK-31611
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31611:


Assignee: Apache Spark

> Register NettyMemoryMetrics into Node Manager's metrics system
> --
>
> Key: SPARK-31611
> URL: https://issues.apache.org/jira/browse/SPARK-31611
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096090#comment-17096090
 ] 

Apache Spark commented on SPARK-31611:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/28416

> Register NettyMemoryMetrics into Node Manager's metrics system
> --
>
> Key: SPARK-31611
> URL: https://issues.apache.org/jira/browse/SPARK-31611
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31595:


Assignee: Apache Spark

> Spark sql cli should allow unescaped quote mark in quoted string
> 
>
> Key: SPARK-31595
> URL: https://issues.apache.org/jira/browse/SPARK-31595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Adrian Wang
>Assignee: Apache Spark
>Priority: Major
>
> spark-sql> select "'";
> spark-sql> select '"';
> In the Spark parser, if we pass the text `select "'";`, a
> ParserCancellationException is thrown and then handled by PredictionMode.LL. By
> dropping the `;` correctly we can avoid that retry.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31595:


Assignee: (was: Apache Spark)

> Spark sql cli should allow unescaped quote mark in quoted string
> 
>
> Key: SPARK-31595
> URL: https://issues.apache.org/jira/browse/SPARK-31595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Adrian Wang
>Priority: Major
>
> spark-sql> select "'";
> spark-sql> select '"';
> In the Spark parser, if we pass the text `select "'";`, a
> ParserCancellationException is thrown and then handled by PredictionMode.LL. By
> dropping the `;` correctly we can avoid that retry.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31608) Add a hybrid KVStore to make UI loading faster

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31608:


Assignee: (was: Apache Spark)

> Add a hybrid KVStore to make UI loading faster
> --
>
> Key: SPARK-31608
> URL: https://issues.apache.org/jira/browse/SPARK-31608
> Project: Spark
>  Issue Type: Story
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Baohe Zhang
>Priority: Major
>
> This is a follow-up for the work done by Hieu Huynh in 2019.
> Add a new class HybridKVStore to make the history server faster when loading
> event files. When writing to this kvstore, it first writes to an in-memory
> store and has a background thread that keeps pushing the changes to LevelDB.
> I ran some tests with 3.0.1 on macOS:
> ||kvstore type / log size||100m||200m||500m||1g||2g||
> |HybridKVStore|5s to parse, 7s (including parsing) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb|
> |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|
>  
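A minimal sketch of the write path described above (illustrative only, not Spark's actual HybridKVStore; the DiskStore trait stands in for the LevelDB-backed store):

{code:scala}
import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue}

// Illustrative sketch of the hybrid-store idea, NOT Spark's HybridKVStore.
// Writes land in memory immediately; a daemon thread drains them to disk.
trait DiskStore { def write(key: String, value: Array[Byte]): Unit }

class HybridStoreSketch(disk: DiskStore) {
  private val memory = new ConcurrentHashMap[String, Array[Byte]]()
  private val pending = new LinkedBlockingQueue[String]()

  // Background thread keeps pushing changes from the in-memory store to disk.
  private val drainer = new Thread(() => {
    while (true) {
      val key = pending.take()          // blocks until a change arrives
      disk.write(key, memory.get(key))  // push the latest value to disk
    }
  })
  drainer.setDaemon(true)
  drainer.start()

  def write(key: String, value: Array[Byte]): Unit = {
    memory.put(key, value)  // fast, in-memory write path
    pending.offer(key)      // schedule the asynchronous flush
  }

  def read(key: String): Array[Byte] = memory.get(key)
}
{code}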



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31608) Add a hybrid KVStore to make UI loading faster

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31608:


Assignee: Apache Spark

> Add a hybrid KVStore to make UI loading faster
> --
>
> Key: SPARK-31608
> URL: https://issues.apache.org/jira/browse/SPARK-31608
> Project: Spark
>  Issue Type: Story
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Baohe Zhang
>Assignee: Apache Spark
>Priority: Major
>
> This is a follow-up for the work done by Hieu Huynh in 2019.
> Add a new class HybridKVStore to make the history server faster when loading
> event files. When writing to this kvstore, it first writes to an in-memory
> store and has a background thread that keeps pushing the changes to LevelDB.
> I ran some tests with 3.0.1 on macOS:
> ||kvstore type / log size||100m||200m||500m||1g||2g||
> |HybridKVStore|5s to parse, 7s (including parsing) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb|
> |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system

2020-04-29 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-31611:
--

 Summary: Register NettyMemoryMetrics into Node Manager's metrics 
system
 Key: SPARK-31611
 URL: https://issues.apache.org/jira/browse/SPARK-31611
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, YARN
Affects Versions: 3.0.0
Reporter: Manu Zhang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31553) Wrong result of isInCollection for large collections

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096068#comment-17096068
 ] 

Apache Spark commented on SPARK-31553:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28405

> Wrong result of isInCollection for large collections
> 
>
> Key: SPARK-31553
> URL: https://issues.apache.org/jira/browse/SPARK-31553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> If the size of a collection passed to isInCollection is bigger than 
> spark.sql.optimizer.inSetConversionThreshold, the method can return wrong 
> results for some inputs. For example:
> {code:scala}
> val set = (0 to 20).map(_.toString).toSet
> val data = Seq("1").toDF("x")
> println(set.contains("1"))
> data.select($"x".isInCollection(set).as("isInCollection")).show()
> {code}
> {code}
> true
> +--+
> |isInCollection|
> +--+
> | false|
> +--+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31553) Wrong result of isInCollection for large collections

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096067#comment-17096067
 ] 

Apache Spark commented on SPARK-31553:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28405

> Wrong result of isInCollection for large collections
> 
>
> Key: SPARK-31553
> URL: https://issues.apache.org/jira/browse/SPARK-31553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> If the size of a collection passed to isInCollection is bigger than 
> spark.sql.optimizer.inSetConversionThreshold, the method can return wrong 
> results for some inputs. For example:
> {code:scala}
> val set = (0 to 20).map(_.toString).toSet
> val data = Seq("1").toDF("x")
> println(set.contains("1"))
> data.select($"x".isInCollection(set).as("isInCollection")).show()
> {code}
> {code}
> true
> +--+
> |isInCollection|
> +--+
> | false|
> +--+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation

2020-04-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31449:
---

Assignee: Maxim Gekk

> Investigate the difference between JDK and Spark's time zone offset 
> calculation
> ---
>
> Key: SPARK-31449
> URL: https://issues.apache.org/jira/browse/SPARK-31449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Spark 2.4 calculates time zone offsets from wall clock timestamp using 
> `DateTimeUtils.getOffsetFromLocalMillis()` (see 
> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118):
> {code:scala}
>   private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): 
> Long = {
> var guess = tz.getRawOffset
> // the actual offset should be calculated based on milliseconds in UTC
> val offset = tz.getOffset(millisLocal - guess)
> if (offset != guess) {
>   guess = tz.getOffset(millisLocal - offset)
>   if (guess != offset) {
> // fallback to do the reverse lookup using java.sql.Timestamp
> // this should only happen near the start or end of DST
> val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
> val year = getYear(days)
> val month = getMonth(days)
> val day = getDayOfMonth(days)
> var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt
> if (millisOfDay < 0) {
>   millisOfDay += MILLIS_PER_DAY.toInt
> }
> val seconds = (millisOfDay / 1000L).toInt
> val hh = seconds / 3600
> val mm = seconds / 60 % 60
> val ss = seconds % 60
> val ms = millisOfDay % 1000
> val calendar = Calendar.getInstance(tz)
> calendar.set(year, month - 1, day, hh, mm, ss)
> calendar.set(Calendar.MILLISECOND, ms)
> guess = (millisLocal - calendar.getTimeInMillis()).toInt
>   }
> }
> guess
>   }
> {code}
> Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see 
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801:
> {code:java}
> if (zone instanceof ZoneInfo) {
> ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets);
> } else {
> int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ?
> internalGet(ZONE_OFFSET) : 
> zone.getRawOffset();
> zone.getOffsets(millis - gmtOffset, zoneOffsets);
> }
> {code}
> We need to investigate whether there are any differences in results between the two approaches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31372) Display expression schema for double checkout alias

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31372:


Assignee: Apache Spark

> Display expression schema for double checkout alias
> ---
>
> Key: SPARK-31372
> URL: https://issues.apache.org/jira/browse/SPARK-31372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Although SPARK-30184 implemented a helper method for aliasing functions,
> developers often forget to use this improvement.
> We need to add stronger guarantees so that the aliases output by built-in
> functions are correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31602) memory leak of JobConf

2020-04-29 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096059#comment-17096059
 ] 

Wenchen Fan edited comment on SPARK-31602 at 4/30/20, 3:15 AM:
---

The cache is a soft-reference map, which should not cause OOM?


was (Author: cloud_fan):
It's a soft-reference map, which should not cause OOM?

> memory leak of JobConf
> --
>
> Key: SPARK-31602
> URL: https://issues.apache.org/jira/browse/SPARK-31602
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2020-04-29-14-34-39-496.png, 
> image-2020-04-29-14-35-55-986.png
>
>
> !image-2020-04-29-14-34-39-496.png!
> !image-2020-04-29-14-35-55-986.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31372) Display expression schema for double checkout alias

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31372:


Assignee: (was: Apache Spark)

> Display expression schema for double checkout alias
> ---
>
> Key: SPARK-31372
> URL: https://issues.apache.org/jira/browse/SPARK-31372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> Although SPARK-30184 implemented a helper method for aliasing functions,
> developers often forget to use this improvement.
> We need to add stronger guarantees so that the aliases output by built-in
> functions are correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31030) Backward Compatibility for Parsing and Formatting Datetime

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096060#comment-17096060
 ] 

Apache Spark commented on SPARK-31030:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/28415

> Backward Compatibility for Parsing and Formatting Datetime
> --
>
> Key: SPARK-31030
> URL: https://issues.apache.org/jira/browse/SPARK-31030
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2020-03-04-10-54-05-208.png, 
> image-2020-03-04-10-54-13-238.png
>
>
> *Background*
> In Spark version 2.4 and earlier, datetime parsing, formatting and conversion 
> are performed by using the hybrid calendar ([Julian + 
> Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]).
>  
> Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as
> well as the one chosen in the ANSI SQL standard, Spark 3.0 switches to it by
> using Java 8 API classes (the java.time packages that are based on [ISO
> chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]).
> The switch was completed in SPARK-26651.
>  
> *Problem*
> Switching to the Java 8 datetime API breaks backward compatibility with Spark
> 2.4 and earlier when parsing datetimes. Spark needs its own pattern definitions
> for datetime parsing and formatting.
>  
> *Solution*
> To avoid unexpected result changes after the underlying datetime API switch, 
> we propose the following solution. 
>  * Introduce the fallback mechanism: when the Java 8-based parser fails, we 
> need to detect these behavior differences by falling back to the legacy 
> parser, and fail with a user-friendly error message to tell users what gets 
> changed and how to fix the pattern.
>  * Document Spark’s datetime patterns: Spark’s date-time formatter is
> decoupled from the Java patterns. Spark’s patterns are mainly based on the
> [Java 7 pattern|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]
> (for better backward compatibility), with customized logic covering the
> breaking changes between the [Java 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]
> and [Java 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]
> pattern strings. Below are the customized rules:
> ||Pattern||Java 7||Java 8||Example||Rule||
> |u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (unlike y, u accepts a negative value to represent BC, while y should be used together with G to do the same thing)|!image-2020-03-04-10-54-05-208.png!|Substitute ‘u’ with ‘e’ and use the Java 8 parser to parse the string. If parsable, return the result; otherwise, fall back to ‘u’ and use the legacy Java 7 parser. If that parse succeeds, throw an exception asking users to change the pattern string or turn on the legacy mode; otherwise, return NULL as Spark 2.4 does.|
> |z|General time zone, which also accepts [RFC 822 time zones|#rfc822timezone]|Only accepts time-zone names, e.g. Pacific Standard Time; PST|!image-2020-03-04-10-54-13-238.png!|The semantics of ‘z’ differ between Java 7 and Java 8; Spark 3.0 follows the semantics of Java 8. Use the Java 8 parser to parse the string. If parsable, return the result; otherwise, use the legacy Java 7 parser. If that parse succeeds, throw an exception asking users to change the pattern string or turn on the legacy mode; otherwise, return NULL as Spark 2.4 does.|
>  
>  
>  
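A rough sketch of the fallback shape described in the solution above (illustrative only; parseJava8 and parseLegacy are hypothetical stand-ins for the two parsers, not Spark APIs, and the message text is paraphrased):

{code:scala}
import java.sql.Date

// Illustrative sketch of the fallback mechanism; the two parse functions are
// hypothetical stand-ins for the Java 8-based and legacy parsers.
def parseWithFallback(
    s: String,
    parseJava8: String => Option[Date],
    parseLegacy: String => Option[Date]): Option[Date] = {
  parseJava8(s) match {
    case ok @ Some(_) => ok  // the new parser succeeded
    case None =>
      parseLegacy(s) match {
        case Some(_) =>
          // The legacy parser would have accepted it: surface the behavior
          // change instead of silently returning a different result.
          throw new IllegalArgumentException(
            s"Failed to parse '$s' with the new parser. Set " +
              "spark.sql.legacy.timeParserPolicy to LEGACY to restore the " +
              "old behavior, or fix the datetime pattern.")
        case None => None    // both parsers fail: return NULL, as Spark 2.4 does
      }
  }
}
{code}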



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096064#comment-17096064
 ] 

Apache Spark commented on SPARK-31449:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28410

> Investigate the difference between JDK and Spark's time zone offset 
> calculation
> ---
>
> Key: SPARK-31449
> URL: https://issues.apache.org/jira/browse/SPARK-31449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.6
>
>
> Spark 2.4 calculates time zone offsets from wall clock timestamp using 
> `DateTimeUtils.getOffsetFromLocalMillis()` (see 
> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118):
> {code:scala}
>   private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): 
> Long = {
> var guess = tz.getRawOffset
> // the actual offset should be calculated based on milliseconds in UTC
> val offset = tz.getOffset(millisLocal - guess)
> if (offset != guess) {
>   guess = tz.getOffset(millisLocal - offset)
>   if (guess != offset) {
> // fallback to do the reverse lookup using java.sql.Timestamp
> // this should only happen near the start or end of DST
> val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
> val year = getYear(days)
> val month = getMonth(days)
> val day = getDayOfMonth(days)
> var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt
> if (millisOfDay < 0) {
>   millisOfDay += MILLIS_PER_DAY.toInt
> }
> val seconds = (millisOfDay / 1000L).toInt
> val hh = seconds / 3600
> val mm = seconds / 60 % 60
> val ss = seconds % 60
> val ms = millisOfDay % 1000
> val calendar = Calendar.getInstance(tz)
> calendar.set(year, month - 1, day, hh, mm, ss)
> calendar.set(Calendar.MILLISECOND, ms)
> guess = (millisLocal - calendar.getTimeInMillis()).toInt
>   }
> }
> guess
>   }
> {code}
> Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see 
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801:
> {code:java}
> if (zone instanceof ZoneInfo) {
> ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets);
> } else {
> int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ?
> internalGet(ZONE_OFFSET) : 
> zone.getRawOffset();
> zone.getOffsets(millis - gmtOffset, zoneOffsets);
> }
> {code}
> We need to investigate whether there are any differences in results between the two approaches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation

2020-04-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31449.
-
Fix Version/s: 2.4.6
   Resolution: Fixed

Issue resolved by pull request 28410
[https://github.com/apache/spark/pull/28410]

> Investigate the difference between JDK and Spark's time zone offset 
> calculation
> ---
>
> Key: SPARK-31449
> URL: https://issues.apache.org/jira/browse/SPARK-31449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.6
>
>
> Spark 2.4 calculates time zone offsets from wall clock timestamp using 
> `DateTimeUtils.getOffsetFromLocalMillis()` (see 
> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118):
> {code:scala}
>   private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): 
> Long = {
> var guess = tz.getRawOffset
> // the actual offset should be calculated based on milliseconds in UTC
> val offset = tz.getOffset(millisLocal - guess)
> if (offset != guess) {
>   guess = tz.getOffset(millisLocal - offset)
>   if (guess != offset) {
> // fallback to do the reverse lookup using java.sql.Timestamp
> // this should only happen near the start or end of DST
> val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
> val year = getYear(days)
> val month = getMonth(days)
> val day = getDayOfMonth(days)
> var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt
> if (millisOfDay < 0) {
>   millisOfDay += MILLIS_PER_DAY.toInt
> }
> val seconds = (millisOfDay / 1000L).toInt
> val hh = seconds / 3600
> val mm = seconds / 60 % 60
> val ss = seconds % 60
> val ms = millisOfDay % 1000
> val calendar = Calendar.getInstance(tz)
> calendar.set(year, month - 1, day, hh, mm, ss)
> calendar.set(Calendar.MILLISECOND, ms)
> guess = (millisLocal - calendar.getTimeInMillis()).toInt
>   }
> }
> guess
>   }
> {code}
> Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see 
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801:
> {code:java}
> if (zone instanceof ZoneInfo) {
> ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets);
> } else {
> int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ?
> internalGet(ZONE_OFFSET) : 
> zone.getRawOffset();
> zone.getOffsets(millis - gmtOffset, zoneOffsets);
> }
> {code}
> We need to investigate whether there are any differences in results between the two approaches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31030) Backward Compatibility for Parsing and Formatting Datetime

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096061#comment-17096061
 ] 

Apache Spark commented on SPARK-31030:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/28415

> Backward Compatibility for Parsing and Formatting Datetime
> --
>
> Key: SPARK-31030
> URL: https://issues.apache.org/jira/browse/SPARK-31030
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2020-03-04-10-54-05-208.png, 
> image-2020-03-04-10-54-13-238.png
>
>
> *Background*
> In Spark version 2.4 and earlier, datetime parsing, formatting and conversion 
> are performed by using the hybrid calendar ([Julian + 
> Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]).
>  
> Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as
> well as the one chosen in the ANSI SQL standard, Spark 3.0 switches to it by
> using Java 8 API classes (the java.time packages that are based on [ISO
> chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]).
> The switch was completed in SPARK-26651.
>  
> *Problem*
> Switching to the Java 8 datetime API breaks backward compatibility with Spark
> 2.4 and earlier when parsing datetimes. Spark needs its own pattern definitions
> for datetime parsing and formatting.
>  
> *Solution*
> To avoid unexpected result changes after the underlying datetime API switch, 
> we propose the following solution. 
>  * Introduce the fallback mechanism: when the Java 8-based parser fails, we 
> need to detect these behavior differences by falling back to the legacy 
> parser, and fail with a user-friendly error message to tell users what gets 
> changed and how to fix the pattern.
>  * Document Spark’s datetime patterns: Spark’s date-time formatter is
> decoupled from the Java patterns. Spark’s patterns are mainly based on the
> [Java 7 pattern|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]
> (for better backward compatibility), with customized logic covering the
> breaking changes between the [Java 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]
> and [Java 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]
> pattern strings. Below are the customized rules:
> ||Pattern||Java 7||Java 8||Example||Rule||
> |u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (unlike y, u accepts a negative value to represent BC, while y should be used together with G to do the same thing)|!image-2020-03-04-10-54-05-208.png!|Substitute ‘u’ with ‘e’ and use the Java 8 parser to parse the string. If parsable, return the result; otherwise, fall back to ‘u’ and use the legacy Java 7 parser. If that parse succeeds, throw an exception asking users to change the pattern string or turn on the legacy mode; otherwise, return NULL as Spark 2.4 does.|
> |z|General time zone, which also accepts [RFC 822 time zones|#rfc822timezone]|Only accepts time-zone names, e.g. Pacific Standard Time; PST|!image-2020-03-04-10-54-13-238.png!|The semantics of ‘z’ differ between Java 7 and Java 8; Spark 3.0 follows the semantics of Java 8. Use the Java 8 parser to parse the string. If parsable, return the result; otherwise, use the legacy Java 7 parser. If that parse succeeds, throw an exception asking users to change the pattern string or turn on the legacy mode; otherwise, return NULL as Spark 2.4 does.|
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31602) memory leak of JobConf

2020-04-29 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096059#comment-17096059
 ] 

Wenchen Fan commented on SPARK-31602:
-

It's a soft-reference map, which should not cause OOM?

> memory leak of JobConf
> --
>
> Key: SPARK-31602
> URL: https://issues.apache.org/jira/browse/SPARK-31602
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2020-04-29-14-34-39-496.png, 
> image-2020-04-29-14-35-55-986.png
>
>
> !image-2020-04-29-14-34-39-496.png!
> !image-2020-04-29-14-35-55-986.png!
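To illustrate the comment above about soft references, a small sketch (assuming
Guava on the classpath; the cache and key names are only illustrative) of a
soft-valued map, whose entries the GC may reclaim under memory pressure before
an OOM is thrown:
{code:scala}
import com.google.common.cache.CacheBuilder

// Soft-valued cache: values are held through SoftReferences, so the GC can clear
// them when memory gets tight instead of letting the heap fill up.
val cache = CacheBuilder.newBuilder()
  .softValues()
  .build[String, Array[Byte]]()

cache.put("jobConf-1", new Array[Byte](1 << 20)) // ~1 MB payload, illustrative only
// After a GC under memory pressure this may come back as None even though the key was put.
val maybeConf = Option(cache.getIfPresent("jobConf-1"))
{code}
An OOM can still happen if something else holds hard references to the same
objects, which is typically what a heap dump like the attached screenshots is
used to check.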



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31601) Fix spark.kubernetes.executor.podNamePrefix to work

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31601:


Assignee: (was: Apache Spark)

> Fix spark.kubernetes.executor.podNamePrefix to work
> ---
>
> Key: SPARK-31601
> URL: https://issues.apache.org/jira/browse/SPARK-31601
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31601) Fix spark.kubernetes.executor.podNamePrefix to work

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31601:


Assignee: Apache Spark

> Fix spark.kubernetes.executor.podNamePrefix to work
> ---
>
> Key: SPARK-31601
> URL: https://issues.apache.org/jira/browse/SPARK-31601
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31601) Fix spark.kubernetes.executor.podNamePrefix to work

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096054#comment-17096054
 ] 

Apache Spark commented on SPARK-31601:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/28401

> Fix spark.kubernetes.executor.podNamePrefix to work
> ---
>
> Key: SPARK-31601
> URL: https://issues.apache.org/jira/browse/SPARK-31601
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8981) Set applicationId and appName in log4j MDC

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8981:
---

Assignee: (was: Apache Spark)

> Set applicationId and appName in log4j MDC
> --
>
> Key: SPARK-8981
> URL: https://issues.apache.org/jira/browse/SPARK-8981
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Paweł Kopiczko
>Priority: Minor
>
> It would be nice to have, because it's good to have logs in one file when 
> using log agents (like Logentries) in standalone mode. It also allows 
> configuring a rolling file appender without a mess when multiple applications 
> are running.
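For reference, a minimal sketch of what setting these keys from user code could
look like with the log4j 1.x MDC that Spark currently ships (the layout pattern
is just an example):
{code:scala}
import org.apache.log4j.MDC
import org.apache.spark.sql.SparkSession

// Put the application identifiers into the MDC so any appender layout can reference them.
val spark = SparkSession.builder().appName("mdc-demo").getOrCreate()
MDC.put("applicationId", spark.sparkContext.applicationId)
MDC.put("appName", spark.sparkContext.appName)
// A layout such as "%X{applicationId} %X{appName} %m%n" then prefixes every log line.
{code}
Note that MDC values are thread-local, so a built-in hook in Spark would need to
cover task and event-loop threads as well, not only the user's driver thread.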



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8981) Set applicationId and appName in log4j MDC

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8981:
---

Assignee: Apache Spark

> Set applicationId and appName in log4j MDC
> --
>
> Key: SPARK-8981
> URL: https://issues.apache.org/jira/browse/SPARK-8981
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Paweł Kopiczko
>Assignee: Apache Spark
>Priority: Minor
>
> It would be nice to have, because it's good to have logs in one file when 
> using log agents (like Logentries) in standalone mode. It also allows 
> configuring a rolling file appender without a mess when multiple applications 
> are running.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28070) writeType and writeObject in SparkR should be handled by S3 methods

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096044#comment-17096044
 ] 

Apache Spark commented on SPARK-28070:
--

User 'MichaelChirico' has created a pull request for this issue:
https://github.com/apache/spark/pull/28379

> writeType and writeObject in SparkR should be handled by S3 methods
> ---
>
> Key: SPARK-28070
> URL: https://issues.apache.org/jira/browse/SPARK-28070
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.1.0
>Reporter: Michael Chirico
>Priority: Major
>
> Corollary of https://issues.apache.org/jira/browse/SPARK-28040
> The way writeType and writeObject are handled now feels a bit hack-ish; it 
> would be easier to manage with S3 or S4.
> NB: S3 will require changing the argument order so that dispatch happens on the 
> object's class (currently the first argument is con)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28070) writeType and writeObject in SparkR should be handled by S3 methods

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096043#comment-17096043
 ] 

Apache Spark commented on SPARK-28070:
--

User 'MichaelChirico' has created a pull request for this issue:
https://github.com/apache/spark/pull/28379

> writeType and writeObject in SparkR should be handled by S3 methods
> ---
>
> Key: SPARK-28070
> URL: https://issues.apache.org/jira/browse/SPARK-28070
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.1.0
>Reporter: Michael Chirico
>Priority: Major
>
> Corollary of https://issues.apache.org/jira/browse/SPARK-28040
> The way writeType and writeObject are handled now feels a bit hack-ish; it 
> would be easier to manage with S3 or S4.
> NB: S3 will require changing the argument order so that dispatch happens on the 
> object's class (currently the first argument is con)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28040) sql() fails to process output of glue::glue_data()

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096042#comment-17096042
 ] 

Apache Spark commented on SPARK-28040:
--

User 'MichaelChirico' has created a pull request for this issue:
https://github.com/apache/spark/pull/28379

> sql() fails to process output of glue::glue_data()
> --
>
> Key: SPARK-28040
> URL: https://issues.apache.org/jira/browse/SPARK-28040
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.3
>Reporter: Michael Chirico
>Priority: Major
>
> The {{glue}} package is quite natural for sending parameterized queries to Spark 
> from R, very similar to Python's {{format}} for strings. The error is as simple as
> {code:java}
> library(glue)
> library(sparkR)
> sparkR.session()
> query = glue_data(list(val = 4), 'select {val}')
> sql(query){code}
> Error in writeType(con, serdeType) : 
>   Unsupported type for serialization glue
> {{sql(as.character(query))}} works as expected but this is a bit awkward / 
> post-hoc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31350) Coalesce bucketed tables for join if applicable

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31350:


Assignee: (was: Apache Spark)

> Coalesce bucketed tables for join if applicable
> ---
>
> Key: SPARK-31350
> URL: https://issues.apache.org/jira/browse/SPARK-31350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> The following example of joining two bucketed tables introduces a full 
> shuffle:
> {code:java}
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
> val df1 = (0 until 20).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", 
> "k")
> val df2 = (0 until 20).map(i => (i % 7, i % 11, i.toString)).toDF("i", "j", 
> "k")
> df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
> df2.write.format("parquet").bucketBy(4, "i").saveAsTable("t2")
> val t1 = spark.table("t1")
> val t2 = spark.table("t2")
> val joined = t1.join(t2, t1("i") === t2("i"))
> joined.explain(true)
> == Physical Plan ==
> *(5) SortMergeJoin [i#44], [i#50], Inner
> :- *(2) Sort [i#44 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(i#44, 200), true, [id=#105]
> :     +- *(1) Project [i#44, j#45, k#46]
> :        +- *(1) Filter isnotnull(i#44)
> :           +- *(1) ColumnarToRow
> :              +- FileScan parquet default.t1[i#44,j#45,k#46] Batched: true, 
> DataFilters: [isnotnull(i#44)], Format: Parquet, Location: 
> InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], 
> ReadSchema: struct, SelectedBucketsCount: 8 out of 8
> +- *(4) Sort [i#50 ASC NULLS FIRST], false, 0
>    +- Exchange hashpartitioning(i#50, 200), true, [id=#115]
>       +- *(3) Project [i#50, j#51, k#52]
>          +- *(3) Filter isnotnull(i#50)
>             +- *(3) ColumnarToRow
>                +- FileScan parquet default.t2[i#50,j#51,k#52] Batched: true, 
> DataFilters: [isnotnull(i#50)], Format: Parquet, Location: 
> InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], 
> ReadSchema: struct, SelectedBucketsCount: 4 out of 4
> {code}
> But one side can be coalesced to eliminate the shuffle.
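A small sketch of why coalescing works here (illustrative arithmetic, not the
proposed implementation): bucket ids come from the same hash modulo the bucket
count, so when one count divides the other, buckets of the larger side can be
grouped to line up with the smaller side.
{code:scala}
// t1 has 8 buckets, t2 has 4. For any key hash h, (h % 8) % 4 == h % 4 because 4 divides 8,
// so reading t1's buckets in pairs reproduces t2's partitioning without an exchange.
val largerNumBuckets = 8
val smallerNumBuckets = 4
require(largerNumBuckets % smallerNumBuckets == 0)

// Which of t1's buckets feed each coalesced bucket: 0 -> {0, 4}, 1 -> {1, 5}, 2 -> {2, 6}, 3 -> {3, 7}
val mapping = (0 until largerNumBuckets).groupBy(_ % smallerNumBuckets)
mapping.toSeq.sortBy(_._1).foreach { case (b, sources) =>
  println(s"coalesced bucket $b reads t1 buckets ${sources.mkString(", ")}")
}
{code}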



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31350) Coalesce bucketed tables for join if applicable

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31350:


Assignee: Apache Spark

> Coalesce bucketed tables for join if applicable
> ---
>
> Key: SPARK-31350
> URL: https://issues.apache.org/jira/browse/SPARK-31350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> The following example of joining two bucketed tables introduces a full 
> shuffle:
> {code:java}
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
> val df1 = (0 until 20).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", 
> "k")
> val df2 = (0 until 20).map(i => (i % 7, i % 11, i.toString)).toDF("i", "j", 
> "k")
> df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
> df2.write.format("parquet").bucketBy(4, "i").saveAsTable("t2")
> val t1 = spark.table("t1")
> val t2 = spark.table("t2")
> val joined = t1.join(t2, t1("i") === t2("i"))
> joined.explain(true)
> == Physical Plan ==
> *(5) SortMergeJoin [i#44], [i#50], Inner
> :- *(2) Sort [i#44 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(i#44, 200), true, [id=#105]
> :     +- *(1) Project [i#44, j#45, k#46]
> :        +- *(1) Filter isnotnull(i#44)
> :           +- *(1) ColumnarToRow
> :              +- FileScan parquet default.t1[i#44,j#45,k#46] Batched: true, 
> DataFilters: [isnotnull(i#44)], Format: Parquet, Location: 
> InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], 
> ReadSchema: struct, SelectedBucketsCount: 8 out of 8
> +- *(4) Sort [i#50 ASC NULLS FIRST], false, 0
>    +- Exchange hashpartitioning(i#50, 200), true, [id=#115]
>       +- *(3) Project [i#50, j#51, k#52]
>          +- *(3) Filter isnotnull(i#50)
>             +- *(3) ColumnarToRow
>                +- FileScan parquet default.t2[i#50,j#51,k#52] Batched: true, 
> DataFilters: [isnotnull(i#50)], Format: Parquet, Location: 
> InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], 
> ReadSchema: struct, SelectedBucketsCount: 4 out of 4
> {code}
> But one side can be coalesced to eliminate the shuffle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096032#comment-17096032
 ] 

Apache Spark commented on SPARK-31519:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/28397

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+--+
> |  b|  fake|
> +---+--+
> |  2|2020-01-01|
> +---+--+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---++
> |  b|fake|
> +---++
> +---++
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts the next round, and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30642) LinearSVC blockify input vectors

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096031#comment-17096031
 ] 

Apache Spark commented on SPARK-30642:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/28349

> LinearSVC blockify input vectors
> 
>
> Key: SPARK-30642
> URL: https://issues.apache.org/jira/browse/SPARK-30642
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30642) LinearSVC blockify input vectors

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30642:


Assignee: zhengruifeng  (was: Apache Spark)

> LinearSVC blockify input vectors
> 
>
> Key: SPARK-30642
> URL: https://issues.apache.org/jira/browse/SPARK-30642
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30642) LinearSVC blockify input vectors

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30642:


Assignee: Apache Spark  (was: zhengruifeng)

> LinearSVC blockify input vectors
> 
>
> Key: SPARK-30642
> URL: https://issues.apache.org/jira/browse/SPARK-30642
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30642) LinearSVC blockify input vectors

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096028#comment-17096028
 ] 

Apache Spark commented on SPARK-30642:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/28349

> LinearSVC blockify input vectors
> 
>
> Key: SPARK-30642
> URL: https://issues.apache.org/jira/browse/SPARK-30642
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l -r)

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096020#comment-17096020
 ] 

Apache Spark commented on SPARK-31586:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28402

> Replace expression TimeSub(l, r) with TimeAdd(l -r)
> ---
>
> Key: SPARK-31586
> URL: https://issues.apache.org/jira/browse/SPARK-31586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.1.0
>
>
> The implementation of TimeSub for subtracting an interval from a timestamp 
> largely duplicates TimeAdd. We can replace it with TimeAdd(l, 
> -r) since the two are equivalent. 
> Suggestion from 
> https://github.com/apache/spark/pull/28310#discussion_r414259239
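For illustration, the equivalence in a spark-shell session (assuming Spark 3.x
interval literal syntax; the values in the comment are the expected output):
{code:scala}
import org.apache.spark.sql.functions.expr

// Subtracting an interval is the same as adding its negation: ts - i == ts + (-i).
val df = spark.sql("SELECT timestamp'2020-01-02 03:04:05' AS ts")
df.select(
  expr("ts - interval 1 day").as("viaSubtraction"),
  expr("ts + (- interval 1 day)").as("viaNegatedAddition")
).show(truncate = false)
// Both columns print 2020-01-01 03:04:05.
{code}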



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l -r)

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096016#comment-17096016
 ] 

Apache Spark commented on SPARK-31586:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28402

> Replace expression TimeSub(l, r) with TimeAdd(l -r)
> ---
>
> Key: SPARK-31586
> URL: https://issues.apache.org/jira/browse/SPARK-31586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.1.0
>
>
> The implementation of TimeSub for subtracting an interval from a timestamp 
> largely duplicates TimeAdd. We can replace it with TimeAdd(l, 
> -r) since the two are equivalent. 
> Suggestion from 
> https://github.com/apache/spark/pull/28310#discussion_r414259239



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096009#comment-17096009
 ] 

Apache Spark commented on SPARK-31549:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/28395

> Pyspark SparkContext.cancelJobGroup do not work correctly
> -
>
> Key: SPARK-31549
> URL: https://issues.apache.org/jira/browse/SPARK-31549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Weichen Xu
>Priority: Critical
>
> PySpark SparkContext.cancelJobGroup does not work correctly. This issue has 
> existed for a long time. It happens because the PySpark thread is not pinned to 
> a JVM thread when invoking Java-side methods, which means every PySpark API 
> that relies on Java thread-local variables does not work correctly 
> (including `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription`, 
> and so on).
> This is a serious issue. There is an experimental PySpark 'PIN_THREAD' mode 
> added in Spark 3.0 which addresses it, but the 'PIN_THREAD' mode has two 
> issues:
> * It is disabled by default; an additional environment variable must be set 
> to enable it.
> * It has a memory leak issue which hasn't been addressed yet.
> A series of projects like hyperopt-spark and spark-joblib rely on the 
> `sc.cancelJobGroup` API (they use it to stop running jobs in their code), so it 
> is critical to address this issue, and we hope it works under the default 
> PySpark mode. An optional approach is implementing methods like 
> `rdd.setGroupAndCollect`.
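For context, a minimal Scala sketch of the JVM-side semantics this API is meant
to mirror: the job group lives in thread-local properties set by the submitting
thread, which is exactly the state that gets lost when Python threads are not
pinned to JVM threads (the group name and the job itself are illustrative):
{code:scala}
import org.apache.spark.sql.SparkSession

val sc = SparkSession.builder().appName("job-group-demo").getOrCreate().sparkContext

// The submitting thread tags its jobs with a group id via thread-local properties.
val runner = new Thread(() => {
  sc.setJobGroup("my-group", "long running job", interruptOnCancel = true)
  try {
    sc.parallelize(1 to 1000000).map { i => Thread.sleep(1); i }.count()
  } catch {
    case _: Exception => () // cancellation surfaces as a SparkException
  }
})
runner.start()
Thread.sleep(2000)
sc.cancelJobGroup("my-group") // cancels only jobs submitted under that group id
runner.join()
{code}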



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31549:


Assignee: (was: Apache Spark)

> Pyspark SparkContext.cancelJobGroup do not work correctly
> -
>
> Key: SPARK-31549
> URL: https://issues.apache.org/jira/browse/SPARK-31549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Weichen Xu
>Priority: Critical
>
> PySpark SparkContext.cancelJobGroup does not work correctly. This issue has 
> existed for a long time. It happens because the PySpark thread is not pinned to 
> a JVM thread when invoking Java-side methods, which means every PySpark API 
> that relies on Java thread-local variables does not work correctly 
> (including `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription`, 
> and so on).
> This is a serious issue. There is an experimental PySpark 'PIN_THREAD' mode 
> added in Spark 3.0 which addresses it, but the 'PIN_THREAD' mode has two 
> issues:
> * It is disabled by default; an additional environment variable must be set 
> to enable it.
> * It has a memory leak issue which hasn't been addressed yet.
> A series of projects like hyperopt-spark and spark-joblib rely on the 
> `sc.cancelJobGroup` API (they use it to stop running jobs in their code), so it 
> is critical to address this issue, and we hope it works under the default 
> PySpark mode. An optional approach is implementing methods like 
> `rdd.setGroupAndCollect`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly

2020-04-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31549:


Assignee: Apache Spark

> Pyspark SparkContext.cancelJobGroup do not work correctly
> -
>
> Key: SPARK-31549
> URL: https://issues.apache.org/jira/browse/SPARK-31549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Critical
>
> PySpark SparkContext.cancelJobGroup does not work correctly. This issue has 
> existed for a long time. It happens because the PySpark thread is not pinned to 
> a JVM thread when invoking Java-side methods, which means every PySpark API 
> that relies on Java thread-local variables does not work correctly 
> (including `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription`, 
> and so on).
> This is a serious issue. There is an experimental PySpark 'PIN_THREAD' mode 
> added in Spark 3.0 which addresses it, but the 'PIN_THREAD' mode has two 
> issues:
> * It is disabled by default; an additional environment variable must be set 
> to enable it.
> * It has a memory leak issue which hasn't been addressed yet.
> A series of projects like hyperopt-spark and spark-joblib rely on the 
> `sc.cancelJobGroup` API (they use it to stop running jobs in their code), so it 
> is critical to address this issue, and we hope it works under the default 
> PySpark mode. An optional approach is implementing methods like 
> `rdd.setGroupAndCollect`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly

2020-04-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096008#comment-17096008
 ] 

Apache Spark commented on SPARK-31549:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/28395

> Pyspark SparkContext.cancelJobGroup do not work correctly
> -
>
> Key: SPARK-31549
> URL: https://issues.apache.org/jira/browse/SPARK-31549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Weichen Xu
>Priority: Critical
>
> PySpark SparkContext.cancelJobGroup does not work correctly. This issue has 
> existed for a long time. It happens because the PySpark thread is not pinned to 
> a JVM thread when invoking Java-side methods, which means every PySpark API 
> that relies on Java thread-local variables does not work correctly 
> (including `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription`, 
> and so on).
> This is a serious issue. There is an experimental PySpark 'PIN_THREAD' mode 
> added in Spark 3.0 which addresses it, but the 'PIN_THREAD' mode has two 
> issues:
> * It is disabled by default; an additional environment variable must be set 
> to enable it.
> * It has a memory leak issue which hasn't been addressed yet.
> A series of projects like hyperopt-spark and spark-joblib rely on the 
> `sc.cancelJobGroup` API (they use it to stop running jobs in their code), so it 
> is critical to address this issue, and we hope it works under the default 
> PySpark mode. An optional approach is implementing methods like 
> `rdd.setGroupAndCollect`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30132) Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses

2020-04-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096007#comment-17096007
 ] 

Dongjoon Hyun commented on SPARK-30132:
---

Thank you, [~tisue].

> Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses
> 
>
> Key: SPARK-30132
> URL: https://issues.apache.org/jira/browse/SPARK-30132
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Priority: Minor
>
> A few classes in our test code extend Hadoop's LocalFileSystem. Scala 2.13 
> returns a compile error here - not for the Spark code, but because the Hadoop 
> code (it says) illegally overrides appendFile() with slightly different 
> generic types in its return value. This code is valid Java, evidently, and 
> the code actually doesn't define any generic types, so, I even wonder if it's 
> a scalac bug.
> So far I don't see a workaround for this.
> This only affects the Hadoop 3.2 build, in that it comes up with respect to a 
> method new in Hadoop 3. (There is actually another instance of a similar 
> problem that affects Hadoop 2, but I can see a tiny hack workaround for it).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26199) Long expressions cause mutate to fail

2020-04-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26199.
--
Resolution: Duplicate

> Long expressions cause mutate to fail
> -
>
> Key: SPARK-26199
> URL: https://issues.apache.org/jira/browse/SPARK-26199
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: João Rafael
>Priority: Minor
>
> Calling {{mutate(df, field = expr)}} fails when expr is very long.
> Example:
> {code:R}
> df <- mutate(df, field = ifelse(
> lit(TRUE),
> lit("A"),
> ifelse(
> lit(T),
> lit("BB"),
> lit("C")
> )
> ))
> {code}
> Stack trace:
> {code:R}
> FATAL subscript out of bounds
>   at .handleSimpleError(function (obj) 
> {
> level = sapply(class(obj), sw
>   at FUN(X[[i]], ...)
>   at lapply(seq_along(args), function(i) {
> if (ns[[i]] != "") {
> at lapply(seq_along(args), function(i) {
> if (ns[[i]] != "") {
> at mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T), lit("BBB
>   at #78: mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T
> {code}
> The root cause is in: 
> [DataFrame.R#LL2182|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2182]
> When the expression is long {{deparse}} returns multiple lines, causing 
> {{args}} to have more elements than {{ns}}. The solution could be to set 
> {{nlines = 1}} or to collapse the lines together.
> A simple workaround exists: first place the expression in a variable 
> and use it instead:
> {code:R}
> tmp <- ifelse(
> lit(TRUE),
> lit("A"),
> ifelse(
> lit(T),
> lit("BB"),
> lit("C")
> )
> )
> df <- mutate(df, field = tmp)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26199) Long expressions cause mutate to fail

2020-04-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096006#comment-17096006
 ] 

Hyukjin Kwon commented on SPARK-26199:
--

Thanks, [~michaelchirico]

> Long expressions cause mutate to fail
> -
>
> Key: SPARK-26199
> URL: https://issues.apache.org/jira/browse/SPARK-26199
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: João Rafael
>Priority: Minor
>
> Calling {{mutate(df, field = expr)}} fails when expr is very long.
> Example:
> {code:R}
> df <- mutate(df, field = ifelse(
> lit(TRUE),
> lit("A"),
> ifelse(
> lit(T),
> lit("BB"),
> lit("C")
> )
> ))
> {code}
> Stack trace:
> {code:R}
> FATAL subscript out of bounds
>   at .handleSimpleError(function (obj) 
> {
> level = sapply(class(obj), sw
>   at FUN(X[[i]], ...)
>   at lapply(seq_along(args), function(i) {
> if (ns[[i]] != "") {
> at lapply(seq_along(args), function(i) {
> if (ns[[i]] != "") {
> at mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T), lit("BBB
>   at #78: mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T
> {code}
> The root cause is in: 
> [DataFrame.R#LL2182|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2182]
> When the expression is long {{deparse}} returns multiple lines, causing 
> {{args}} to have more elements than {{ns}}. The solution could be to set 
> {{nlines = 1}} or to collapse the lines together.
> A simple workaround exists: first place the expression in a variable 
> and use it instead:
> {code:R}
> tmp <- ifelse(
> lit(TRUE),
> lit("A"),
> ifelse(
> lit(T),
> lit("BB"),
> lit("C")
> )
> )
> df <- mutate(df, field = tmp)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31610) Expose hashFunc property in HashingTF

2020-04-29 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-31610:
--

 Summary: Expose hashFunc property in HashingTF
 Key: SPARK-31610
 URL: https://issues.apache.org/jira/browse/SPARK-31610
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: Weichen Xu


Expose hashFunc property in HashingTF

Some third-party libraries, such as MLeap, need to access it.
See the background description here:
https://github.com/combust/mleap/pull/665#issuecomment-621258623
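For context, a small sketch of the hashing behaviour such libraries want to
reproduce outside Spark; `indexOf` is the public 3.0 entry point, while the hash
function behind it is what this ticket proposes to expose:
{code:scala}
import org.apache.spark.ml.feature.HashingTF

// Map a term to its feature index under the transformer's configured hash algorithm.
val tf = new HashingTF().setNumFeatures(1 << 18)
val bucket = tf.indexOf("spark")
println(s"'spark' hashes to feature index $bucket")
{code}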




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31566) Add SQL Rest API Documentation

2020-04-29 Thread Eren Avsarogullari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095948#comment-17095948
 ] 

Eren Avsarogullari commented on SPARK-31566:


Hi Pablo,

There is an ongoing PR on this: https://github.com/apache/spark/pull/28354
Also, it will be updated in light of 
[https://github.com/apache/spark/pull/28208]

Hope these help. Thanks

> Add SQL Rest API Documentation
> --
>
> Key: SPARK-31566
> URL: https://issues.apache.org/jira/browse/SPARK-31566
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> SQL Rest API exposes query execution metrics as Public API. Its documentation 
> will be useful for end-users. 
> {code:java}
> /applications/[app-id]/sql
> 1- A list of all queries for a given application.
> 2- ?details=[true|false (default)] lists metric details in addition to 
> queries details.
> 3- ?offset=[offset]&length=[len] lists queries in the given range.{code}
> {code:java}
> /applications/[app-id]/sql/[execution-id]
> 1- Details for the given query.
> 2- ?details=[true|false (default)] lists metric details in addition to given 
> query details.{code}
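For illustration, a minimal Scala call against the first endpoint (the host,
the default UI port 4040, and the application id are placeholders; the query
parameters follow the description above):
{code:scala}
import scala.io.Source

// Fetch all SQL executions for one application, with per-metric details, first 10 entries.
val appId = "app-20200429120000-0000" // placeholder
val url = s"http://localhost:4040/api/v1/applications/$appId/sql?details=true&offset=0&length=10"
val json = Source.fromURL(url, "UTF-8").mkString
println(json)
{code}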



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30132) Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses

2020-04-29 Thread Seth Tisue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095936#comment-17095936
 ] 

Seth Tisue commented on SPARK-30132:


Scala 2.13.2 is now out.

> Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses
> 
>
> Key: SPARK-30132
> URL: https://issues.apache.org/jira/browse/SPARK-30132
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Priority: Minor
>
> A few classes in our test code extend Hadoop's LocalFileSystem. Scala 2.13 
> returns a compile error here - not for the Spark code, but because the Hadoop 
> code (it says) illegally overrides appendFile() with slightly different 
> generic types in its return value. This code is valid Java, evidently, and 
> the code actually doesn't define any generic types, so, I even wonder if it's 
> a scalac bug.
> So far I don't see a workaround for this.
> This only affects the Hadoop 3.2 build, in that it comes up with respect to a 
> method new in Hadoop 3. (There is actually another instance of a similar 
> problem that affects Hadoop 2, but I can see a tiny hack workaround for it).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31582) Be able to not populate Hadoop classpath

2020-04-29 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31582:

Fix Version/s: 3.0.0

> Be able to not populate Hadoop classpath
> 
>
> Key: SPARK-31582
> URL: https://issues.apache.org/jira/browse/SPARK-31582
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> The Spark YARN client will populate the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`. However, 
> for a Spark build with embedded Hadoop, this can result in jar conflicts because 
> the Spark distribution can contain a different version of the Hadoop jars.
> We are adding a new YARN configuration to not populate the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`.
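For reference, a sketch of opting in to the new behaviour. The configuration key
below, `spark.yarn.populateHadoopClasspath`, is the one introduced by the linked
PR; in practice it is usually passed to spark-submit or set in
spark-defaults.conf rather than in code:
{code:scala}
import org.apache.spark.sql.SparkSession

// With the flag set to false, the YARN client no longer appends the
// yarn.application.classpath / mapreduce.application.classpath entries,
// so the Hadoop jars bundled in the Spark distribution win.
val spark = SparkSession.builder()
  .appName("with-hadoop-build")
  .config("spark.yarn.populateHadoopClasspath", "false")
  .getOrCreate()
{code}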



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31582) Being able to not populate Hadoop classpath

2020-04-29 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31582:

Summary: Being able to not populate Hadoop classpath  (was: Be able to not 
populate Hadoop classpath)

> Being able to not populate Hadoop classpath
> ---
>
> Key: SPARK-31582
> URL: https://issues.apache.org/jira/browse/SPARK-31582
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> The Spark YARN client will populate the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`. However, 
> for a Spark build with embedded Hadoop, this can result in jar conflicts because 
> the Spark distribution can contain a different version of the Hadoop jars.
> We are adding a new YARN configuration to not populate the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31582) Be able to not populate Hadoop classpath

2020-04-29 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31582.
-
Fix Version/s: 2.4.6
   Resolution: Fixed

Issue resolved by pull request 28376
[https://github.com/apache/spark/pull/28376]

> Be able to not populate Hadoop classpath
> 
>
> Key: SPARK-31582
> URL: https://issues.apache.org/jira/browse/SPARK-31582
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.6
>
>
> The Spark YARN client will populate the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`. However, 
> for a Spark build with embedded Hadoop, this can result in jar conflicts because 
> the Spark distribution can contain a different version of the Hadoop jars.
> We are adding a new YARN configuration to not populate the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31582) Be able to not populate Hadoop classpath

2020-04-29 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-31582:
---

Assignee: DB Tsai

> Be able to not populate Hadoop classpath
> 
>
> Key: SPARK-31582
> URL: https://issues.apache.org/jira/browse/SPARK-31582
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> The Spark YARN client will populate the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`. However, 
> for a Spark build with embedded Hadoop, this can result in jar conflicts because 
> the Spark distribution can contain a different version of the Hadoop jars.
> We are adding a new YARN configuration to not populate the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31566) Add SQL Rest API Documentation

2020-04-29 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095881#comment-17095881
 ] 

Pablo Langa Blanco commented on SPARK-31566:


I'm taking a look at this. I think it is useful to add this documentation.

> Add SQL Rest API Documentation
> --
>
> Key: SPARK-31566
> URL: https://issues.apache.org/jira/browse/SPARK-31566
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> SQL Rest API exposes query execution metrics as Public API. Its documentation 
> will be useful for end-users. 
> {code:java}
> /applications/[app-id]/sql
> 1- A list of all queries for a given application.
> 2- ?details=[true|false (default)] lists metric details in addition to 
> queries details.
> 3- ?offset=[offset]&length=[len] lists queries in the given range.{code}
> {code:java}
> /applications/[app-id]/sql/[execution-id]
> 1- Details for the given query.
> 2- ?details=[true|false (default)] lists metric details in addition to given 
> query details.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive

2020-04-29 Thread Divya Paliwal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divya Paliwal resolved SPARK-31604.
---
Resolution: Won't Do

> java.lang.IllegalArgumentException: Frame length should be positive
> ---
>
> Key: SPARK-31604
> URL: https://issues.apache.org/jira/browse/SPARK-31604
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.4.4
> Environment: Scala version 2.11.12
> Spark version 2.4.4
>Reporter: Divya Paliwal
>Priority: Major
> Fix For: 2.4.4
>
>
> Hi,
> I am currently facing the below error when I run my code to stream data from 
> Couchbase in spark cluster.
> 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
> connection from /[host]:56910 
> java.lang.IllegalArgumentException: Frame length should be positive: 
> -9223371863711366549
>  at 
> org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
>  at 
> org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
>  at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>  at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>  at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
>  at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>  at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
>  at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
>  at java.lang.Thread.run(Thread.java:748)
>  
> Whenever I run the spark-submit command with all the arguments on the host I 
> get the above error.
>   
> Thanks,
> Divya



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31608) Add a hybrid KVStore to make UI loading faster

2020-04-29 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-31608:

Description: 
This is a follow-up for the work done by Hieu Huynh in 2019.

Add a new class HybridKVStore to make the history server faster when loading 
event files. When writing to this KVStore, it first writes to an in-memory 
store and has a background thread that keeps pushing the changes to LevelDB.

I ran some tests on 3.0.1 on macOS:
||kvstore type / log size||100m||200m||500m||1g||2g||
|HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
leveldb|
|LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|

 

  was:
Add a new class HybridKVStore to make the history server faster when loading 
event files. When writing to this KVStore, it first writes to an in-memory 
store and has a background thread that keeps pushing the changes to LevelDB.

I ran some tests on 3.0.1 on macOS:
||kvstore type / log size||100m||200m||500m||1g||2g||
|HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
leveldb|
|LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|

 


> Add a hybrid KVStore to make UI loading faster
> --
>
> Key: SPARK-31608
> URL: https://issues.apache.org/jira/browse/SPARK-31608
> Project: Spark
>  Issue Type: Story
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Baohe Zhang
>Priority: Major
>
> This is a follow-up for the work done by Hieu Huynh in 2019.
> Add a new class HybridKVStore to make the history server faster when loading 
> event files. When writing to this KVStore, it first writes to an 
> in-memory store and has a background thread that keeps pushing the changes 
> to LevelDB.
> I ran some tests on 3.0.1 on macOS:
> ||kvstore type / log size||100m||200m||500m||1g||2g||
> |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
> leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
> leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
> leveldb|
> |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|
>  
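To make the write path concrete, a minimal sketch of the hybrid idea (class and
method names are illustrative, not the proposed HybridKVStore API): writes land
in an in-memory map immediately while a daemon thread drains them to the slower
persistent store.
{code:scala}
import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue}

// In-memory first, persistence in the background: reads stay fast while parsing,
// and the persistent store catches up asynchronously.
class HybridStore[K, V](persist: (K, V) => Unit) {
  private val memory  = new ConcurrentHashMap[K, V]()
  private val pending = new LinkedBlockingQueue[(K, V)]()

  private val writer = new Thread(() => {
    while (true) {
      val (k, v) = pending.take() // blocks until there is something to flush
      persist(k, v)               // e.g. a LevelDB write in the real proposal
    }
  })
  writer.setDaemon(true)
  writer.start()

  def put(k: K, v: V): Unit = { memory.put(k, v); pending.put((k, v)) }
  def get(k: K): Option[V]  = Option(memory.get(k))
}
{code}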



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31609) Add VarianceThresholdSelector to PySpark

2020-04-29 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095749#comment-17095749
 ] 

Huaxin Gao commented on SPARK-31609:


https://github.com/apache/spark/pull/28409

> Add VarianceThresholdSelector to PySpark
> 
>
> Key: SPARK-31609
> URL: https://issues.apache.org/jira/browse/SPARK-31609
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add VarianceThresholdSelector to PySpark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive

2020-04-29 Thread Divya Paliwal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divya Paliwal updated SPARK-31604:
--
Description: 
Hi,

I am currently facing the below error when I run my code to stream data from 
Couchbase in spark cluster.

2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
connection from /[host]:56910 

java.lang.IllegalArgumentException: Frame length should be positive: 
-9223371863711366549
 at 
org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
 at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 at java.lang.Thread.run(Thread.java:748)

 

Whenever I run the spark-submit command with all the arguments on the host I 
get the above error.

  

Thanks,

Divya

  was:
Hi,

I am currently facing the below error when I run my code to stream data from 
Couchbase in spark cluster.

2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
connection from /[host]:56910 

java.lang.IllegalArgumentException: Frame length should be positive: 
-9223371863711366549
 at 
org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
 at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 at java.lang.Thread.run(Thread.java:748)

 

Whenever I run the spark-submit command with all the arguments on the host I 
get the above error.

 

Command run:

bin/spark-submit \

  --deploy-mode cluster \

  --class "com.CouchbaseRawMain" \

  --master [spark://master-host:8091] \

  --jars 

[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive

2020-04-29 Thread Divya Paliwal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divya Paliwal updated SPARK-31604:
--
Description: 
Hi,

I am currently facing the below error when I run my code to stream data from 
Couchbase in a Spark cluster.

2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
connection from /[host]:56910 

java.lang.IllegalArgumentException: Frame length should be positive: 
-9223371863711366549
 at 
org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
 at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 at java.lang.Thread.run(Thread.java:748)

 

Whenever I run the spark-submit command with all the arguments on the host, I 
get the above error.

 

Command run:

bin/spark-submit \

  --deploy-mode cluster \

  --class "com.CouchbaseRawMain" \

  --master [spark://master-host:8091] \

  --jars 
/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/core-io-1.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/java-client-2.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxjava-1.3.8.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/couchbase-spark-connector_2.11-2.4.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/dcp-client-0.23.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/opentracing-api-0.31.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxscala_2.11-0.27.0.jar
 \

  --conf spark.rpc.message.maxSize=1000 --conf 
spark.shuffle.service.enabled=true --conf 
spark.network.crypto.saslFallback=true --conf spark.authenticate=true --conf 
spark.network.crypto.enabled=true

  --driver-memory 3g --driver-cores 2 --num-executors 1 --executor-memory 3g 
--total-executor-cores 2 --executor-cores 2 \

  /ngs/app/dcf5d/HD-cluster/spark/code/generic-couchbase-raw-0.1.0-SNAPSHOT.jar 
[spark://master-host:8091] Couchbase-DC-Bucket-Raw-QA-OG-28 
host1:8091,host2:8091 DC 
[hdfs://master-host:9000/tables/couchbase/QA/DC-bucket/] Administrator dcadmin 
welcome true
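
For reference, here is the same set of --conf flags from the command above expressed as a SparkConf in Scala; the values simply mirror the report and are illustrative, not a recommendation (and spark.authenticate=true normally also needs a shared secret configured).

{code:java}
import org.apache.spark.SparkConf

// The --conf flags from the spark-submit command above, set programmatically.
// Values mirror the report; they are not tuning advice.
val conf = new SparkConf()
  .set("spark.rpc.message.maxSize", "1000")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.network.crypto.saslFallback", "true")
  .set("spark.authenticate", "true")
  .set("spark.network.crypto.enabled", "true")
{code}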

 

Thanks,

Divya

  was:
Hi, 

I am currently facing the below error when I run my code to stream data from 
Couchbase in a Spark cluster.

2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
connection from /[host]:56910 

java.lang.IllegalArgumentException: Frame length should be positive: 
-9223371863711366549
 at 
org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
 at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 

[jira] [Created] (SPARK-31609) Add VarianceThresholdSelector to PySpark

2020-04-29 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31609:
--

 Summary: Add VarianceThresholdSelector to PySpark
 Key: SPARK-31609
 URL: https://issues.apache.org/jira/browse/SPARK-31609
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Add VarianceThresholdSelector to PySpark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive

2020-04-29 Thread Divya Paliwal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divya Paliwal updated SPARK-31604:
--
Description: 
Hi, 

I am currently facing the below error when I run my code to stream data from 
Couchbase in a Spark cluster.

2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
connection from /[host]:56910 

java.lang.IllegalArgumentException: Frame length should be positive: 
-9223371863711366549
 at 
org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
 at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 at java.lang.Thread.run(Thread.java:748)

 

Whenever I run the spark-submit command with all the arguments on the host, I 
get the above error.

 

Command run:

bin/spark-submit \

  --deploy-mode cluster \

  --class "com.pst.dc.CouchbaseRawMain" \

  --master [spark://master-host:8091] \

  --jars 
/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/core-io-1.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/java-client-2.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxjava-1.3.8.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/couchbase-spark-connector_2.11-2.4.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/dcp-client-0.23.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/opentracing-api-0.31.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxscala_2.11-0.27.0.jar
 \

  --conf spark.rpc.message.maxSize=1000 --conf 
spark.shuffle.service.enabled=true --conf 
spark.network.crypto.saslFallback=true --conf spark.authenticate=true --conf 
spark.network.crypto.enabled=true

  --driver-memory 3g --driver-cores 2 --num-executors 1 --executor-memory 3g 
--total-executor-cores 2 --executor-cores 2 \

  /ngs/app/dcf5d/HD-cluster/spark/code/generic-couchbase-raw-0.1.0-SNAPSHOT.jar 
[spark://master-host:8091] Couchbase-DC-Bucket-Raw-QA-OG-28 
host1:8091,host2:8091 DC 
[hdfs://master-host:9000/tables/couchbase/QA/DC-bucket/] Administrator dcadmin 
welcome true

 

Thanks,

Divya

  was:
Hi, 

I am currently facing the below error when I run my code to stream data from 
Couchbase in a Spark cluster.

2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
connection from /[host]:56910 

java.lang.IllegalArgumentException: Frame length should be positive: 
-9223371863711366549
 at 
org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
 at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 

[jira] [Created] (SPARK-31608) Add a hybrid KVStore to make UI loading faster

2020-04-29 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-31608:
---

 Summary: Add a hybrid KVStore to make UI loading faster
 Key: SPARK-31608
 URL: https://issues.apache.org/jira/browse/SPARK-31608
 Project: Spark
  Issue Type: Story
  Components: Web UI
Affects Versions: 3.0.1
Reporter: Baohe Zhang


Add a new class HybridKVStore to make the history server faster when loading 
event files. When writing to this KVStore, it first writes to an in-memory 
store, with a background thread that keeps pushing the changes to LevelDB.
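
For readers unfamiliar with the pattern, below is a minimal, self-contained Scala sketch of the write path described above. It does not use Spark's actual KVStore interfaces; SlowStore, HybridStore, and the flusher thread are illustrative names only.

{code:java}
import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue}

// Illustrative stand-in for the slow persistent store (LevelDB in the proposal).
trait SlowStore[K, V] {
  def write(key: K, value: V): Unit
}

// Writes land in memory first; a background thread drains them to the slow store.
class HybridStore[K, V](backing: SlowStore[K, V]) {
  private val memory  = new ConcurrentHashMap[K, V]()
  private val pending = new LinkedBlockingQueue[(K, V)]()

  private val flusher = new Thread(new Runnable {
    override def run(): Unit = {
      while (true) {
        val (k, v) = pending.take() // blocks until a queued change is available
        backing.write(k, v)
      }
    }
  })
  flusher.setDaemon(true)
  flusher.start()

  def write(key: K, value: V): Unit = {
    memory.put(key, value)     // fast path: immediately visible to readers
    pending.put((key, value))  // queued for asynchronous persistence
  }

  def read(key: K): Option[V] = Option(memory.get(key))
}
{code}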

I ran some tests on 3.0.1 on macOS:
||kvstore type / log size||100m||200m||500m||1g||2g||
|HybridKVStore|5s to parse, 7s (including the parsing time) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb|
|LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31560) Add V1/V2 tests for TextSuite and WholeTextFileSuite

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31560:
--
Affects Version/s: (was: 3.0.1)

> Add V1/V2 tests for TextSuite and WholeTextFileSuite
> 
>
> Key: SPARK-31560
> URL: https://issues.apache.org/jira/browse/SPARK-31560
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction

2020-04-29 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095697#comment-17095697
 ] 

Felix Kizhakkel Jose commented on SPARK-31599:
--

Thank you [~gsomogyi]. But this is not an S3 issue. The issue is that I have 
compacted files in the bucket and deleted the non-compacted files, but did not 
update/modify the "_spark_metadata" folder. I can see that those write-ahead 
log JSON files still contain the deleted file names, and when I use Spark SQL 
to read the data, it first reads the write-ahead logs from "_spark_metadata" 
and then tries to read the files listed in them. So I am wondering how we can 
update the "_spark_metadata" content (the write-ahead logs)?
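
As a side note, a minimal spark-shell sketch for checking which files the sink's metadata log still references is below; it assumes the default file-sink layout (one JSON entry per committed file under _spark_metadata), and the bucket path is only a placeholder taken from the issue.

{code:java}
// Run in spark-shell, where `spark` is the active SparkSession.
import spark.implicits._

// Read the streaming file sink's metadata log as plain text and list the file
// paths it records; deleted files that still appear here will trigger
// FileNotFoundException on read.
val metadataLog = spark.read.text("s3a://spark-kafka-poc/intermediate/_spark_metadata/*")
metadataLog.filter($"value".contains("\"path\"")).show(false)
{code}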

> Reading from S3 (Structured Streaming Bucket) Fails after Compaction
> 
>
> Key: SPARK-31599
> URL: https://issues.apache.org/jira/browse/SPARK-31599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have an S3 bucket to which data is streamed (in Parquet format) by the Spark 
> Structured Streaming framework from Kafka. Periodically I run compaction on 
> this bucket (a separate Spark job) and, on successful compaction, delete the 
> non-compacted (Parquet) files. After that, I get the following error on Spark 
> jobs that read from that bucket:
>  *Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*
> How do we run compaction on Structured Streaming S3 buckets? Also, I need 
> to delete the un-compacted files after successful compaction to save space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095595#comment-17095595
 ] 

Dongjoon Hyun commented on SPARK-31519:
---

For the record, from 2.2.3 through 2.4.5, the following queries can be used to see the 
wrong result. (2.0.2 and 2.1.3 have no problem with the following queries.)
{code}
spark-sql> SELECT SUM(a) AS b, hour('2020-01-01 12:12:12') AS fake FROM VALUES 
(1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10;
Time taken: 3.249 seconds
spark-sql> SELECT SUM(a) AS b, '2020-01-01 12:12:12' AS fake FROM VALUES (1, 
10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10;
2   2020-01-01 12:12:12
Time taken: 0.505 seconds, Fetched 1 row(s)
{code}

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+--+
> |  b|  fake|
> +---+--+
> |  2|2020-01-01|
> +---+--+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---++
> |  b|fake|
> +---++
> +---++
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts next round and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Affects Version/s: 2.2.3

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+--+
> |  b|  fake|
> +---+--+
> |  2|2020-01-01|
> +---+--+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---++
> |  b|fake|
> +---++
> +---++
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts next round and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Priority: Blocker  (was: Major)

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+--+
> |  b|  fake|
> +---+--+
> |  2|2020-01-01|
> +---+--+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---++
> |  b|fake|
> +---++
> +---++
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts next round and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Affects Version/s: 2.3.4

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+--+
> |  b|  fake|
> +---+--+
> |  2|2020-01-01|
> +---+--+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---++
> |  b|fake|
> +---++
> +---++
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts next round and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31607) fix perf regression in CTESubstitution

2020-04-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31607:

Affects Version/s: (was: 3.0.0)
   3.1.0

> fix perf regression in CTESubstitution
> --
>
> Key: SPARK-31607
> URL: https://issues.apache.org/jira/browse/SPARK-31607
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31607) fix perf regression in CTESubstitution

2020-04-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31607:

Issue Type: Improvement  (was: Bug)

> fix perf regression in CTESubstitution
> --
>
> Key: SPARK-31607
> URL: https://issues.apache.org/jira/browse/SPARK-31607
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31607) Improve the perf of CTESubstitution

2020-04-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31607:

Summary: Improve the perf of CTESubstitution  (was: fix perf regression in 
CTESubstitution)

> Improve the perf of CTESubstitution
> ---
>
> Key: SPARK-31607
> URL: https://issues.apache.org/jira/browse/SPARK-31607
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095594#comment-17095594
 ] 

Dongjoon Hyun commented on SPARK-31519:
---

This is backported to branch-2.4 via https://github.com/apache/spark/pull/28397

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+--+
> |  b|  fake|
> +---+--+
> |  2|2020-01-01|
> +---+--+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---++
> |  b|fake|
> +---++
> +---++
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts next round and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Fix Version/s: 2.4.6

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+--+
> |  b|  fake|
> +---+--+
> |  2|2020-01-01|
> +---+--+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---++
> |  b|fake|
> +---++
> +---++
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts next round and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result

2020-04-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
--
Affects Version/s: 2.4.5

> Cast in having aggregate expressions returns the wrong result
> -
>
> Key: SPARK-31519
> URL: https://issues.apache.org/jira/browse/SPARK-31519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.6, 3.0.0
>
>
> Cast in having aggregate expressions returns the wrong result.
> See the below tests: 
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>  | select sum(a) as b, '2020-01-01' as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---+--+
> |  b|  fake|
> +---+--+
> |  2|2020-01-01|
> +---+--+
> scala> val query = """
>  | select sum(a) as b, cast('2020-01-01' as date) as fake
>  | from t
>  | group by b
>  | having b > 10;"""
> scala> spark.sql(query).show()
> +---++
> |  b|fake|
> +---++
> +---++
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases in a very tricky way as it relies on rule execution 
> order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This stops resolving the Filter operator as the 
> child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate 
> operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I put a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 as the Aggregate operator is unresolved at that time. Then the analyzer 
> starts next round and the Filter operator is resolved by ResolveReferences, 
> which wrongly resolves the grouping columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive

2020-04-29 Thread Divya Paliwal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divya Paliwal updated SPARK-31604:
--
Description: 
Hi, 

I am currently facing the below error when I run my code to stream data from 
Couchbase in a Spark cluster.

2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
connection from /[host]:56910 

java.lang.IllegalArgumentException: Frame length should be positive: 
-9223371863711366549
 at 
org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
 at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 at java.lang.Thread.run(Thread.java:748)

 

Whenever I run the spark-submit command with all the arguments on the host, I 
get the above error.

 

Command run:

bin/spark-submit \

  --deploy-mode cluster \

  --class "com.apple.pst.dc.CouchbaseRawMain" \

  --master [spark://master-host:8091] \

  --jars 
/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/core-io-1.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/java-client-2.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxjava-1.3.8.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/couchbase-spark-connector_2.11-2.4.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/dcp-client-0.23.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/opentracing-api-0.31.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxscala_2.11-0.27.0.jar
 \

  --conf spark.rpc.message.maxSize=1000 --conf 
spark.shuffle.service.enabled=true --conf 
spark.network.crypto.saslFallback=true --conf spark.authenticate=true --conf 
spark.network.crypto.enabled=true

  --driver-memory 3g --driver-cores 2 --num-executors 1 --executor-memory 3g 
--total-executor-cores 2 --executor-cores 2 \

  /ngs/app/dcf5d/HD-cluster/spark/code/generic-couchbase-raw-0.1.0-SNAPSHOT.jar 
[spark://master-host:8091] Couchbase-DC-Bucket-Raw-QA-OG-28 
host1:8091,host2:8091 DC 
[hdfs://master-host:9000/tables/couchbase/QA/DC-bucket/] Administrator dcadmin 
welcome true

 

Thanks,

Divya

  was:
Hi, 

I am currently facing the below error when I run my code to stream data from 
Couchbase in a Spark cluster.

2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
connection from /[host]:56910 

java.lang.IllegalArgumentException: Frame length should be positive: 
-9223371863711366549
 at 
org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
 at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 

[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive

2020-04-29 Thread Divya Paliwal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divya Paliwal updated SPARK-31604:
--
Description: 
Hi, 

I am currently facing the below error when I run my code to stream data from 
Couchbase in a Spark cluster.

2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
connection from /[host]:56910 

java.lang.IllegalArgumentException: Frame length should be positive: 
-9223371863711366549
 at 
org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
 at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 at java.lang.Thread.run(Thread.java:748)

 

Whenever I run the spark-submit command with all the arguments on the host, I 
get the above error.

 

Command run:

bin/spark-submit \

  --deploy-mode cluster \

  --class "com.apple.pst.dc.CouchbaseRawMain" \

  --master [spark://rn-dcf5t-lapp85.rno.apple.com:8091] \

  --jars 
/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/core-io-1.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/java-client-2.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxjava-1.3.8.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/couchbase-spark-connector_2.11-2.4.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/dcp-client-0.23.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/opentracing-api-0.31.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxscala_2.11-0.27.0.jar
 \

  --conf spark.rpc.message.maxSize=1000 --conf 
spark.shuffle.service.enabled=true --conf 
spark.network.crypto.saslFallback=true --conf spark.authenticate=true --conf 
spark.network.crypto.enabled=true 

  --driver-memory 3g --driver-cores 2 --num-executors 1 --executor-memory 3g 
--total-executor-cores 2 --executor-cores 2 \

  /ngs/app/dcf5d/HD-cluster/spark/code/generic-couchbase-raw-0.1.0-SNAPSHOT.jar 
[spark://rn-dcf5t-lapp85.rno.apple.com:8091] Couchbase-DC-Bucket-Raw-QA-OG-28 
rn-dcf5t-lapp87.rno.apple.com:8091,rn-dcf5t-lapp88.rno.apple.com:8091 DC 
[hdfs://rn-dcf5t-lapp85.rno.apple.com:9000/tables/couchbase/QA/DC-bucket/] 
Administrator dcadmin welcome true

 

Thanks,

Divya

  was:
I am currently facing the below error when I run my code streaming data from 
Couchbase in a Spark master cluster.

 

2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
connection from /[host]:56910 

java.lang.IllegalArgumentException: Frame length should be positive: 
-9223371863711366549
 at 
org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
 at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
 at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
 at 

[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive

2020-04-29 Thread Divya Paliwal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divya Paliwal updated SPARK-31604:
--
Environment: 
Scala version 2.11.12

Spark version 2.4.4

> java.lang.IllegalArgumentException: Frame length should be positive
> ---
>
> Key: SPARK-31604
> URL: https://issues.apache.org/jira/browse/SPARK-31604
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.4.4
> Environment: Scala version 2.11.12
> Spark version 2.4.4
>Reporter: Divya Paliwal
>Priority: Major
> Fix For: 2.4.4
>
>
> I am currently facing the below error when I run my code streaming data 
> from Couchbase in a Spark master cluster.
>  
> 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in 
> connection from /[host]:56910 
> java.lang.IllegalArgumentException: Frame length should be positive: 
> -9223371863711366549
>  at 
> org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
>  at 
> org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
>  at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>  at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>  at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
>  at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>  at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
>  at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
>  at java.lang.Thread.run(Thread.java:748)
>  
> Whenever I run the spark-submit command with all the arguments on the host, I 
> get the above error.
>  
> Thanks,
> Divya



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31607) fix perf regression in CTESubstitution

2020-04-29 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31607:
---

 Summary: fix perf regression in CTESubstitution
 Key: SPARK-31607
 URL: https://issues.apache.org/jira/browse/SPARK-31607
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31606) reduce the perf regression of vectorized parquet reader caused by datetime rebase

2020-04-29 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31606:
---

 Summary: reduce the perf regression of vectorized parquet reader 
caused by datetime rebase
 Key: SPARK-31606
 URL: https://issues.apache.org/jira/browse/SPARK-31606
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31605) Unable to insert data with partial dynamic partition with Spark & Hive 3

2020-04-29 Thread Amit Ashish (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Ashish updated SPARK-31605:

Description: 
When inserting data with dynamic partitions, the operation fails if 
not all of the partitions are dynamic. For example:

 
{code:sql}
create external table test_insert(a int) partitioned by (part_a string, part_b 
string) stored as parquet location '';
 
{code}
The query
{code:sql}
insert into table test_insert partition(part_a='a', part_b) values (3, 'b');
{code}
fails with the errors
{code:xml}
Cannot create partition spec from hdfs:/// ; missing keys [part_a]
Ignoring invalid DP directory 
{code}
 

 

 

On the other hand, if I remove the static value of part_a to make the insert 
fully dynamic, the following query succeeds. Please note that the query below is 
not the issue; the issue is the one above, where the query throws the invalid DP 
directory warning.
{code:sql}
insert into table test_insert partition(part_a, part_b) values (1,'a','b');
{code}

  was:
When inserting data with dynamic partitions, the operation fails if 
not all of the partitions are dynamic. For example:

The query
{code:sql}
insert into table test_insert partition(part_a='a', part_b) values (3, 'b');
{code}
fails with the errors
{code:xml}
Cannot create partition spec from hdfs:/// ; missing keys [part_a]
Ignoring invalid DP directory 
{code}
On the other hand, if I remove the static value of part_a to make the insert 
fully dynamic, the following query succeeds.
{code:sql}
insert overwrite table t1 (part_a, part_b) select * from t2
{code}


> Unable to insert data with partial dynamic partition with Spark & Hive 3
> 
>
> Key: SPARK-31605
> URL: https://issues.apache.org/jira/browse/SPARK-31605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Hortonwork HDP 3.1.0
> Spark 2.3.2
> Hive 3
>Reporter: Amit Ashish
>Priority: Major
>
> When inserting data with dynamic partitions, the operation fails if 
> not all of the partitions are dynamic. For example:
>  
> {code:sql}
> create external table test_insert(a int) partitioned by (part_a string, 
> part_b string) stored as parquet location '';
>  
> {code}
> The query
> {code:sql}
> insert into table test_insert partition(part_a='a', part_b) values (3, 'b');
> {code}
> fails with the errors
> {code:xml}
> Cannot create partition spec from hdfs:/// ; missing keys [part_a]
> Ignoring invalid DP directory 
> {code}
>  
>  
>  
> On the other hand, if I remove the static value of part_a to make the insert 
> fully dynamic, the following query succeeds. Please note that the query below 
> is not the issue; the issue is the one above, where the query throws the 
> invalid DP directory warning.
> {code:sql}
> insert into table test_insert partition(part_a, part_b) values (1,'a','b');
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31605) Unable to insert data with partial dynamic partition with Spark & Hive 3

2020-04-29 Thread Amit Ashish (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095462#comment-17095462
 ] 

Amit Ashish commented on SPARK-31605:
-

The previously closed ticket does not show the actual insert statement working.

 

Below is the query that is not working:

insert into table test_insert partition(part_a='a', part_b) values (3, 'b');

 

I am getting the below error:

 

WARN FileOperations: Ignoring invalid DP directory 
hdfs://HDP3/warehouse/tablespace/external/hive/dw_analyst.db/test_insert/.hive-staging_hive_2020-04-29_13-28-46_360_4646016571504464856-1/-ext-1/part_b=b
20/04/29 13:28:52 INFO Hive: Loaded 0 partitions

 

As mentioned in the previous ticket, setting the following does not make any difference:

 

set hive.exec.dynamic.partition.mode=nonstrict;

 

Setting spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict as a Spark 
config does not solve this either.
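
For clarity, a minimal spark-shell sketch of how the two settings mentioned above are usually applied is below (assuming a Hive-enabled SparkSession); per this report, neither changes the behaviour in this environment.

{code:java}
import org.apache.spark.sql.SparkSession

// Hive-enabled session, with the Hadoop-side setting passed through spark.hadoop.*
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()

// Session-level equivalent of the SET statement quoted above.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// The insert from the report: static part_a, dynamic part_b.
spark.sql("insert into table test_insert partition(part_a='a', part_b) values (3, 'b')")
{code}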

 

 

 

 

 

 

> Unable to insert data with partial dynamic partition with Spark & Hive 3
> 
>
> Key: SPARK-31605
> URL: https://issues.apache.org/jira/browse/SPARK-31605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Hortonwork HDP 3.1.0
> Spark 2.3.2
> Hive 3
>Reporter: Amit Ashish
>Priority: Major
>
> When inserting data with dynamic partitions, the operation fails if 
> not all of the partitions are dynamic. For example:
> The query
> {code:sql}
> insert overwrite table t1 (part_a='a', part_b) select * from t2
> {code}
> fails with the errors
> {code:xml}
> Cannot create partition spec from hdfs:/// ; missing keys [part_a]
> Ignoring invalid DP directory 
> {code}
> On the other hand, if I remove the static value of part_a to make the insert 
> fully dynamic, the following query succeeds.
> {code:sql}
> insert overwrite table t1 (part_a, part_b) select * from t2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31605) Unable to insert data with partial dynamic partition with Spark & Hive 3

2020-04-29 Thread Amit Ashish (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Ashish updated SPARK-31605:

Description: 
When inserting data with dynamic partitions, the operation fails unless all
partitions are dynamic. For example:

The query
{code:sql}
insert into table test_insert partition(part_a='a', part_b) values (3, 'b');
{code}
will fail with errors
{code:xml}
Cannot create partition spec from hdfs:/// ; missing keys [part_a]
Ignoring invalid DP directory 
{code}
On the other hand, if I remove the static value of part_a to make the insert
fully dynamic, the following query will succeed.
{code:sql}
insert overwrite table t1 (part_a, part_b) select * from t2
{code}

  was:
When inserting data with dynamic partitions, the operation fails unless all
partitions are dynamic. For example:

The query
{code:sql}
insert overwrite table t1 (part_a='a', part_b) select * from t2
{code}
will fail with errors
{code:xml}
Cannot create partition spec from hdfs:/// ; missing keys [part_a]
Ignoring invalid DP directory 
{code}
On the other hand, if I remove the static value of part_a to make the insert
fully dynamic, the following query will succeed.
{code:sql}
insert overwrite table t1 (part_a, part_b) select * from t2
{code}


> Unable to insert data with partial dynamic partition with Spark & Hive 3
> 
>
> Key: SPARK-31605
> URL: https://issues.apache.org/jira/browse/SPARK-31605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Hortonworks HDP 3.1.0
> Spark 2.3.2
> Hive 3
>Reporter: Amit Ashish
>Priority: Major
>
> When inserting data with dynamic partitions, the operation fails unless all
> partitions are dynamic. For example:
> The query
> {code:sql}
> insert into table test_insert partition(part_a='a', part_b) values (3, 'b');
> {code}
> will fail with errors
> {code:xml}
> Cannot create partition spec from hdfs:/// ; missing keys [part_a]
> Ignoring invalid DP directory 
> {code}
> On the other hand, if I remove the static value of part_a to make the insert
> fully dynamic, the following query will succeed.
> {code:sql}
> insert overwrite table t1 (part_a, part_b) select * from t2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31605) Unable to insert data with partial dynamic partition with Spark & Hive 3

2020-04-29 Thread Amit Ashish (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095462#comment-17095462
 ] 

Amit Ashish edited comment on SPARK-31605 at 4/29/20, 1:42 PM:
---

The previously closed ticket does not show the actual insert statement working.

 

Below is the query that is not working:

insert into table test_insert partition(part_a='a', part_b) values (3, 'b');

 

The following warning is logged:

 

WARN FileOperations: Ignoring invalid DP directory 
hdfs://HDP3/warehouse/tablespace/external/hive/dw_analyst.db/test_insert/.hive-staging_hive_2020-04-29_13-28-46_360_4646016571504464856-1/-ext-1/part_b=b
 20/04/29 13:28:52 INFO Hive: Loaded 0 partitions

 

As mentioned in the previous ticket, the setting below does not make any difference:

 

set hive.exec.dynamic.partition.mode=nonstrict;

 

Setting spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict as a Spark
config does not solve this either.

 

 

The worst part is that the data does not get inserted and the return code is
still 0. Kindly either suggest a fix for this or return a non-zero exit code so
that the failure can be tracked in automated data pipelines.
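
Until this is fixed, a minimal sketch of how a pipeline could detect the silent
no-op by checking the target partition after the insert; this is an illustrative
workaround on my part, not something from the original report, and the partition
values 'a' and 'b' are the example ones used above:

{code:scala}
// Assumes the Hive-enabled SparkSession `spark` and the test_insert table above.
val written = spark.sql(
  "select count(*) from test_insert where part_a = 'a' and part_b = 'b'"
).first().getLong(0)

// Fail the job explicitly so the scheduler sees a non-zero exit code.
require(written > 0, "insert into test_insert wrote no rows for part_a='a', part_b='b'")
{code}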


was (Author: dreamaaj):
The previously closed ticket does not show the actual insert statement working.

 

Below is the query that is not working:

insert into table test_insert partition(part_a='a', part_b) values (3, 'b');

 

The following error is logged:

 

WARN FileOperations: Ignoring invalid DP directory 
hdfs://HDP3/warehouse/tablespace/external/hive/dw_analyst.db/test_insert/.hive-staging_hive_2020-04-29_13-28-46_360_4646016571504464856-1/-ext-1/part_b=b
20/04/29 13:28:52 INFO Hive: Loaded 0 partitions

 

As mentioned in the previous ticket, the setting below does not make any difference:

 

set hive.exec.dynamic.partition.mode=nonstrict;

 

Setting spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict as a Spark
config does not solve this either.

> Unable to insert data with partial dynamic partition with Spark & Hive 3
> 
>
> Key: SPARK-31605
> URL: https://issues.apache.org/jira/browse/SPARK-31605
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Hortonworks HDP 3.1.0
> Spark 2.3.2
> Hive 3
>Reporter: Amit Ashish
>Priority: Major
>
> When inserting data with dynamic partitions, the operation fails unless all
> partitions are dynamic. For example:
> The query
> {code:sql}
> insert overwrite table t1 (part_a='a', part_b) select * from t2
> {code}
> will fail with errors
> {code:xml}
> Cannot create partition spec from hdfs:/// ; missing keys [part_a]
> Ignoring invalid DP directory 
> {code}
> On the other hand, if I remove the static value of part_a to make the insert
> fully dynamic, the following query will succeed.
> {code:sql}
> insert overwrite table t1 (part_a, part_b) select * from t2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31605) Unable to insert data with partial dynamic partition with Spark & Hive 3

2020-04-29 Thread Amit Ashish (Jira)
Amit Ashish created SPARK-31605:
---

 Summary: Unable to insert data with partial dynamic partition with 
Spark & Hive 3
 Key: SPARK-31605
 URL: https://issues.apache.org/jira/browse/SPARK-31605
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
 Environment: Hortonworks HDP 3.1.0

Spark 2.3.2

Hive 3
Reporter: Amit Ashish


When inserting data with dynamic partitions, the operation fails unless all
partitions are dynamic. For example:

The query
{code:sql}
insert overwrite table t1 (part_a='a', part_b) select * from t2
{code}
will fail with errors
{code:xml}
Cannot create partition spec from hdfs:/// ; missing keys [part_a]
Ignoring invalid DP directory 
{code}
On the other hand, if I remove the static value of part_a to make the insert
fully dynamic, the following query will succeed.
{code:sql}
insert overwrite table t1 (part_a, part_b) select * from t2
{code}
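
For completeness, a self-contained sketch of the reported scenario; the
spark.sql driver code and the simplified managed-table DDL (the reported table
is external, with its location omitted here) are assumptions, not part of the
original report:

{code:scala}
// Assumes a Hive-enabled SparkSession named `spark` with dynamic partition
// mode set to nonstrict, as discussed above.
spark.sql("""
  create table if not exists test_insert (a int)
  partitioned by (part_a string, part_b string)
  stored as parquet
""")

// Mixed static/dynamic partition spec: reported to log the "invalid DP
// directory" warning and load 0 partitions.
spark.sql("insert into table test_insert partition(part_a='a', part_b) values (3, 'b')")

// Fully dynamic partition spec: reported to work.
spark.sql("insert into table test_insert partition(part_a, part_b) values (1, 'a', 'b')")
{code}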



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


