[jira] [Commented] (SPARK-28424) Support typed interval expression
[ https://issues.apache.org/jira/browse/SPARK-28424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096158#comment-17096158 ]

Apache Spark commented on SPARK-28424:
--------------------------------------

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/28418

> Support typed interval expression
> ---------------------------------
>
>                 Key: SPARK-28424
>                 URL: https://issues.apache.org/jira/browse/SPARK-28424
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.0.0
>
> Example:
> {code:sql}
> INTERVAL '1 day 2:03:04'
> {code}
> https://www.postgresql.org/docs/11/datatype-datetime.html

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
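For readers who want to try the literal above, here is a minimal Scala sketch, assuming the syntax landed in Spark 3.0.0 as the ticket's fix version indicates; the session setup is generic and the exact displayed output is not verified here.

{code:scala}
import org.apache.spark.sql.SparkSession

object IntervalDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("interval-demo").getOrCreate()

    // PostgreSQL-style day-time interval literal from the ticket's example.
    spark.sql("SELECT INTERVAL '1 day 2:03:04' AS iv").show(truncate = false)

    spark.stop()
  }
}
{code}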
[jira] [Resolved] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED
[ https://issues.apache.org/jira/browse/SPARK-31576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liuzhang resolved SPARK-31576.
------------------------------
    Resolution: Fixed

override def quoteIdentifier(colName: String): String = s"$colName"

> Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31576
>                 URL: https://issues.apache.org/jira/browse/SPARK-31576
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, Spark Submit
>    Affects Versions: 2.3.1
>         Environment: HDP 3.0, Hadoop 3.1.1, Spark 2.3.1
>            Reporter: liuzhang
>            Priority: Major
>
> I'm trying to fetch data back into Spark SQL using a JDBC connection to Hive. Unfortunately, when I query data from any column I get the following error:
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'test.aname': (possible column names are: aname, score, banji)
>   at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
>   at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)
> 1) On Hive, create a simple table named "test" with three columns (aname, score, banji), all of type String.
> 2) The important code:
> object HiveDialect extends JdbcDialect {
>   override def canHandle(url: String): Boolean =
>     url.startsWith("jdbc:hive2") || url.contains("hive2")
>   override def quoteIdentifier(colName: String): String = s"`$colName`"
> }
> ---
> object callOffRun {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
>     JdbcDialects.registerDialect(HiveDialect)
>     val props = new Properties()
>     props.put("driver", "org.apache.hive.jdbc.HiveDriver")
>     props.put("user", "username")
>     props.put("password", "password")
>     props.put("fetchsize", "20")
>     val table = spark.read.jdbc("jdbc:hive2://:1", "test", props)
>     table.show()
>   }
> }
> 3) After running via spark-submit, it fails with the same error:
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'test.aname': (possible column names are: aname, score, banji)
> 4) table.count() returns a result.
> 5) I tried several methods to print the result; they all reported the same error.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED
[ https://issues.apache.org/jira/browse/SPARK-31576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liuzhang closed SPARK-31576.
----------------------------

> Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31576
>                 URL: https://issues.apache.org/jira/browse/SPARK-31576
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, Spark Submit
>    Affects Versions: 2.3.1
>         Environment: HDP 3.0, Hadoop 3.1.1, Spark 2.3.1
>            Reporter: liuzhang
>            Priority: Major
>
> I'm trying to fetch data back into Spark SQL using a JDBC connection to Hive. Unfortunately, when I query data from any column I get the following error:
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'test.aname': (possible column names are: aname, score, banji)
>   at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
>   at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)
> 1) On Hive, create a simple table named "test" with three columns (aname, score, banji), all of type String.
> 2) The important code:
> object HiveDialect extends JdbcDialect {
>   override def canHandle(url: String): Boolean =
>     url.startsWith("jdbc:hive2") || url.contains("hive2")
>   override def quoteIdentifier(colName: String): String = s"`$colName`"
> }
> ---
> object callOffRun {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
>     JdbcDialects.registerDialect(HiveDialect)
>     val props = new Properties()
>     props.put("driver", "org.apache.hive.jdbc.HiveDriver")
>     props.put("user", "username")
>     props.put("password", "password")
>     props.put("fetchsize", "20")
>     val table = spark.read.jdbc("jdbc:hive2://:1", "test", props)
>     table.show()
>   }
> }
> 3) After running via spark-submit, it fails with the same error:
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'test.aname': (possible column names are: aname, score, banji)
> 4) table.count() returns a result.
> 5) I tried several methods to print the result; they all reported the same error.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31614) Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED
liuzhang created SPARK-31614:
---------------------------------

             Summary: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED
                 Key: SPARK-31614
                 URL: https://issues.apache.org/jira/browse/SPARK-31614
             Project: Spark
          Issue Type: Bug
          Components: Spark Shell, Spark Submit
    Affects Versions: 2.3.1
         Environment: HDP 3.0, Spark 2.3.1, Hadoop 3.1.1
            Reporter: liuzhang


I'm trying to write data into a Hive table using a JDBC connection to Hive. Unfortunately, when I write the data I get the following error:

org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:36 cannot recognize input near '.' 'aname' 'TEXT' in column type
  at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:255)
  at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:241)

1) On Hive, create a simple table named "test" with three columns (aname, score, banji), all of type String.
2) The important code:

object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2") || url.contains("hive2")
  override def quoteIdentifier(colName: String): String = s"$colName"
}
---
object callOffRun {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    JdbcDialects.registerDialect(HiveDialect)
    val props = new Properties()
    props.put("driver", "org.apache.hive.jdbc.HiveDriver")
    props.put("user", "username")
    props.put("password", "password")
    props.put("fetchsize", "20")
    val table = spark.read.jdbc("jdbc:hive2://:1", "test", props)
    table.write.jdbc("jdbc:hive2://:1", "resulttable", props)
  }
}

3) After running via spark-submit, the error occurs when the table is written.
4) table.count() returns a result.
5) I tried several methods to write data into the table; they all reported the same error.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
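Taken together, SPARK-31576 and SPARK-31614 both revolve around how a custom JdbcDialect quotes identifiers for HiveServer2. Below is a self-contained sketch of the moving parts, assuming a reachable HiveServer2 endpoint; the host, port, credentials, and table names are placeholders, and the body of quoteIdentifier is exactly the variable these two reports are probing.

{code:scala}
import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Routes jdbc:hive2 URLs through this dialect instead of the default one.
object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2") || url.contains("hive2")

  // SPARK-31576's reporter saw reads fail with backtick quoting
  // (s"`$colName`") and closed the issue after returning the name unquoted;
  // SPARK-31614 reports that writes still fail. Unquoted form shown here.
  override def quoteIdentifier(colName: String): String = colName
}

object HiveJdbcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    JdbcDialects.registerDialect(HiveDialect)

    val props = new Properties()
    props.put("driver", "org.apache.hive.jdbc.HiveDriver")
    props.put("user", "username")     // placeholder credentials
    props.put("password", "password")
    props.put("fetchsize", "20")

    val url = "jdbc:hive2://host:10000" // placeholder HiveServer2 endpoint
    val table = spark.read.jdbc(url, "test", props)
    table.show()                                // read path (SPARK-31576)
    table.write.jdbc(url, "resulttable", props) // write path (SPARK-31614)
    spark.stop()
  }
}
{code}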
[jira] [Assigned] (SPARK-20732) Copy cache data when node is being shut down
[ https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20732: Assignee: Apache Spark (was: Prakhar Jain) > Copy cache data when node is being shut down > > > Key: SPARK-20732 > URL: https://issues.apache.org/jira/browse/SPARK-20732 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31601) Fix spark.kubernetes.executor.podNamePrefix to work
[ https://issues.apache.org/jira/browse/SPARK-31601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31601. --- Fix Version/s: 3.0.0 2.4.6 Assignee: Dongjoon Hyun Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/28401 > Fix spark.kubernetes.executor.podNamePrefix to work > --- > > Key: SPARK-31601 > URL: https://issues.apache.org/jira/browse/SPARK-31601 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.6, 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20732) Copy cache data when node is being shut down
[ https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096138#comment-17096138 ] Apache Spark commented on SPARK-20732: -- User 'prakharjain09' has created a pull request for this issue: https://github.com/apache/spark/pull/28370 > Copy cache data when node is being shut down > > > Key: SPARK-20732 > URL: https://issues.apache.org/jira/browse/SPARK-20732 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Assignee: Prakhar Jain >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20732) Copy cache data when node is being shut down
[ https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20732: Assignee: Prakhar Jain (was: Apache Spark) > Copy cache data when node is being shut down > > > Key: SPARK-20732 > URL: https://issues.apache.org/jira/browse/SPARK-20732 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Assignee: Prakhar Jain >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31613) How can I run Spark 3.0.0-preview2 on a Spark 2.4 CDH cluster?
Vutukuri Sathvik created SPARK-31613:
----------------------------------------

             Summary: How can I run Spark 3.0.0-preview2 on a Spark 2.4 CDH cluster?
                 Key: SPARK-31613
                 URL: https://issues.apache.org/jira/browse/SPARK-31613
             Project: Spark
          Issue Type: Question
          Components: Spark Submit
    Affects Versions: 3.0.0
            Reporter: Vutukuri Sathvik


I am trying to run spark-submit with a FAT jar that bundles the Spark 3.0.0 dependencies on a Spark 2.4 CDH cluster, but the job still runs with Spark 2.4. How can I make spark-submit use the bundled Spark 3.0.0 instead of the cluster's Spark 2.4?

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31612) SQL Reference clean up
[ https://issues.apache.org/jira/browse/SPARK-31612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096124#comment-17096124 ] Apache Spark commented on SPARK-31612: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28417 > SQL Reference clean up > -- > > Key: SPARK-31612 > URL: https://issues.apache.org/jira/browse/SPARK-31612 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Minor > > SQL Reference clean up -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31612) SQL Reference clean up
[ https://issues.apache.org/jira/browse/SPARK-31612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096122#comment-17096122 ] Apache Spark commented on SPARK-31612: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28417 > SQL Reference clean up > -- > > Key: SPARK-31612 > URL: https://issues.apache.org/jira/browse/SPARK-31612 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Minor > > SQL Reference clean up -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31612) SQL Reference clean up
[ https://issues.apache.org/jira/browse/SPARK-31612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31612: Assignee: (was: Apache Spark) > SQL Reference clean up > -- > > Key: SPARK-31612 > URL: https://issues.apache.org/jira/browse/SPARK-31612 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Minor > > SQL Reference clean up -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31612) SQL Reference clean up
[ https://issues.apache.org/jira/browse/SPARK-31612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31612: Assignee: Apache Spark > SQL Reference clean up > -- > > Key: SPARK-31612 > URL: https://issues.apache.org/jira/browse/SPARK-31612 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > SQL Reference clean up -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31557) Legacy parser incorrectly interprets pre-Gregorian dates
[ https://issues.apache.org/jira/browse/SPARK-31557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096116#comment-17096116 ]

Apache Spark commented on SPARK-31557:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28408

> Legacy parser incorrectly interprets pre-Gregorian dates
> ---------------------------------------------------------
>
>                 Key: SPARK-31557
>                 URL: https://issues.apache.org/jira/browse/SPARK-31557
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Bruce Robbins
>            Assignee: Bruce Robbins
>            Priority: Major
>             Fix For: 3.0.0
>
> With CSV:
> {noformat}
> scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", "1800-01-01").map(x => s"$x,$x")
> seq: Seq[String] = List(0002-01-01,0002-01-01, 1000-01-01,1000-01-01, 1500-01-01,1500-01-01, 1800-01-01,1800-01-01)
> scala> val ds = seq.toDF("value").as[String]
> ds: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> spark.read.schema("expected STRING, actual DATE").csv(ds).show
> +----------+----------+
> |  expected|    actual|
> +----------+----------+
> |0002-01-01|0001-12-30|
> |1000-01-01|1000-01-06|
> |1500-01-01|1500-01-10|
> |1800-01-01|1800-01-01|
> +----------+----------+
> scala>
> {noformat}
> Similarly, with JSON:
> {noformat}
> scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val seq = Seq("0002-01-01", "1000-01-01", "1500-01-01", "1800-01-01").map { x =>
>      |   s"""{"expected": "$x", "actual": "$x"}"""
>      | }
> seq: Seq[String] = List({"expected": "0002-01-01", "actual": "0002-01-01"}, {"expected": "1000-01-01", "actual": "1000-01-01"}, {"expected": "1500-01-01", "actual": "1500-01-01"}, {"expected": "1800-01-01", "actual": "1800-01-01"})
> scala> val ds = seq.toDF("value").as[String]
> ds: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> spark.read.schema("expected STRING, actual DATE").json(ds).show
> +----------+----------+
> |  expected|    actual|
> +----------+----------+
> |0002-01-01|0001-12-30|
> |1000-01-01|1000-01-06|
> |1500-01-01|1500-01-10|
> |1800-01-01|1800-01-01|
> +----------+----------+
> scala>
> {noformat}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31612) SQL Reference clean up
Huaxin Gao created SPARK-31612: -- Summary: SQL Reference clean up Key: SPARK-31612 URL: https://issues.apache.org/jira/browse/SPARK-31612 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao SQL Reference clean up -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31127) Add abstract Selector
[ https://issues.apache.org/jira/browse/SPARK-31127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31127:
------------------------------------

    Assignee: (was: Apache Spark)

> Add abstract Selector
> ---------------------
>
>                 Key: SPARK-31127
>                 URL: https://issues.apache.org/jira/browse/SPARK-31127
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: Huaxin Gao
>            Priority: Major
>
> Add an abstract Selector. Move the code shared by ChisqSelector and FValueSelector into Selector.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31127) Add abstract Selector
[ https://issues.apache.org/jira/browse/SPARK-31127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31127:
------------------------------------

    Assignee: Apache Spark

> Add abstract Selector
> ---------------------
>
>                 Key: SPARK-31127
>                 URL: https://issues.apache.org/jira/browse/SPARK-31127
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: Huaxin Gao
>            Assignee: Apache Spark
>            Priority: Major
>
> Add an abstract Selector. Move the code shared by ChisqSelector and FValueSelector into Selector.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
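For readers unfamiliar with this kind of refactoring, a schematic sketch of what "move the common code into an abstract Selector" means is below. Every name and signature here is illustrative only and is not the actual spark.ml API; each concrete selector would supply just its own test statistic while the shared selection logic lives in the base type.

{code:scala}
// Illustrative sketch only; not Spark's API.
trait Selector {
  // Shared hyper-parameter: how many top-scoring features to keep.
  def numTopFeatures: Int

  // Each concrete selector (chi-squared, F-value, ...) supplies its scores,
  // one per feature, computed from (label, features) pairs.
  protected def testScores(labeled: Seq[(Double, Array[Double])]): Array[Double]

  // Shared logic: rank features by score and keep the best ones.
  def selectIndices(labeled: Seq[(Double, Array[Double])]): Array[Int] =
    testScores(labeled).zipWithIndex
      .sortBy { case (score, _) => -score }
      .take(numTopFeatures)
      .map { case (_, idx) => idx }
}
{code}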
[jira] [Resolved] (SPARK-31372) Display expression schema for double checkout alias
[ https://issues.apache.org/jira/browse/SPARK-31372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-31372.
---------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 28194
[https://github.com/apache/spark/pull/28194]

> Display expression schema for double checkout alias
> ---------------------------------------------------
>
>                 Key: SPARK-31372
>                 URL: https://issues.apache.org/jira/browse/SPARK-31372
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: jiaan.geng
>            Assignee: jiaan.geng
>            Priority: Major
>             Fix For: 3.0.0
>
> Although SPARK-30184 implemented a helper method for aliasing functions, developers often forget to use it.
> We need to add more powerful guarantees so that the aliases output by built-in functions are correct.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31372) Display expression schema for double checkout alias
[ https://issues.apache.org/jira/browse/SPARK-31372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-31372:
-----------------------------------

    Assignee: jiaan.geng

> Display expression schema for double checkout alias
> ---------------------------------------------------
>
>                 Key: SPARK-31372
>                 URL: https://issues.apache.org/jira/browse/SPARK-31372
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: jiaan.geng
>            Assignee: jiaan.geng
>            Priority: Major
>
> Although SPARK-30184 implemented a helper method for aliasing functions, developers often forget to use it.
> We need to add more powerful guarantees so that the aliases output by built-in functions are correct.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system
[ https://issues.apache.org/jira/browse/SPARK-31611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31611: Assignee: (was: Apache Spark) > Register NettyMemoryMetrics into Node Manager's metrics system > -- > > Key: SPARK-31611 > URL: https://issues.apache.org/jira/browse/SPARK-31611 > Project: Spark > Issue Type: Improvement > Components: Shuffle, YARN >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system
[ https://issues.apache.org/jira/browse/SPARK-31611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31611: Assignee: Apache Spark > Register NettyMemoryMetrics into Node Manager's metrics system > -- > > Key: SPARK-31611 > URL: https://issues.apache.org/jira/browse/SPARK-31611 > Project: Spark > Issue Type: Improvement > Components: Shuffle, YARN >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system
[ https://issues.apache.org/jira/browse/SPARK-31611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096090#comment-17096090 ] Apache Spark commented on SPARK-31611: -- User 'manuzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/28416 > Register NettyMemoryMetrics into Node Manager's metrics system > -- > > Key: SPARK-31611 > URL: https://issues.apache.org/jira/browse/SPARK-31611 > Project: Spark > Issue Type: Improvement > Components: Shuffle, YARN >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string
[ https://issues.apache.org/jira/browse/SPARK-31595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31595:
------------------------------------

    Assignee: Apache Spark

> Spark sql cli should allow unescaped quote mark in quoted string
> -----------------------------------------------------------------
>
>                 Key: SPARK-31595
>                 URL: https://issues.apache.org/jira/browse/SPARK-31595
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Adrian Wang
>            Assignee: Apache Spark
>            Priority: Major
>
> spark-sql> select "'";
> spark-sql> select '"';
> In the Spark parser, if we pass the text `select "'";`, a ParserCancellationException is thrown, which is then handled by PredictionMode.LL. By dropping the trailing `;` correctly we can avoid that retry.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string
[ https://issues.apache.org/jira/browse/SPARK-31595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31595:
------------------------------------

    Assignee: (was: Apache Spark)

> Spark sql cli should allow unescaped quote mark in quoted string
> -----------------------------------------------------------------
>
>                 Key: SPARK-31595
>                 URL: https://issues.apache.org/jira/browse/SPARK-31595
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Adrian Wang
>            Priority: Major
>
> spark-sql> select "'";
> spark-sql> select '"';
> In the Spark parser, if we pass the text `select "'";`, a ParserCancellationException is thrown, which is then handled by PredictionMode.LL. By dropping the trailing `;` correctly we can avoid that retry.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31608) Add a hybrid KVStore to make UI loading faster
[ https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31608:
------------------------------------

    Assignee: (was: Apache Spark)

> Add a hybrid KVStore to make UI loading faster
> ----------------------------------------------
>
>                 Key: SPARK-31608
>                 URL: https://issues.apache.org/jira/browse/SPARK-31608
>             Project: Spark
>          Issue Type: Story
>          Components: Web UI
>    Affects Versions: 3.0.1
>            Reporter: Baohe Zhang
>            Priority: Major
>
> This is a follow-up to the work done by Hieu Huynh in 2019.
> Add a new class HybridKVStore to make the history server load event files faster. When writing to this kvstore, it first writes to an in-memory store and has a background thread that keeps pushing the changes to LevelDB.
> I ran some tests on 3.0.1 on macOS:
> ||kvstore type / log size||100m||200m||500m||1g||2g||
> |HybridKVStore|5s to parse, 7s (including the parsing time) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb|
> |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31608) Add a hybrid KVStore to make UI loading faster
[ https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31608:
------------------------------------

    Assignee: Apache Spark

> Add a hybrid KVStore to make UI loading faster
> ----------------------------------------------
>
>                 Key: SPARK-31608
>                 URL: https://issues.apache.org/jira/browse/SPARK-31608
>             Project: Spark
>          Issue Type: Story
>          Components: Web UI
>    Affects Versions: 3.0.1
>            Reporter: Baohe Zhang
>            Assignee: Apache Spark
>            Priority: Major
>
> This is a follow-up to the work done by Hieu Huynh in 2019.
> Add a new class HybridKVStore to make the history server load event files faster. When writing to this kvstore, it first writes to an in-memory store and has a background thread that keeps pushing the changes to LevelDB.
> I ran some tests on 3.0.1 on macOS:
> ||kvstore type / log size||100m||200m||500m||1g||2g||
> |HybridKVStore|5s to parse, 7s (including the parsing time) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb|
> |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
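To make the design concrete, here is a toy sketch of the hybrid write path described above: writes land in a fast in-memory map while a background thread drains them to a slower backing store standing in for LevelDB. Class and method names are invented for illustration and are not the PR's API.

{code:scala}
import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue}

class HybridStore[K, V](backing: java.util.Map[K, V]) {
  private val memory  = new ConcurrentHashMap[K, V]()
  private val pending = new LinkedBlockingQueue[(K, V)]()

  private val drainer = new Thread(() => {
    try {
      while (true) {
        val (k, v) = pending.take() // blocks until a write is queued
        backing.put(k, v)           // slow write happens off the hot path
      }
    } catch {
      case _: InterruptedException => () // shut down the drain loop
    }
  })
  drainer.setDaemon(true)
  drainer.start()

  def write(key: K, value: V): Unit = {
    memory.put(key, value)    // fast; immediately visible to readers
    pending.put((key, value)) // queued for the background flush to `backing`
  }

  def read(key: K): Option[V] =
    Option(memory.get(key)).orElse(Option(backing.get(key)))
}
{code}

Exercised with, say, new HybridStore[String, String](new java.util.concurrent.ConcurrentHashMap[String, String]()), a write returns as soon as the in-memory put completes; the real change additionally has to switch readers over to LevelDB once the drain finishes, which is what the "switch to leveldb" timings in the table above measure.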
[jira] [Created] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system
Manu Zhang created SPARK-31611: -- Summary: Register NettyMemoryMetrics into Node Manager's metrics system Key: SPARK-31611 URL: https://issues.apache.org/jira/browse/SPARK-31611 Project: Spark Issue Type: Improvement Components: Shuffle, YARN Affects Versions: 3.0.0 Reporter: Manu Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31553) Wrong result of isInCollection for large collections
[ https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096068#comment-17096068 ]

Apache Spark commented on SPARK-31553:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28405

> Wrong result of isInCollection for large collections
> -----------------------------------------------------
>
>                 Key: SPARK-31553
>                 URL: https://issues.apache.org/jira/browse/SPARK-31553
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.0.0
>
> If the size of a collection passed to isInCollection is bigger than spark.sql.optimizer.inSetConversionThreshold, the method can return wrong results for some inputs. For example:
> {code:scala}
> val set = (0 to 20).map(_.toString).toSet
> val data = Seq("1").toDF("x")
> println(set.contains("1"))
> data.select($"x".isInCollection(set).as("isInCollection")).show()
> {code}
> {code}
> true
> +--------------+
> |isInCollection|
> +--------------+
> |         false|
> +--------------+
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31553) Wrong result of isInCollection for large collections
[ https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096067#comment-17096067 ]

Apache Spark commented on SPARK-31553:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28405

> Wrong result of isInCollection for large collections
> -----------------------------------------------------
>
>                 Key: SPARK-31553
>                 URL: https://issues.apache.org/jira/browse/SPARK-31553
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.0.0
>
> If the size of a collection passed to isInCollection is bigger than spark.sql.optimizer.inSetConversionThreshold, the method can return wrong results for some inputs. For example:
> {code:scala}
> val set = (0 to 20).map(_.toString).toSet
> val data = Seq("1").toDF("x")
> println(set.contains("1"))
> data.select($"x".isInCollection(set).as("isInCollection")).show()
> {code}
> {code}
> true
> +--------------+
> |isInCollection|
> +--------------+
> |         false|
> +--------------+
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation
[ https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-31449:
-----------------------------------

    Assignee: Maxim Gekk

> Investigate the difference between JDK and Spark's time zone offset calculation
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-31449
>                 URL: https://issues.apache.org/jira/browse/SPARK-31449
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>
> Spark 2.4 calculates time zone offsets from the wall clock timestamp using `DateTimeUtils.getOffsetFromLocalMillis()` (see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118):
> {code:scala}
>   private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): Long = {
>     var guess = tz.getRawOffset
>     // the actual offset should be calculated based on milliseconds in UTC
>     val offset = tz.getOffset(millisLocal - guess)
>     if (offset != guess) {
>       guess = tz.getOffset(millisLocal - offset)
>       if (guess != offset) {
>         // fallback to do the reverse lookup using java.sql.Timestamp
>         // this should only happen near the start or end of DST
>         val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>         val year = getYear(days)
>         val month = getMonth(days)
>         val day = getDayOfMonth(days)
>         var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt
>         if (millisOfDay < 0) {
>           millisOfDay += MILLIS_PER_DAY.toInt
>         }
>         val seconds = (millisOfDay / 1000L).toInt
>         val hh = seconds / 3600
>         val mm = seconds / 60 % 60
>         val ss = seconds % 60
>         val ms = millisOfDay % 1000
>         val calendar = Calendar.getInstance(tz)
>         calendar.set(year, month - 1, day, hh, mm, ss)
>         calendar.set(Calendar.MILLISECOND, ms)
>         guess = (millisLocal - calendar.getTimeInMillis()).toInt
>       }
>     }
>     guess
>   }
> {code}
> Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801:
> {code:java}
> if (zone instanceof ZoneInfo) {
>     ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets);
> } else {
>     int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ?
>                         internalGet(ZONE_OFFSET) : zone.getRawOffset();
>     zone.getOffsets(millis - gmtOffset, zoneOffsets);
> }
> {code}
> We need to investigate whether there are any differences in results between the two approaches.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31372) Display expression schema for double checkout alias
[ https://issues.apache.org/jira/browse/SPARK-31372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31372:
------------------------------------

    Assignee: Apache Spark

> Display expression schema for double checkout alias
> ---------------------------------------------------
>
>                 Key: SPARK-31372
>                 URL: https://issues.apache.org/jira/browse/SPARK-31372
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: jiaan.geng
>            Assignee: Apache Spark
>            Priority: Major
>
> Although SPARK-30184 implemented a helper method for aliasing functions, developers often forget to use it.
> We need to add more powerful guarantees so that the aliases output by built-in functions are correct.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096059#comment-17096059 ] Wenchen Fan edited comment on SPARK-31602 at 4/30/20, 3:15 AM: --- The cache is a soft-reference map, which should not cause OOM? was (Author: cloud_fan): It's a soft-reference map, which should not cause OOM? > memory leak of JobConf > -- > > Key: SPARK-31602 > URL: https://issues.apache.org/jira/browse/SPARK-31602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2020-04-29-14-34-39-496.png, > image-2020-04-29-14-35-55-986.png > > > !image-2020-04-29-14-34-39-496.png! > !image-2020-04-29-14-35-55-986.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31372) Display expression schema for double checkout alias
[ https://issues.apache.org/jira/browse/SPARK-31372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31372:
------------------------------------

    Assignee: (was: Apache Spark)

> Display expression schema for double checkout alias
> ---------------------------------------------------
>
>                 Key: SPARK-31372
>                 URL: https://issues.apache.org/jira/browse/SPARK-31372
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: jiaan.geng
>            Priority: Major
>
> Although SPARK-30184 implemented a helper method for aliasing functions, developers often forget to use it.
> We need to add more powerful guarantees so that the aliases output by built-in functions are correct.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31030) Backward Compatibility for Parsing and Formatting Datetime
[ https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096060#comment-17096060 ]

Apache Spark commented on SPARK-31030:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/28415

> Backward Compatibility for Parsing and Formatting Datetime
> -----------------------------------------------------------
>
>                 Key: SPARK-31030
>                 URL: https://issues.apache.org/jira/browse/SPARK-31030
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuanjian Li
>            Assignee: Yuanjian Li
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: image-2020-03-04-10-54-05-208.png, image-2020-03-04-10-54-13-238.png
>
> *Background*
> In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed by using the hybrid calendar ([Julian + Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]).
> Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as well as the one chosen in the ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes (the java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]).
> The switching job was completed in SPARK-26651.
>
> *Problem*
> Switching to the Java 8 datetime API breaks backward compatibility with Spark 2.4 and earlier when parsing datetimes. Spark needs its own pattern definitions for datetime parsing and formatting.
>
> *Solution*
> To avoid unexpected result changes after the underlying datetime API switch, we propose the following solution.
> * Introduce a fallback mechanism: when the Java 8-based parser fails, we need to detect these behavior differences by falling back to the legacy parser, and fail with a user-friendly error message telling users what has changed and how to fix the pattern.
> * Document Spark's datetime patterns: Spark's date-time formatter is decoupled from the Java patterns. Spark's patterns are mainly based on [Java 7's patterns|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] (for better backward compatibility), with customized logic to handle the breaking changes between the [Java 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] and [Java 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html] pattern strings. Below are the customized rules:
> ||Pattern||Java 7||Java 8||Example||Rule||
> |u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (Different from y, u accepts a negative value to represent BC, while y should be used together with G to do the same thing.)|!image-2020-03-04-10-54-05-208.png!|Substitute ‘u’ with ‘e’ and use the Java 8 parser to parse the string. If parsable, return the result; otherwise, fall back to ‘u’ and use the legacy Java 7 parser. When that parses successfully, throw an exception asking users to change the pattern string or turn on legacy mode; otherwise, return NULL as Spark 2.4 does.|
> |z|General time zone, which also accepts [RFC 822 time zones|#rfc822timezone]|Only accepts time-zone names, e.g. Pacific Standard Time; PST|!image-2020-03-04-10-54-13-238.png!|The semantics of ‘z’ differ between Java 7 and Java 8; Spark 3.0 follows the semantics of Java 8. Use Java 8 to parse the string. If parsable, return the result; otherwise, use the legacy Java 7 parser. When that parses successfully, throw an exception asking users to change the pattern string or turn on legacy mode; otherwise, return NULL as Spark 2.4 does.|

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
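The proposed fallback mechanism is easy to picture with plain JDK APIs. Below is a sketch under the assumption that the same pattern string is meaningful to both parsers; the method name and the error message are illustrative, not Spark's internal API.

{code:scala}
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

import scala.util.Try

object FallbackParserSketch {
  def parseDateWithFallback(s: String, pattern: String): Option[LocalDate] =
    Try(LocalDate.parse(s, DateTimeFormatter.ofPattern(pattern))).toOption match {
      case some @ Some(_) => some // the Java 8 parser accepted the input
      case None =>
        if (Try(new SimpleDateFormat(pattern).parse(s)).isSuccess) {
          // Java 8 rejected it but the legacy parser accepts it: fail loudly
          // so the user fixes the pattern or enables legacy mode.
          throw new IllegalArgumentException(
            s"'$s' parses under the legacy pattern '$pattern' but not under " +
              "the new parser; fix the pattern or enable legacy time parsing")
        } else {
          None // neither parser accepts it: return NULL, as Spark 2.4 did
        }
    }
}
{code}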
[jira] [Commented] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation
[ https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096064#comment-17096064 ]

Apache Spark commented on SPARK-31449:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28410

> Investigate the difference between JDK and Spark's time zone offset calculation
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-31449
>                 URL: https://issues.apache.org/jira/browse/SPARK-31449
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 2.4.6
>
> Spark 2.4 calculates time zone offsets from the wall clock timestamp using `DateTimeUtils.getOffsetFromLocalMillis()` (see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118):
> {code:scala}
>   private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): Long = {
>     var guess = tz.getRawOffset
>     // the actual offset should be calculated based on milliseconds in UTC
>     val offset = tz.getOffset(millisLocal - guess)
>     if (offset != guess) {
>       guess = tz.getOffset(millisLocal - offset)
>       if (guess != offset) {
>         // fallback to do the reverse lookup using java.sql.Timestamp
>         // this should only happen near the start or end of DST
>         val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>         val year = getYear(days)
>         val month = getMonth(days)
>         val day = getDayOfMonth(days)
>         var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt
>         if (millisOfDay < 0) {
>           millisOfDay += MILLIS_PER_DAY.toInt
>         }
>         val seconds = (millisOfDay / 1000L).toInt
>         val hh = seconds / 3600
>         val mm = seconds / 60 % 60
>         val ss = seconds % 60
>         val ms = millisOfDay % 1000
>         val calendar = Calendar.getInstance(tz)
>         calendar.set(year, month - 1, day, hh, mm, ss)
>         calendar.set(Calendar.MILLISECOND, ms)
>         guess = (millisLocal - calendar.getTimeInMillis()).toInt
>       }
>     }
>     guess
>   }
> {code}
> Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801:
> {code:java}
> if (zone instanceof ZoneInfo) {
>     ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets);
> } else {
>     int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ?
>                         internalGet(ZONE_OFFSET) : zone.getRawOffset();
>     zone.getOffsets(millis - gmtOffset, zoneOffsets);
> }
> {code}
> We need to investigate whether there are any differences in results between the two approaches.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31449) Investigate the difference between JDK and Spark's time zone offset calculation
[ https://issues.apache.org/jira/browse/SPARK-31449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-31449.
---------------------------------
    Fix Version/s: 2.4.6
       Resolution: Fixed

Issue resolved by pull request 28410
[https://github.com/apache/spark/pull/28410]

> Investigate the difference between JDK and Spark's time zone offset calculation
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-31449
>                 URL: https://issues.apache.org/jira/browse/SPARK-31449
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 2.4.6
>
> Spark 2.4 calculates time zone offsets from the wall clock timestamp using `DateTimeUtils.getOffsetFromLocalMillis()` (see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1088-L1118):
> {code:scala}
>   private[sql] def getOffsetFromLocalMillis(millisLocal: Long, tz: TimeZone): Long = {
>     var guess = tz.getRawOffset
>     // the actual offset should be calculated based on milliseconds in UTC
>     val offset = tz.getOffset(millisLocal - guess)
>     if (offset != guess) {
>       guess = tz.getOffset(millisLocal - offset)
>       if (guess != offset) {
>         // fallback to do the reverse lookup using java.sql.Timestamp
>         // this should only happen near the start or end of DST
>         val days = Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>         val year = getYear(days)
>         val month = getMonth(days)
>         val day = getDayOfMonth(days)
>         var millisOfDay = (millisLocal % MILLIS_PER_DAY).toInt
>         if (millisOfDay < 0) {
>           millisOfDay += MILLIS_PER_DAY.toInt
>         }
>         val seconds = (millisOfDay / 1000L).toInt
>         val hh = seconds / 3600
>         val mm = seconds / 60 % 60
>         val ss = seconds % 60
>         val ms = millisOfDay % 1000
>         val calendar = Calendar.getInstance(tz)
>         calendar.set(year, month - 1, day, hh, mm, ss)
>         calendar.set(Calendar.MILLISECOND, ms)
>         guess = (millisLocal - calendar.getTimeInMillis()).toInt
>       }
>     }
>     guess
>   }
> {code}
> Meanwhile, JDK's GregorianCalendar uses special methods of ZoneInfo, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801:
> {code:java}
> if (zone instanceof ZoneInfo) {
>     ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets);
> } else {
>     int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ?
>                         internalGet(ZONE_OFFSET) : zone.getRawOffset();
>     zone.getOffsets(millis - gmtOffset, zoneOffsets);
> }
> {code}
> We need to investigate whether there are any differences in results between the two approaches.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31030) Backward Compatibility for Parsing and Formatting Datetime
[ https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096061#comment-17096061 ]

Apache Spark commented on SPARK-31030:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/28415

> Backward Compatibility for Parsing and Formatting Datetime
> -----------------------------------------------------------
>
>                 Key: SPARK-31030
>                 URL: https://issues.apache.org/jira/browse/SPARK-31030
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuanjian Li
>            Assignee: Yuanjian Li
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: image-2020-03-04-10-54-05-208.png, image-2020-03-04-10-54-13-238.png
>
> *Background*
> In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed by using the hybrid calendar ([Julian + Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]).
> Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as well as the one chosen in the ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes (the java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]).
> The switching job was completed in SPARK-26651.
>
> *Problem*
> Switching to the Java 8 datetime API breaks backward compatibility with Spark 2.4 and earlier when parsing datetimes. Spark needs its own pattern definitions for datetime parsing and formatting.
>
> *Solution*
> To avoid unexpected result changes after the underlying datetime API switch, we propose the following solution.
> * Introduce a fallback mechanism: when the Java 8-based parser fails, we need to detect these behavior differences by falling back to the legacy parser, and fail with a user-friendly error message telling users what has changed and how to fix the pattern.
> * Document Spark's datetime patterns: Spark's date-time formatter is decoupled from the Java patterns. Spark's patterns are mainly based on [Java 7's patterns|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] (for better backward compatibility), with customized logic to handle the breaking changes between the [Java 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] and [Java 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html] pattern strings. Below are the customized rules:
> ||Pattern||Java 7||Java 8||Example||Rule||
> |u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (Different from y, u accepts a negative value to represent BC, while y should be used together with G to do the same thing.)|!image-2020-03-04-10-54-05-208.png!|Substitute ‘u’ with ‘e’ and use the Java 8 parser to parse the string. If parsable, return the result; otherwise, fall back to ‘u’ and use the legacy Java 7 parser. When that parses successfully, throw an exception asking users to change the pattern string or turn on legacy mode; otherwise, return NULL as Spark 2.4 does.|
> |z|General time zone, which also accepts [RFC 822 time zones|#rfc822timezone]|Only accepts time-zone names, e.g. Pacific Standard Time; PST|!image-2020-03-04-10-54-13-238.png!|The semantics of ‘z’ differ between Java 7 and Java 8; Spark 3.0 follows the semantics of Java 8. Use Java 8 to parse the string. If parsable, return the result; otherwise, use the legacy Java 7 parser. When that parses successfully, throw an exception asking users to change the pattern string or turn on legacy mode; otherwise, return NULL as Spark 2.4 does.|

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096059#comment-17096059 ] Wenchen Fan commented on SPARK-31602: - It's a soft-reference map, which should not cause OOM? > memory leak of JobConf > -- > > Key: SPARK-31602 > URL: https://issues.apache.org/jira/browse/SPARK-31602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2020-04-29-14-34-39-496.png, > image-2020-04-29-14-35-55-986.png > > > !image-2020-04-29-14-34-39-496.png! > !image-2020-04-29-14-35-55-986.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31601) Fix spark.kubernetes.executor.podNamePrefix to work
[ https://issues.apache.org/jira/browse/SPARK-31601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31601: Assignee: (was: Apache Spark) > Fix spark.kubernetes.executor.podNamePrefix to work > --- > > Key: SPARK-31601 > URL: https://issues.apache.org/jira/browse/SPARK-31601 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31601) Fix spark.kubernetes.executor.podNamePrefix to work
[ https://issues.apache.org/jira/browse/SPARK-31601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31601: Assignee: Apache Spark > Fix spark.kubernetes.executor.podNamePrefix to work > --- > > Key: SPARK-31601 > URL: https://issues.apache.org/jira/browse/SPARK-31601 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31601) Fix spark.kubernetes.executor.podNamePrefix to work
[ https://issues.apache.org/jira/browse/SPARK-31601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096054#comment-17096054 ] Apache Spark commented on SPARK-31601: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28401 > Fix spark.kubernetes.executor.podNamePrefix to work > --- > > Key: SPARK-31601 > URL: https://issues.apache.org/jira/browse/SPARK-31601 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8981:
-----------------------------------

    Assignee: (was: Apache Spark)

> Set applicationId and appName in log4j MDC
> ------------------------------------------
>
>                 Key: SPARK-8981
>                 URL: https://issues.apache.org/jira/browse/SPARK-8981
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Paweł Kopiczko
>            Priority: Minor
>
> It would be nice to have, because it's good to have logs in one file when using log agents (like Logentries) in standalone mode. It also allows configuring a rolling file appender without a mess when multiple applications are running.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8981:
-----------------------------------

    Assignee: Apache Spark

> Set applicationId and appName in log4j MDC
> ------------------------------------------
>
>                 Key: SPARK-8981
>                 URL: https://issues.apache.org/jira/browse/SPARK-8981
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Paweł Kopiczko
>            Assignee: Apache Spark
>            Priority: Minor
>
> It would be nice to have, because it's good to have logs in one file when using log agents (like Logentries) in standalone mode. It also allows configuring a rolling file appender without a mess when multiple applications are running.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
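For concreteness, this is roughly what the requested feature looks like from user code via the standard SLF4J MDC API; the key names and values below are assumptions, since the ticket is precisely about Spark populating such entries automatically.

{code:scala}
import org.slf4j.{LoggerFactory, MDC}

object MdcExample {
  private val log = LoggerFactory.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    // Key names and values are assumptions for illustration; the ticket asks
    // Spark itself to set entries like these so every log line carries them.
    MDC.put("applicationId", "app-20200430-0001")
    MDC.put("appName", "my-spark-job")
    try {
      // With a layout pattern such as "%X{applicationId} %X{appName} %m%n"
      // in the log4j/logback config, both values appear on each message.
      log.info("task started")
    } finally {
      MDC.remove("applicationId")
      MDC.remove("appName")
    }
  }
}
{code}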
[jira] [Commented] (SPARK-28070) writeType and writeObject in SparkR should be handled by S3 methods
[ https://issues.apache.org/jira/browse/SPARK-28070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096044#comment-17096044 ] Apache Spark commented on SPARK-28070: -- User 'MichaelChirico' has created a pull request for this issue: https://github.com/apache/spark/pull/28379 > writeType and writeObject in SparkR should be handled by S3 methods > --- > > Key: SPARK-28070 > URL: https://issues.apache.org/jira/browse/SPARK-28070 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Michael Chirico >Priority: Major > > Corollary of https://issues.apache.org/jira/browse/SPARK-28040 > The way writeType and writeObject are handled now feels a bit hack-ish; it > would be easier to manage with S3 or S4. > NB: S3 will require changing the order of arguments so that dispatch happens > on the object (currently the first argument is con) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28070) writeType and writeObject in SparkR should be handled by S3 methods
[ https://issues.apache.org/jira/browse/SPARK-28070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096043#comment-17096043 ] Apache Spark commented on SPARK-28070: -- User 'MichaelChirico' has created a pull request for this issue: https://github.com/apache/spark/pull/28379 > writeType and writeObject in SparkR should be handled by S3 methods > --- > > Key: SPARK-28070 > URL: https://issues.apache.org/jira/browse/SPARK-28070 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Michael Chirico >Priority: Major > > Corollary of https://issues.apache.org/jira/browse/SPARK-28040 > The way writeType and writeObject are handled now feels a bit hack-ish; it > would be easier to manage with S3 or S4. > NB: S3 will require changing the order of arguments so that dispatch happens > on the object (currently the first argument is con) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28040) sql() fails to process output of glue::glue_data()
[ https://issues.apache.org/jira/browse/SPARK-28040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096042#comment-17096042 ] Apache Spark commented on SPARK-28040: -- User 'MichaelChirico' has created a pull request for this issue: https://github.com/apache/spark/pull/28379 > sql() fails to process output of glue::glue_data() > -- > > Key: SPARK-28040 > URL: https://issues.apache.org/jira/browse/SPARK-28040 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.3 >Reporter: Michael Chirico >Priority: Major > > The {{glue}} package is quite natural for sending parameterized queries to > Spark from R, very similar to Python's {{format}} for strings. The error is as > simple as > {code:R} > library(glue) > library(sparkR) > sparkR.session() > query = glue_data(list(val = 4), 'select {val}') > sql(query){code} > Error in writeType(con, serdeType) : > Unsupported type for serialization glue > {{sql(as.character(query))}} works as expected, but this is a bit awkward / > post-hoc -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31350) Coalesce bucketed tables for join if applicable
[ https://issues.apache.org/jira/browse/SPARK-31350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31350: Assignee: (was: Apache Spark) > Coalesce bucketed tables for join if applicable > --- > > Key: SPARK-31350 > URL: https://issues.apache.org/jira/browse/SPARK-31350 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Major > > The following example of joining two bucketed tables introduces a full > shuffle: > {code:java} > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0") > val df1 = (0 until 20).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", > "k") > val df2 = (0 until 20).map(i => (i % 7, i % 11, i.toString)).toDF("i", "j", > "k") > df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1") > df2.write.format("parquet").bucketBy(4, "i").saveAsTable("t2") > val t1 = spark.table("t1") > val t2 = spark.table("t2") > val joined = t1.join(t2, t1("i") === t2("i")) > joined.explain(true) > == Physical Plan == > *(5) SortMergeJoin [i#44], [i#50], Inner > :- *(2) Sort [i#44 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(i#44, 200), true, [id=#105] > : +- *(1) Project [i#44, j#45, k#46] > : +- *(1) Filter isnotnull(i#44) > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t1[i#44,j#45,k#46] Batched: true, > DataFilters: [isnotnull(i#44)], Format: Parquet, Location: > InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], > ReadSchema: struct, SelectedBucketsCount: 8 out of 8 > +- *(4) Sort [i#50 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(i#50, 200), true, [id=#115] > +- *(3) Project [i#50, j#51, k#52] > +- *(3) Filter isnotnull(i#50) > +- *(3) ColumnarToRow > +- FileScan parquet default.t2[i#50,j#51,k#52] Batched: true, > DataFilters: [isnotnull(i#50)], Format: Parquet, Location: > InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], > ReadSchema: struct, SelectedBucketsCount: 4 out of 4 > {code} > But one side can be coalesced to eliminate the shuffle. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31350) Coalesce bucketed tables for join if applicable
[ https://issues.apache.org/jira/browse/SPARK-31350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31350: Assignee: Apache Spark > Coalesce bucketed tables for join if applicable > --- > > Key: SPARK-31350 > URL: https://issues.apache.org/jira/browse/SPARK-31350 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Major > > The following example of joining two bucketed tables introduces a full > shuffle: > {code:java} > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0") > val df1 = (0 until 20).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", > "k") > val df2 = (0 until 20).map(i => (i % 7, i % 11, i.toString)).toDF("i", "j", > "k") > df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1") > df2.write.format("parquet").bucketBy(4, "i").saveAsTable("t2") > val t1 = spark.table("t1") > val t2 = spark.table("t2") > val joined = t1.join(t2, t1("i") === t2("i")) > joined.explain(true) > == Physical Plan == > *(5) SortMergeJoin [i#44], [i#50], Inner > :- *(2) Sort [i#44 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(i#44, 200), true, [id=#105] > : +- *(1) Project [i#44, j#45, k#46] > : +- *(1) Filter isnotnull(i#44) > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t1[i#44,j#45,k#46] Batched: true, > DataFilters: [isnotnull(i#44)], Format: Parquet, Location: > InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], > ReadSchema: struct, SelectedBucketsCount: 8 out of 8 > +- *(4) Sort [i#50 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(i#50, 200), true, [id=#115] > +- *(3) Project [i#50, j#51, k#52] > +- *(3) Filter isnotnull(i#50) > +- *(3) ColumnarToRow > +- FileScan parquet default.t2[i#50,j#51,k#52] Batched: true, > DataFilters: [isnotnull(i#50)], Format: Parquet, Location: > InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], > ReadSchema: struct, SelectedBucketsCount: 4 out of 4 > {code} > But one side can be coalesced to eliminate the shuffle. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
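The reason coalescing is safe when one bucket count divides the other comes down to modular arithmetic: with bucket = pmod(hash(key), n), going from 8 buckets to 4 only requires reading buckets b and b + 4 together, because pmod(h, 8) taken mod 4 equals pmod(h, 4). A self-contained check of that invariant (a sketch independent of Spark's actual rule; the sampled hash range is illustrative):
{code:scala}
object CoalesceInvariant extends App {
  // Positive modulus, matching the pmod semantics bucketing relies on.
  def pmod(h: Int, n: Int): Int = ((h % n) + n) % n

  // Every key that lands in bucket b of the 8-bucket table lands in
  // bucket b % 4 of a 4-bucket scheme, so merging buckets pairwise is safe.
  val holds = (-10000 to 10000).forall(h => pmod(h, 8) % 4 == pmod(h, 4))
  println(s"coalescing invariant holds on sampled hashes: $holds")
}
{code}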
[jira] [Commented] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096032#comment-17096032 ] Apache Spark commented on SPARK-31519: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/28397 > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.5, 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Blocker > Labels: correctness > Fix For: 2.4.6, 3.0.0 > > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+--+ > | b| fake| > +---+--+ > | 2|2020-01-01| > +---+--+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---++ > | b|fake| > +---++ > +---++ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggregate operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts the next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
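Until the fix lands, the ambiguity can be sidestepped in the report's own example: the HAVING predicate only touches the grouping column, so it can move into WHERE, where no alias shadowing is possible. A workaround sketch (assuming the same spark-shell session and temp view as above), not the fix itself:
{code:scala}
val rewritten = spark.sql("""
  select sum(a) as b, cast('2020-01-01' as date) as fake
  from t
  where b > 10    -- filter the grouping column before aggregation
  group by b
""")
rewritten.show()
// Expected: the single group with t.b = 20, i.e. b = sum(a) = 2, fake = 2020-01-01
{code}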
[jira] [Commented] (SPARK-30642) LinearSVC blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096031#comment-17096031 ] Apache Spark commented on SPARK-30642: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/28349 > LinearSVC blockify input vectors > > > Key: SPARK-30642 > URL: https://issues.apache.org/jira/browse/SPARK-30642 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30642) LinearSVC blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-30642: Assignee: zhengruifeng (was: Apache Spark) > LinearSVC blockify input vectors > > > Key: SPARK-30642 > URL: https://issues.apache.org/jira/browse/SPARK-30642 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30642) LinearSVC blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-30642: Assignee: Apache Spark (was: zhengruifeng) > LinearSVC blockify input vectors > > > Key: SPARK-30642 > URL: https://issues.apache.org/jira/browse/SPARK-30642 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30642) LinearSVC blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096028#comment-17096028 ] Apache Spark commented on SPARK-30642: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/28349 > LinearSVC blockify input vectors > > > Key: SPARK-30642 > URL: https://issues.apache.org/jira/browse/SPARK-30642 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l, -r)
[ https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096020#comment-17096020 ] Apache Spark commented on SPARK-31586: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/28402 > Replace expression TimeSub(l, r) with TimeAdd(l, -r) > --- > > Key: SPARK-31586 > URL: https://issues.apache.org/jira/browse/SPARK-31586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.1.0 > > > The implementation of TimeSub for the operation of timestamp subtracting > interval is almost a duplicate of TimeAdd. We can replace it with TimeAdd(l, > -r) since they are equivalent. > Suggestion from > https://github.com/apache/spark/pull/28310#discussion_r414259239 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l, -r)
[ https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096016#comment-17096016 ] Apache Spark commented on SPARK-31586: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/28402 > Replace expression TimeSub(l, r) with TimeAdd(l, -r) > --- > > Key: SPARK-31586 > URL: https://issues.apache.org/jira/browse/SPARK-31586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.1.0 > > > The implementation of TimeSub for the operation of timestamp subtracting > interval is almost a duplicate of TimeAdd. We can replace it with TimeAdd(l, > -r) since they are equivalent. > Suggestion from > https://github.com/apache/spark/pull/28310#discussion_r414259239 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
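The equivalence the ticket leans on can be checked outside Spark with java.time (a sketch for simple day-time intervals; calendar intervals with month fields negate the same way but do not map onto Duration):
{code:scala}
import java.time.{Duration, Instant}

object TimeSubEquivalence extends App {
  val ts = Instant.parse("2020-04-30T00:00:00Z")
  val iv = Duration.ofHours(26)
  // Subtracting an interval equals adding its negation, which is why
  // TimeSub(l, r) can be rewritten as TimeAdd(l, -r).
  assert(ts.minus(iv) == ts.plus(iv.negated()))
  println(ts.minus(iv)) // 2020-04-28T22:00:00Z
}
{code}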
[jira] [Commented] (SPARK-31549) Pyspark SparkContext.cancelJobGroup does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096009#comment-17096009 ] Apache Spark commented on SPARK-31549: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/28395 > Pyspark SparkContext.cancelJobGroup does not work correctly > - > > Key: SPARK-31549 > URL: https://issues.apache.org/jira/browse/SPARK-31549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Weichen Xu >Priority: Critical > > Pyspark SparkContext.cancelJobGroup does not work correctly. This issue has > existed for a long time, because the pyspark thread is not pinned to a jvm > thread when invoking java-side methods, which leads to all pyspark APIs that > rely on java thread-local variables not working correctly. (Including > `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` and so on.) > This is a serious issue. There is an experimental pyspark 'PIN_THREAD' mode > added in spark-3.0 which addresses it, but the 'PIN_THREAD' mode has two > issues: > * It is disabled by default. We need to set an additional environment > variable to enable it. > * There is a memory leak issue which hasn't been addressed. > A series of projects like hyperopt-spark and spark-joblib rely on the > `sc.cancelJobGroup` API (they use it to stop running jobs in their code). So > it is critical to address this issue, and we hope it works under the default > pyspark mode. An optional approach is implementing methods like > `rdd.setGroupAndCollect`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31549) Pyspark SparkContext.cancelJobGroup does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31549: Assignee: (was: Apache Spark) > Pyspark SparkContext.cancelJobGroup does not work correctly > - > > Key: SPARK-31549 > URL: https://issues.apache.org/jira/browse/SPARK-31549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Weichen Xu >Priority: Critical > > Pyspark SparkContext.cancelJobGroup does not work correctly. This issue has > existed for a long time, because the pyspark thread is not pinned to a jvm > thread when invoking java-side methods, which leads to all pyspark APIs that > rely on java thread-local variables not working correctly. (Including > `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` and so on.) > This is a serious issue. There is an experimental pyspark 'PIN_THREAD' mode > added in spark-3.0 which addresses it, but the 'PIN_THREAD' mode has two > issues: > * It is disabled by default. We need to set an additional environment > variable to enable it. > * There is a memory leak issue which hasn't been addressed. > A series of projects like hyperopt-spark and spark-joblib rely on the > `sc.cancelJobGroup` API (they use it to stop running jobs in their code). So > it is critical to address this issue, and we hope it works under the default > pyspark mode. An optional approach is implementing methods like > `rdd.setGroupAndCollect`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31549) Pyspark SparkContext.cancelJobGroup does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31549: Assignee: Apache Spark > Pyspark SparkContext.cancelJobGroup does not work correctly > - > > Key: SPARK-31549 > URL: https://issues.apache.org/jira/browse/SPARK-31549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Critical > > Pyspark SparkContext.cancelJobGroup does not work correctly. This issue has > existed for a long time, because the pyspark thread is not pinned to a jvm > thread when invoking java-side methods, which leads to all pyspark APIs that > rely on java thread-local variables not working correctly. (Including > `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` and so on.) > This is a serious issue. There is an experimental pyspark 'PIN_THREAD' mode > added in spark-3.0 which addresses it, but the 'PIN_THREAD' mode has two > issues: > * It is disabled by default. We need to set an additional environment > variable to enable it. > * There is a memory leak issue which hasn't been addressed. > A series of projects like hyperopt-spark and spark-joblib rely on the > `sc.cancelJobGroup` API (they use it to stop running jobs in their code). So > it is critical to address this issue, and we hope it works under the default > pyspark mode. An optional approach is implementing methods like > `rdd.setGroupAndCollect`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31549) Pyspark SparkContext.cancelJobGroup does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096008#comment-17096008 ] Apache Spark commented on SPARK-31549: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/28395 > Pyspark SparkContext.cancelJobGroup does not work correctly > - > > Key: SPARK-31549 > URL: https://issues.apache.org/jira/browse/SPARK-31549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Weichen Xu >Priority: Critical > > Pyspark SparkContext.cancelJobGroup does not work correctly. This issue has > existed for a long time, because the pyspark thread is not pinned to a jvm > thread when invoking java-side methods, which leads to all pyspark APIs that > rely on java thread-local variables not working correctly. (Including > `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` and so on.) > This is a serious issue. There is an experimental pyspark 'PIN_THREAD' mode > added in spark-3.0 which addresses it, but the 'PIN_THREAD' mode has two > issues: > * It is disabled by default. We need to set an additional environment > variable to enable it. > * There is a memory leak issue which hasn't been addressed. > A series of projects like hyperopt-spark and spark-joblib rely on the > `sc.cancelJobGroup` API (they use it to stop running jobs in their code). So > it is critical to address this issue, and we hope it works under the default > pyspark mode. An optional approach is implementing methods like > `rdd.setGroupAndCollect`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
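For reference, the Scala-side contract that the unpinned PySpark threads break: the job group is a thread-local property of the SparkContext, so setJobGroup must run on the very thread that submits the jobs, while cancellation can come from any thread. A usage sketch (assuming sc is a live SparkContext, e.g. in spark-shell; the group id is illustrative):
{code:scala}
import scala.concurrent.{ExecutionContext, Future}
implicit val ec: ExecutionContext = ExecutionContext.global

val job = Future {
  // Must be called on the submitting thread: the group id is thread-local.
  sc.setJobGroup("etl-batch-42", "demo job group", interruptOnCancel = true)
  sc.parallelize(1 to 100000000).count()
}

// Any other thread can cancel by group id, not by thread identity.
sc.cancelJobGroup("etl-batch-42")
{code}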
[jira] [Commented] (SPARK-30132) Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses
[ https://issues.apache.org/jira/browse/SPARK-30132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096007#comment-17096007 ] Dongjoon Hyun commented on SPARK-30132: --- Thank you, [~tisue]. > Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses > > > Key: SPARK-30132 > URL: https://issues.apache.org/jira/browse/SPARK-30132 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Priority: Minor > > A few classes in our test code extend Hadoop's LocalFileSystem. Scala 2.13 > returns a compile error here - not for the Spark code, but because the Hadoop > code (it says) illegally overrides appendFile() with slightly different > generic types in its return value. This code is valid Java, evidently, and > the code actually doesn't define any generic types, so I even wonder if it's > a scalac bug. > So far I don't see a workaround for this. > This only affects the Hadoop 3.2 build, since it comes up with respect to a > method that is new in Hadoop 3. (There is actually another instance of a similar > problem that affects Hadoop 2, but I can see a tiny hack workaround for it). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26199) Long expressions cause mutate to fail
[ https://issues.apache.org/jira/browse/SPARK-26199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26199. -- Resolution: Duplicate > Long expressions cause mutate to fail > - > > Key: SPARK-26199 > URL: https://issues.apache.org/jira/browse/SPARK-26199 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: João Rafael >Priority: Minor > > Calling {{mutate(df, field = expr)}} fails when expr is very long. > Example: > {code:R} > df <- mutate(df, field = ifelse( > lit(TRUE), > lit("A"), > ifelse( > lit(T), > lit("BB"), > lit("C") > ) > )) > {code} > Stack trace: > {code:R} > FATAL subscript out of bounds > at .handleSimpleError(function (obj) > { > level = sapply(class(obj), sw > at FUN(X[[i]], ...) > at lapply(seq_along(args), function(i) { > if (ns[[i]] != "") { > at lapply(seq_along(args), function(i) { > if (ns[[i]] != "") { > at mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T), lit("BBB > at #78: mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T > {code} > The root cause is in: > [DataFrame.R#L2182|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2182] > When the expression is long, {{deparse}} returns multiple lines, causing > {{args}} to have more elements than {{ns}}. The solution could be to set > {{nlines = 1}} or to collapse the lines together. > A simple workaround exists: first place the expression in a variable > and use it instead: > {code:R} > tmp <- ifelse( > lit(TRUE), > lit("A"), > ifelse( > lit(T), > lit("BB"), > lit("C") > ) > ) > df <- mutate(df, field = tmp) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26199) Long expressions cause mutate to fail
[ https://issues.apache.org/jira/browse/SPARK-26199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096006#comment-17096006 ] Hyukjin Kwon commented on SPARK-26199: -- Thanks, [~michaelchirico] > Long expressions cause mutate to fail > - > > Key: SPARK-26199 > URL: https://issues.apache.org/jira/browse/SPARK-26199 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: João Rafael >Priority: Minor > > Calling {{mutate(df, field = expr)}} fails when expr is very long. > Example: > {code:R} > df <- mutate(df, field = ifelse( > lit(TRUE), > lit("A"), > ifelse( > lit(T), > lit("BB"), > lit("C") > ) > )) > {code} > Stack trace: > {code:R} > FATAL subscript out of bounds > at .handleSimpleError(function (obj) > { > level = sapply(class(obj), sw > at FUN(X[[i]], ...) > at lapply(seq_along(args), function(i) { > if (ns[[i]] != "") { > at lapply(seq_along(args), function(i) { > if (ns[[i]] != "") { > at mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T), lit("BBB > at #78: mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T > {code} > The root cause is in: > [DataFrame.R#L2182|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2182] > When the expression is long, {{deparse}} returns multiple lines, causing > {{args}} to have more elements than {{ns}}. The solution could be to set > {{nlines = 1}} or to collapse the lines together. > A simple workaround exists: first place the expression in a variable > and use it instead: > {code:R} > tmp <- ifelse( > lit(TRUE), > lit("A"), > ifelse( > lit(T), > lit("BB"), > lit("C") > ) > ) > df <- mutate(df, field = tmp) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31610) Expose hashFunc property in HashingTF
Weichen Xu created SPARK-31610: -- Summary: Expose hashFunc property in HashingTF Key: SPARK-31610 URL: https://issues.apache.org/jira/browse/SPARK-31610 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: Weichen Xu Expose the hashFunc property in HashingTF. Some third-party libraries, such as mleap, need to access it. See the background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
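For context, the indirection third parties currently rely on: the term-to-bucket mapping is only reachable through helpers such as indexOf (available on ml.feature.HashingTF in 3.0, to my reading), while the hash function itself stays internal. A sketch of the status quo; exposing hashFunc would let libraries like mleap reuse the function directly:
{code:scala}
import org.apache.spark.ml.feature.HashingTF

val tf = new HashingTF().setNumFeatures(1 << 18)
// indexOf maps a term to its bucket, i.e. pmod(hash(term), numFeatures),
// using the internal hash function this ticket proposes to expose.
println(tf.indexOf("spark"))
{code}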
[jira] [Commented] (SPARK-31566) Add SQL Rest API Documentation
[ https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095948#comment-17095948 ] Eren Avsarogullari commented on SPARK-31566: Hi Pablo, There is an ongoing PR on this: https://github.com/apache/spark/pull/28354 Also, it will be updated in light of [https://github.com/apache/spark/pull/28208] Hope these help. Thanks > Add SQL Rest API Documentation > -- > > Key: SPARK-31566 > URL: https://issues.apache.org/jira/browse/SPARK-31566 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Priority: Major > > The SQL Rest API exposes query execution metrics as a public API. Its documentation > will be useful for end users. > {code:java} > /applications/[app-id]/sql > 1- A list of all queries for a given application. > 2- ?details=[true|false (default)] lists metric details in addition to > query details. > 3- ?offset=[offset]&length=[len] lists queries in the given range.{code} > {code:java} > /applications/[app-id]/sql/[execution-id] > 1- Details for the given query. > 2- ?details=[true|false (default)] lists metric details in addition to the given > query's details.{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
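The endpoints live under the standard monitoring API root, so once documented they can be exercised like any other REST resource. A sketch assuming Java 11's built-in HttpClient and an illustrative application id and history-server port:
{code:scala}
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val client = HttpClient.newHttpClient()
val request = HttpRequest.newBuilder(
  URI.create("http://localhost:18080/api/v1/applications/app-20200430-0001/sql?details=true")
).build()
// Returns a JSON array of query executions with their metrics.
val response = client.send(request, HttpResponse.BodyHandlers.ofString())
println(response.body())
{code}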
[jira] [Commented] (SPARK-30132) Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses
[ https://issues.apache.org/jira/browse/SPARK-30132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095936#comment-17095936 ] Seth Tisue commented on SPARK-30132: Scala 2.13.2 is now out. > Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses > > > Key: SPARK-30132 > URL: https://issues.apache.org/jira/browse/SPARK-30132 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Priority: Minor > > A few classes in our test code extend Hadoop's LocalFileSystem. Scala 2.13 > returns a compile error here - not for the Spark code, but because the Hadoop > code (it says) illegally overrides appendFile() with slightly different > generic types in its return value. This code is valid Java, evidently, and > the code actually doesn't define any generic types, so I even wonder if it's > a scalac bug. > So far I don't see a workaround for this. > This only affects the Hadoop 3.2 build, since it comes up with respect to a > method that is new in Hadoop 3. (There is actually another instance of a similar > problem that affects Hadoop 2, but I can see a tiny hack workaround for it). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31582) Be able to not populate Hadoop classpath
[ https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-31582: Fix Version/s: 3.0.0 > Be able to not populate Hadoop classpath > > > Key: SPARK-31582 > URL: https://issues.apache.org/jira/browse/SPARK-31582 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 2.4.5 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.6, 3.0.0 > > > The Spark Yarn client populates the hadoop classpath from > `yarn.application.classpath` and `mapreduce.application.classpath`. However, > for Spark with an embedded hadoop build, this results in jar conflicts because > the spark distribution can contain different versions of the hadoop jars. > We are adding a new Yarn configuration to not populate the hadoop classpath > from `yarn.application.classpath` and `mapreduce.application.classpath`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31582) Being able to not populate Hadoop classpath
[ https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-31582: Summary: Being able to not populate Hadoop classpath (was: Be able to not populate Hadoop classpath) > Being able to not populate Hadoop classpath > --- > > Key: SPARK-31582 > URL: https://issues.apache.org/jira/browse/SPARK-31582 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 2.4.5 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.6, 3.0.0 > > > The Spark Yarn client populates the hadoop classpath from > `yarn.application.classpath` and `mapreduce.application.classpath`. However, > for Spark with an embedded hadoop build, this results in jar conflicts because > the spark distribution can contain different versions of the hadoop jars. > We are adding a new Yarn configuration to not populate the hadoop classpath > from `yarn.application.classpath` and `mapreduce.application.classpath`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31582) Be able to not populate Hadoop classpath
[ https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-31582. - Fix Version/s: 2.4.6 Resolution: Fixed Issue resolved by pull request 28376 [https://github.com/apache/spark/pull/28376] > Be able to not populate Hadoop classpath > > > Key: SPARK-31582 > URL: https://issues.apache.org/jira/browse/SPARK-31582 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 2.4.5 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.6 > > > The Spark Yarn client populates the hadoop classpath from > `yarn.application.classpath` and `mapreduce.application.classpath`. However, > for Spark with an embedded hadoop build, this results in jar conflicts because > the spark distribution can contain different versions of the hadoop jars. > We are adding a new Yarn configuration to not populate the hadoop classpath > from `yarn.application.classpath` and `mapreduce.application.classpath`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31582) Be able to not populate Hadoop classpath
[ https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-31582: --- Assignee: DB Tsai > Be able to not populate Hadoop classpath > > > Key: SPARK-31582 > URL: https://issues.apache.org/jira/browse/SPARK-31582 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 2.4.5 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > > The Spark Yarn client populates the hadoop classpath from > `yarn.application.classpath` and `mapreduce.application.classpath`. However, > for Spark with an embedded hadoop build, this results in jar conflicts because > the spark distribution can contain different versions of the hadoop jars. > We are adding a new Yarn configuration to not populate the hadoop classpath > from `yarn.application.classpath` and `mapreduce.application.classpath`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
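A sketch of how the opt-out would be used. The configuration name below is taken from the linked pull request and should be treated as an assumption here:
{code:scala}
import org.apache.spark.SparkConf

// With population disabled, only the Hadoop jars bundled in the Spark
// distribution end up on the classpath, avoiding version conflicts.
val conf = new SparkConf()
  .set("spark.yarn.populateHadoopClasspath", "false") // assumed config name
{code}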
[jira] [Commented] (SPARK-31566) Add SQL Rest API Documentation
[ https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095881#comment-17095881 ] Pablo Langa Blanco commented on SPARK-31566: I'm taking a look at this. I think it is useful to add this documentation > Add SQL Rest API Documentation > -- > > Key: SPARK-31566 > URL: https://issues.apache.org/jira/browse/SPARK-31566 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Priority: Major > > The SQL Rest API exposes query execution metrics as a public API. Its documentation > will be useful for end users. > {code:java} > /applications/[app-id]/sql > 1- A list of all queries for a given application. > 2- ?details=[true|false (default)] lists metric details in addition to > query details. > 3- ?offset=[offset]&length=[len] lists queries in the given range.{code} > {code:java} > /applications/[app-id]/sql/[execution-id] > 1- Details for the given query. > 2- ?details=[true|false (default)] lists metric details in addition to the given > query's details.{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive
[ https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divya Paliwal resolved SPARK-31604. --- Resolution: Won't Do > java.lang.IllegalArgumentException: Frame length should be positive > --- > > Key: SPARK-31604 > URL: https://issues.apache.org/jira/browse/SPARK-31604 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.4 > Environment: Scala version 2.11.12 > Spark version 2.4.4 >Reporter: Divya Paliwal >Priority: Major > Fix For: 2.4.4 > > > Hi, > I am currently facing the below error when I run my code to stream data from > Couchbase in spark cluster. > 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in > connection from /[host]:56910 > java.lang.IllegalArgumentException: Frame length should be positive: > -9223371863711366549 > at > org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) > at > org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) > at > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > at java.lang.Thread.run(Thread.java:748) > > Whenever I run the spark-submit command with all the arguments on the host I > get the above error. > > Thanks, > Divya -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31608) Add a hybrid KVStore to make UI loading faster
[ https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-31608: Description: This is a follow-up for the work done by Hieu Huynh in 2019. Add a new class HybridKVStore to make the history server faster when loading event files. When writing to this kvstore, it will first write to an in-memory store and have a background thread that keeps pushing changes to levelDB. I ran some tests on 3.0.1 on mac os: ||kvstore type / log size||100m||200m||500m||1g||2g|| |HybridKVStore|5s to parse, 7s (including the parsing time) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb| |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse| was: Add a new class HybridKVStore to make the history server faster when loading event files. When writing to this kvstore, it will first write to an in-memory store and have a background thread that keeps pushing changes to levelDB. I ran some tests on 3.0.1 on mac os: ||kvstore type / log size||100m||200m||500m||1g||2g|| |HybridKVStore|5s to parse, 7s (including the parsing time) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb| |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse| > Add a hybrid KVStore to make UI loading faster > -- > > Key: SPARK-31608 > URL: https://issues.apache.org/jira/browse/SPARK-31608 > Project: Spark > Issue Type: Story > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Baohe Zhang >Priority: Major > > This is a follow-up for the work done by Hieu Huynh in 2019. > Add a new class HybridKVStore to make the history server faster when loading > event files. When writing to this kvstore, it will first write to an > in-memory store and have a background thread that keeps pushing changes > to levelDB. > I ran some tests on 3.0.1 on mac os: > ||kvstore type / log size||100m||200m||500m||1g||2g|| > |HybridKVStore|5s to parse, 7s (including the parsing time) to switch to > leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to > leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to > leveldb| > |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse| > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
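The write path described above can be sketched in a few lines. This is a toy illustration of the hybrid idea, not Spark's KVStore interface: writes land in memory first, so the UI can serve reads immediately, while a background task drains them to the slower durable store:
{code:scala}
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{ExecutionContext, Future}

class HybridStore[K, V](durable: (K, V) => Unit)(implicit ec: ExecutionContext) {
  private val memory = new ConcurrentHashMap[K, V]()

  def write(k: K, v: V): Unit = {
    memory.put(k, v)           // fast path: readable right away
    Future { durable(k, v) }   // slow path: persisted in the background
  }

  def read(k: K): Option[V] = Option(memory.get(k))
}
{code}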
[jira] [Commented] (SPARK-31609) Add VarianceThresholdSelector to PySpark
[ https://issues.apache.org/jira/browse/SPARK-31609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095749#comment-17095749 ] Huaxin Gao commented on SPARK-31609: https://github.com/apache/spark/pull/28409 > Add VarianceThresholdSelector to PySpark > > > Key: SPARK-31609 > URL: https://issues.apache.org/jira/browse/SPARK-31609 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Add VarianceThresholdSelector to PySpark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
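For reference, a construction sketch of the Scala-side estimator being ported (column names are illustrative): features whose sample variance does not exceed the threshold are removed.
{code:scala}
import org.apache.spark.ml.feature.VarianceThresholdSelector

val selector = new VarianceThresholdSelector()
  .setVarianceThreshold(8.0)
  .setFeaturesCol("features")
  .setOutputCol("selected")
// selector.fit(trainingDF) would yield a model that drops low-variance features.
{code}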
[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive
[ https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divya Paliwal updated SPARK-31604: -- Description: Hi, I am currently facing the below error when I run my code to stream data from Couchbase in spark cluster. 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in connection from /[host]:56910 java.lang.IllegalArgumentException: Frame length should be positive: -9223371863711366549 at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) at java.lang.Thread.run(Thread.java:748) Whenever I run the spark-submit command with all the arguments on the host I get the above error. Thanks, Divya was: Hi, I am currently facing the below error when I run my code to stream data from Couchbase in spark cluster. 
2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in connection from /[host]:56910 java.lang.IllegalArgumentException: Frame length should be positive: -9223371863711366549 at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) at java.lang.Thread.run(Thread.java:748) Whenever I run the spark-submit command with all the arguments on the host I get the above error. Command run: bin/spark-submit \ --deploy-mode cluster \ --class "com.CouchbaseRawMain" \ --master [spark://master-host:8091] \ --jars
[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive
[ https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divya Paliwal updated SPARK-31604: -- Description: Hi, I am currently facing the below error when I run my code to stream data from Couchbase in spark cluster. 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in connection from /[host]:56910 java.lang.IllegalArgumentException: Frame length should be positive: -9223371863711366549 at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) at java.lang.Thread.run(Thread.java:748) Whenever I run the spark-submit command with all the arguments on the host I get the above error. 
Command run: bin/spark-submit \ --deploy-mode cluster \ --class "com.CouchbaseRawMain" \ --master [spark://master-host:8091] \ --jars /ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/core-io-1.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/java-client-2.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxjava-1.3.8.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/couchbase-spark-connector_2.11-2.4.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/dcp-client-0.23.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/opentracing-api-0.31.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxscala_2.11-0.27.0.jar \ --conf spark.rpc.message.maxSize=1000 —-conf spark.shuffle.service.enabled=true —-conf spark.network.crypto.saslFallback=true —-conf spark.authenticate=true —-conf spark.network.crypto.enabled=true --driver-memory 3g --driver-cores 2 --num-executors 1 --executor-memory 3g --total-executor-cores 2 --executor-cores 2 \ /ngs/app/dcf5d/HD-cluster/spark/code/generic-couchbase-raw-0.1.0-SNAPSHOT.jar [spark://master-host:8091] Couchbase-DC-Bucket-Raw-QA-OG-28 host1:8091,host2:8091 DC [hdfs://master-host:9000/tables/couchbase/QA/DC-bucket/] Administrator dcadmin welcome true Thanks, Divya was: Hi, I am currently facing the below error when I run my code to stream data from Couchbase in spark cluster. 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in connection from /[host]:56910 java.lang.IllegalArgumentException: Frame length should be positive: -9223371863711366549 at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at
[jira] [Created] (SPARK-31609) Add VarianceThresholdSelector to PySpark
Huaxin Gao created SPARK-31609: -- Summary: Add VarianceThresholdSelector to PySpark Key: SPARK-31609 URL: https://issues.apache.org/jira/browse/SPARK-31609 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 3.1.0 Reporter: Huaxin Gao Add VarianceThresholdSelector to PySpark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive
[ https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divya Paliwal updated SPARK-31604: -- Description: Hi, I am currently facing the below error when I run my code to stream data from Couchbase in spark cluster. 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in connection from /[host]:56910 java.lang.IllegalArgumentException: Frame length should be positive: -9223371863711366549 at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) at java.lang.Thread.run(Thread.java:748) Whenever I run the spark-submit command with all the arguments on the host I get the above error. 
Command run: bin/spark-submit \ --deploy-mode cluster \ --class "com.pst.dc.CouchbaseRawMain" \ --master [spark://master-host:8091] \ --jars /ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/core-io-1.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/java-client-2.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxjava-1.3.8.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/couchbase-spark-connector_2.11-2.4.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/dcp-client-0.23.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/opentracing-api-0.31.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxscala_2.11-0.27.0.jar \ --conf spark.rpc.message.maxSize=1000 —-conf spark.shuffle.service.enabled=true —-conf spark.network.crypto.saslFallback=true —-conf spark.authenticate=true —-conf spark.network.crypto.enabled=true --driver-memory 3g --driver-cores 2 --num-executors 1 --executor-memory 3g --total-executor-cores 2 --executor-cores 2 \ /ngs/app/dcf5d/HD-cluster/spark/code/generic-couchbase-raw-0.1.0-SNAPSHOT.jar [spark://master-host:8091] Couchbase-DC-Bucket-Raw-QA-OG-28 host1:8091,host2:8091 DC [hdfs://master-host:9000/tables/couchbase/QA/DC-bucket/] Administrator dcadmin welcome true Thanks, Divya was: Hi, I am currently facing the below error when I run my code to stream data from Couchbase in spark cluster. 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in connection from /[host]:56910 java.lang.IllegalArgumentException: Frame length should be positive: -9223371863711366549 at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at
[jira] [Created] (SPARK-31608) Add a hybrid KVStore to make UI loading faster
Baohe Zhang created SPARK-31608: --- Summary: Add a hybrid KVStore to make UI loading faster Key: SPARK-31608 URL: https://issues.apache.org/jira/browse/SPARK-31608 Project: Spark Issue Type: Story Components: Web UI Affects Versions: 3.0.1 Reporter: Baohe Zhang Add a new class HybridKVStore to make the history server faster when loading event files. When writing to this KVStore, it first writes to an in-memory store and has a background thread that keeps pushing the changes to LevelDB. I ran some tests on 3.0.1 on macOS: ||kvstore type / log size||100m||200m||500m||1g||2g|| |HybridKVStore|5s to parse, 7s (including the parsing time) to switch to LevelDB|6s to parse, 10s to switch to LevelDB|15s to parse, 23s to switch to LevelDB|23s to parse, 40s to switch to LevelDB|37s to parse, 73s to switch to LevelDB| |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
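As a rough illustration of the design described above (not the actual Spark implementation), the hybrid store can be sketched as a wrapper that serves writes from memory and drains them to the disk store on a background thread; the SimpleStore trait below is a hypothetical stand-in for Spark's KVStore interface.
{code:java}
import java.util.concurrent.Executors

// Hypothetical stand-in for Spark's KVStore interface.
trait SimpleStore {
  def write(key: String, value: Array[Byte]): Unit
  def read(key: String): Option[Array[Byte]]
}

// Writes land in memory first so event-log parsing is never blocked on
// LevelDB; a single background thread replays each write into the disk store.
class HybridStore(memory: SimpleStore, disk: SimpleStore) extends SimpleStore {
  private val flusher = Executors.newSingleThreadExecutor()

  override def write(key: String, value: Array[Byte]): Unit = {
    memory.write(key, value)
    flusher.execute(new Runnable {
      override def run(): Unit = disk.write(key, value)
    })
  }

  // Reads prefer the fresh in-memory copy and fall back to LevelDB.
  override def read(key: String): Option[Array[Byte]] =
    memory.read(key).orElse(disk.read(key))

  // "Switch to LevelDB" in the table above corresponds to letting this
  // queue drain and then serving everything from the disk store.
  def close(): Unit = flusher.shutdown()
}
{code}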
[jira] [Updated] (SPARK-31560) Add V1/V2 tests for TextSuite and WholeTextFileSuite
[ https://issues.apache.org/jira/browse/SPARK-31560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31560: -- Affects Version/s: (was: 3.0.1) > Add V1/V2 tests for TextSuite and WholeTextFileSuite > > > Key: SPARK-31560 > URL: https://issues.apache.org/jira/browse/SPARK-31560 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction
[ https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095697#comment-17095697 ] Felix Kizhakkel Jose commented on SPARK-31599: -- Thank you [~gsomogyi]. But this is not an S3 issue. The issue is that I compacted files in the bucket and deleted the non-compacted files, but didn't update/modify the "_spark_metadata" folder. I could see that those write-ahead log JSON files contain the deleted file names. When I use Spark SQL to read the data, it first reads the write-ahead logs from "_spark_metadata" and then tries to read the files listed in them. So I am wondering how we can update the "_spark_metadata" content (the write-ahead logs)? > Reading from S3 (Structured Streaming Bucket) Fails after Compaction > > > Key: SPARK-31599 > URL: https://issues.apache.org/jira/browse/SPARK-31599 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.4.5 >Reporter: Felix Kizhakkel Jose >Priority: Major > > I have an S3 bucket which has data streamed (Parquet format) to it by the Spark > Structured Streaming framework from Kafka. Periodically I try to run > compaction on this bucket (a separate Spark job), and on successful > compaction delete the non-compacted (Parquet) files. After that I get the > following error on Spark jobs which read from that bucket: > *Caused by: java.io.FileNotFoundException: No such file or directory: > s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet* > How do we run compaction on Structured Streaming S3 buckets? Also I need > to delete the un-compacted files after successful compaction to save space. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
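For anyone debugging the same situation, a small spark-shell sketch that only inspects (does not rewrite) the sink's metadata log; the s3a path is the one from the report and would need adjusting, and the snippet assumes spark is in scope.
{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Output directory of the file sink, taken from the error message above.
val metadataDir = new Path("s3a://spark-kafka-poc/intermediate/_spark_metadata")
val fs = FileSystem.get(metadataDir.toUri, spark.sparkContext.hadoopConfiguration)

// Each batch file holds a version header followed by one JSON entry per line;
// the "path" of every listed file must still exist, which is why deleting
// compacted-away files without rewriting this log breaks subsequent reads.
fs.listStatus(metadataDir).sortBy(_.getPath.getName).foreach { status =>
  println(s"=== ${status.getPath.getName} ===")
  val in = fs.open(status.getPath)
  try Source.fromInputStream(in).getLines().foreach(println)
  finally in.close()
}
{code}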
[jira] [Commented] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095595#comment-17095595 ] Dongjoon Hyun commented on SPARK-31519: --- For the record, from 2.2.3 to 2.4.5, the following queries reproduce the wrong result (2.0.2 and 2.1.3 have no problem with them): {code} spark-sql> SELECT SUM(a) AS b, hour('2020-01-01 12:12:12') AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10; Time taken: 3.249 seconds spark-sql> SELECT SUM(a) AS b, '2020-01-01 12:12:12' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10; 2 2020-01-01 12:12:12 Time taken: 0.505 seconds, Fetched 1 row(s) {code} > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.5, 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Blocker > Labels: correctness > Fix For: 2.4.6, 3.0.0 > > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----------+ > | b| fake| > +---+----------+ > | 2|2020-01-01| > +---+----------+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----+ > | b|fake| > +---+----+ > +---+----+ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggregate operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts the next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
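Until an upgrade to a fixed version (2.4.6 or 3.0.0) is possible, one workaround sketch, reusing the temp view t from the description: alias the grouping column in a subquery so the outer filter resolves against ordinary output columns instead of the fragile HAVING path. This is an editor-suggested rewrite, not taken from the ticket.
{code:java}
// grp carries the grouping column out of the aggregate; the outer WHERE
// then filters on it without touching the HAVING resolution rules.
val rewritten = """
  select s as b, fake from (
    select sum(a) as s, b as grp, cast('2020-01-01' as date) as fake
    from t
    group by b
  ) where grp > 10"""
spark.sql(rewritten).show()
// Expected: a single row with b = 2 and fake = 2020-01-01.
{code}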
[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31519: -- Affects Version/s: 2.2.3 > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.5, 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Blocker > Labels: correctness > Fix For: 2.4.6, 3.0.0 > > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----------+ > | b| fake| > +---+----------+ > | 2|2020-01-01| > +---+----------+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----+ > | b|fake| > +---+----+ > +---+----+ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggregate operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts the next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31519: -- Priority: Blocker (was: Major) > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Blocker > Labels: correctness > Fix For: 2.4.6, 3.0.0 > > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----------+ > | b| fake| > +---+----------+ > | 2|2020-01-01| > +---+----------+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----+ > | b|fake| > +---+----+ > +---+----+ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggregate operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts the next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31519: -- Affects Version/s: 2.3.4 > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Labels: correctness > Fix For: 2.4.6, 3.0.0 > > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----------+ > | b| fake| > +---+----------+ > | 2|2020-01-01| > +---+----------+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----+ > | b|fake| > +---+----+ > +---+----+ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggregate operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts the next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31607) fix perf regression in CTESubstitution
[ https://issues.apache.org/jira/browse/SPARK-31607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31607: Affects Version/s: (was: 3.0.0) 3.1.0 > fix perf regression in CTESubstitution > -- > > Key: SPARK-31607 > URL: https://issues.apache.org/jira/browse/SPARK-31607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31607) fix perf regression in CTESubstitution
[ https://issues.apache.org/jira/browse/SPARK-31607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31607: Issue Type: Improvement (was: Bug) > fix perf regression in CTESubstitution > -- > > Key: SPARK-31607 > URL: https://issues.apache.org/jira/browse/SPARK-31607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31607) Improve the perf of CTESubstitution
[ https://issues.apache.org/jira/browse/SPARK-31607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31607: Summary: Improve the perf of CTESubstitution (was: fix perf regression in CTESubstitution) > Improve the perf of CTESubstitution > --- > > Key: SPARK-31607 > URL: https://issues.apache.org/jira/browse/SPARK-31607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095594#comment-17095594 ] Dongjoon Hyun commented on SPARK-31519: --- This is backported to branch-2.4 via https://github.com/apache/spark/pull/28397 > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Labels: correctness > Fix For: 2.4.6, 3.0.0 > > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----------+ > | b| fake| > +---+----------+ > | 2|2020-01-01| > +---+----------+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----+ > | b|fake| > +---+----+ > +---+----+ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggregate operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts the next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31519: -- Fix Version/s: 2.4.6 > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Labels: correctness > Fix For: 2.4.6, 3.0.0 > > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----------+ > | b| fake| > +---+----------+ > | 2|2020-01-01| > +---+----------+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----+ > | b|fake| > +---+----+ > +---+----+ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggregate operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts the next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31519: -- Affects Version/s: 2.4.5 > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Labels: correctness > Fix For: 2.4.6, 3.0.0 > > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----------+ > | b| fake| > +---+----------+ > | 2|2020-01-01| > +---+----------+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----+ > | b|fake| > +---+----+ > +---+----+ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggregate operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts the next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive
[ https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divya Paliwal updated SPARK-31604: -- Description: Hi, I am currently facing the below error when I run my code to stream data from Couchbase in a Spark cluster. 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in connection from /[host]:56910 java.lang.IllegalArgumentException: Frame length should be positive: -9223371863711366549 at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) at java.lang.Thread.run(Thread.java:748) Whenever I run the spark-submit command with all the arguments on the host I get the above error.
Command run: bin/spark-submit \ --deploy-mode cluster \ --class "com.apple.pst.dc.CouchbaseRawMain" \ --master [spark://master-host:8091] \ --jars /ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/core-io-1.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/java-client-2.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxjava-1.3.8.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/couchbase-spark-connector_2.11-2.4.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/dcp-client-0.23.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/opentracing-api-0.31.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxscala_2.11-0.27.0.jar \ --conf spark.rpc.message.maxSize=1000 --conf spark.shuffle.service.enabled=true --conf spark.network.crypto.saslFallback=true --conf spark.authenticate=true --conf spark.network.crypto.enabled=true --driver-memory 3g --driver-cores 2 --num-executors 1 --executor-memory 3g --total-executor-cores 2 --executor-cores 2 \ /ngs/app/dcf5d/HD-cluster/spark/code/generic-couchbase-raw-0.1.0-SNAPSHOT.jar [spark://master-host:8091] Couchbase-DC-Bucket-Raw-QA-OG-28 host1:8091,host2:8091 DC [hdfs://master-host:9000/tables/couchbase/QA/DC-bucket/] Administrator dcadmin welcome true Thanks, Divya was: Hi, I am currently facing the below error when I run my code to stream data from Couchbase in a Spark cluster. 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in connection from /[host]:56910 java.lang.IllegalArgumentException: Frame length should be positive: -9223371863711366549 at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at
[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive
[ https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divya Paliwal updated SPARK-31604: -- Description: Hi, I am currently facing the below error when I run my code to stream data from Couchbase in a Spark cluster. 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in connection from /[host]:56910 java.lang.IllegalArgumentException: Frame length should be positive: -9223371863711366549 at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) at java.lang.Thread.run(Thread.java:748) Whenever I run the spark-submit command with all the arguments on the host I get the above error.
Command run: bin/spark-submit \ --deploy-mode cluster \ --class "com.apple.pst.dc.CouchbaseRawMain" \ --master [spark://rn-dcf5t-lapp85.rno.apple.com:8091] \ --jars /ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/core-io-1.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/java-client-2.7.6.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxjava-1.3.8.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/couchbase-spark-connector_2.11-2.4.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/dcp-client-0.23.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/opentracing-api-0.31.0.jar,/ngs/app/dcf5d/HD-cluster/spark/jars/spark-couchbase-scala_2.11/rxscala_2.11-0.27.0.jar \ --conf spark.rpc.message.maxSize=1000 --conf spark.shuffle.service.enabled=true --conf spark.network.crypto.saslFallback=true --conf spark.authenticate=true --conf spark.network.crypto.enabled=true --driver-memory 3g --driver-cores 2 --num-executors 1 --executor-memory 3g --total-executor-cores 2 --executor-cores 2 \ /ngs/app/dcf5d/HD-cluster/spark/code/generic-couchbase-raw-0.1.0-SNAPSHOT.jar [spark://rn-dcf5t-lapp85.rno.apple.com:8091] Couchbase-DC-Bucket-Raw-QA-OG-28 rn-dcf5t-lapp87.rno.apple.com:8091,rn-dcf5t-lapp88.rno.apple.com:8091 DC [hdfs://rn-dcf5t-lapp85.rno.apple.com:9000/tables/couchbase/QA/DC-bucket/] Administrator dcadmin welcome true Thanks, Divya was: I am currently facing the below error when I run my code to stream data from Couchbase in the Spark master cluster. 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in connection from /[host]:56910 java.lang.IllegalArgumentException: Frame length should be positive: -9223371863711366549 at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at
[jira] [Updated] (SPARK-31604) java.lang.IllegalArgumentException: Frame length should be positive
[ https://issues.apache.org/jira/browse/SPARK-31604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divya Paliwal updated SPARK-31604: -- Environment: Scala version 2.11.12 Spark version 2.4.4 > java.lang.IllegalArgumentException: Frame length should be positive > --- > > Key: SPARK-31604 > URL: https://issues.apache.org/jira/browse/SPARK-31604 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.4 > Environment: Scala version 2.11.12 > Spark version 2.4.4 >Reporter: Divya Paliwal >Priority: Major > Fix For: 2.4.4 > > > I am currently facing the below error when I run my code to stream data from > Couchbase in the Spark master cluster. > > 2020-04-29 00:04:06,061 WARN server.TransportChannelHandler: Exception in > connection from /[host]:56910 > java.lang.IllegalArgumentException: Frame length should be positive: > -9223371863711366549 > at > org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119) > at > org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) > at > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > at java.lang.Thread.run(Thread.java:748) > > Whenever I run the spark-submit command with all the arguments on the host I > get the above error. > > Thanks, > Divya -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31607) fix perf regression in CTESubstitution
Wenchen Fan created SPARK-31607: --- Summary: fix perf regression in CTESubstitution Key: SPARK-31607 URL: https://issues.apache.org/jira/browse/SPARK-31607 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31606) reduce the perf regression of vectorized parquet reader caused by datetime rebase
Wenchen Fan created SPARK-31606: --- Summary: reduce the perf regression of vectorized parquet reader caused by datetime rebase Key: SPARK-31606 URL: https://issues.apache.org/jira/browse/SPARK-31606 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31605) Unable to insert data with partial dynamic partition with Spark & Hive 3
[ https://issues.apache.org/jira/browse/SPARK-31605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Ashish updated SPARK-31605: Description: When inserting data with dynamic partitions, the operation fails if not all partitions are dynamic. For example: {code:sql} create external table test_insert(a int) partitioned by (part_a string, part_b string) stored as parquet location ''; {code} The query {code:sql} insert into table test_insert partition(part_a='a', part_b) values (3, 'b'); {code} fails with errors {code:xml} Cannot create partition spec from hdfs:/// ; missing keys [part_a] Ignoring invalid DP directory {code} On the other hand, if I remove the static value of part_a to make the insert fully dynamic, the following query will succeed. Please note that the query below is not the issue; the issue is the one above, where the query throws the invalid DP directory warning. {code:sql} insert into table test_insert partition(part_a, part_b) values (1,'a','b'); {code} was: When inserting data with dynamic partitions, the operation fails if not all partitions are dynamic. For example: The query {code:sql} insert into table test_insert partition(part_a='a', part_b) values (3, 'b'); {code} fails with errors {code:xml} Cannot create partition spec from hdfs:/// ; missing keys [part_a] Ignoring invalid DP directory {code} On the other hand, if I remove the static value of part_a to make the insert fully dynamic, the following query will succeed. {code:sql} insert overwrite table t1 (part_a, part_b) select * from t2 {code} > Unable to insert data with partial dynamic partition with Spark & Hive 3 > > > Key: SPARK-31605 > URL: https://issues.apache.org/jira/browse/SPARK-31605 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: Hortonworks HDP 3.1.0 > Spark 2.3.2 > Hive 3 >Reporter: Amit Ashish >Priority: Major > > When inserting data with dynamic partitions, the operation fails if > not all partitions are dynamic. For example: > > {code:sql} > create external table test_insert(a int) partitioned by (part_a string, > part_b string) stored as parquet location ''; > > {code} > The query > {code:sql} > insert into table test_insert partition(part_a='a', part_b) values (3, 'b'); > {code} > fails with errors > {code:xml} > Cannot create partition spec from hdfs:/// ; missing keys [part_a] > Ignoring invalid DP directory > {code} > > > > On the other hand, if I remove the static value of part_a to make the insert > fully dynamic, the following query will succeed. Please note that the query > below is not the issue; the issue is the one above, where the query throws > the invalid DP directory warning. > {code:sql} > insert into table test_insert partition(part_a, part_b) values (1,'a','b'); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
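As a stopgap while the static-plus-dynamic form fails in this setup, the fully dynamic form the reporter confirms working can carry the would-be static value as ordinary row data; a sketch, assuming a Hive-enabled session and the test_insert table above (whether the two settings are required depends on cluster defaults).
{code:java}
// Allow fully dynamic partitioning for this session.
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

// Both partition columns are dynamic; 'a' is supplied as row data rather
// than as a static partition spec.
spark.sql(
  """insert into table test_insert partition(part_a, part_b)
    |values (3, 'a', 'b')""".stripMargin)
{code}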
[jira] [Commented] (SPARK-31605) Unable to insert data with partial dynamic partition with Spark & Hive 3
[ https://issues.apache.org/jira/browse/SPARK-31605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095462#comment-17095462 ] Amit Ashish commented on SPARK-31605: - The previously closed ticket does not show the actual insert statement working. Below is the query that is not working: insert into table test_insert partition(part_a='a', part_b) values (3, 'b'); I get the below error: WARN FileOperations: Ignoring invalid DP directory hdfs://HDP3/warehouse/tablespace/external/hive/dw_analyst.db/test_insert/.hive-staging_hive_2020-04-29_13-28-46_360_4646016571504464856-1/-ext-1/part_b=b 20/04/29 13:28:52 INFO Hive: Loaded 0 partitions As mentioned in the previous ticket, setting the following does not make any difference: set hive.exec.dynamic.partition.mode=nonstrict; Nor does setting spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict as a Spark config solve this. > Unable to insert data with partial dynamic partition with Spark & Hive 3 > > > Key: SPARK-31605 > URL: https://issues.apache.org/jira/browse/SPARK-31605 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: Hortonworks HDP 3.1.0 > Spark 2.3.2 > Hive 3 >Reporter: Amit Ashish >Priority: Major > > When inserting data with dynamic partitions, the operation fails if > not all partitions are dynamic. For example: > The query > {code:sql} > insert overwrite table t1 (part_a='a', part_b) select * from t2 > {code} > fails with errors > {code:xml} > Cannot create partition spec from hdfs:/// ; missing keys [part_a] > Ignoring invalid DP directory > {code} > On the other hand, if I remove the static value of part_a to make the insert > fully dynamic, the following query will succeed. > {code:sql} > insert overwrite table t1 (part_a, part_b) select * from t2 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31605) Unable to insert data with partial dynamic partition with Spark & Hive 3
[ https://issues.apache.org/jira/browse/SPARK-31605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Ashish updated SPARK-31605: Description: When inserting data with dynamic partitions, the operation fails if not all partitions are dynamic. For example: The query {code:sql} insert into table test_insert partition(part_a='a', part_b) values (3, 'b'); {code} fails with errors {code:xml} Cannot create partition spec from hdfs:/// ; missing keys [part_a] Ignoring invalid DP directory {code} On the other hand, if I remove the static value of part_a to make the insert fully dynamic, the following query will succeed. {code:sql} insert overwrite table t1 (part_a, part_b) select * from t2 {code} was: When inserting data with dynamic partitions, the operation fails if not all partitions are dynamic. For example: The query {code:sql} insert overwrite table t1 (part_a='a', part_b) select * from t2 {code} fails with errors {code:xml} Cannot create partition spec from hdfs:/// ; missing keys [part_a] Ignoring invalid DP directory {code} On the other hand, if I remove the static value of part_a to make the insert fully dynamic, the following query will succeed. {code:sql} insert overwrite table t1 (part_a, part_b) select * from t2 {code} > Unable to insert data with partial dynamic partition with Spark & Hive 3 > > > Key: SPARK-31605 > URL: https://issues.apache.org/jira/browse/SPARK-31605 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: Hortonworks HDP 3.1.0 > Spark 2.3.2 > Hive 3 >Reporter: Amit Ashish >Priority: Major > > When inserting data with dynamic partitions, the operation fails if > not all partitions are dynamic. For example: > The query > {code:sql} > insert into table test_insert partition(part_a='a', part_b) values (3, 'b'); > {code} > fails with errors > {code:xml} > Cannot create partition spec from hdfs:/// ; missing keys [part_a] > Ignoring invalid DP directory > {code} > On the other hand, if I remove the static value of part_a to make the insert > fully dynamic, the following query will succeed. > {code:sql} > insert overwrite table t1 (part_a, part_b) select * from t2 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31605) Unable to insert data with partial dynamic partition with Spark & Hive 3
[ https://issues.apache.org/jira/browse/SPARK-31605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095462#comment-17095462 ] Amit Ashish edited comment on SPARK-31605 at 4/29/20, 1:42 PM: --- The previously closed ticket does not show the actual insert statement working. Below is the query that is not working: insert into table test_insert partition(part_a='a', part_b) values (3, 'b'); I get the below warning: WARN FileOperations: Ignoring invalid DP directory hdfs://HDP3/warehouse/tablespace/external/hive/dw_analyst.db/test_insert/.hive-staging_hive_2020-04-29_13-28-46_360_4646016571504464856-1/-ext-1/part_b=b 20/04/29 13:28:52 INFO Hive: Loaded 0 partitions As mentioned in the previous ticket, setting the following does not make any difference: set hive.exec.dynamic.partition.mode=nonstrict; Nor does setting spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict as a Spark config solve this. The worst part is that the data does not get inserted and the return code is still 0. Kindly either suggest a fix for this or return a non-zero exit code so this can be tracked in automated data pipelines. was (Author: dreamaaj): The previously closed ticket does not show the actual insert statement working. Below is the query that is not working: insert into table test_insert partition(part_a='a', part_b) values (3, 'b'); I get the below error: WARN FileOperations: Ignoring invalid DP directory hdfs://HDP3/warehouse/tablespace/external/hive/dw_analyst.db/test_insert/.hive-staging_hive_2020-04-29_13-28-46_360_4646016571504464856-1/-ext-1/part_b=b 20/04/29 13:28:52 INFO Hive: Loaded 0 partitions As mentioned in the previous ticket, setting the following does not make any difference: set hive.exec.dynamic.partition.mode=nonstrict; Nor does setting spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict as a Spark config solve this. > Unable to insert data with partial dynamic partition with Spark & Hive 3 > > > Key: SPARK-31605 > URL: https://issues.apache.org/jira/browse/SPARK-31605 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: Hortonworks HDP 3.1.0 > Spark 2.3.2 > Hive 3 >Reporter: Amit Ashish >Priority: Major > > When inserting data with dynamic partitions, the operation fails if > not all partitions are dynamic. For example: > The query > {code:sql} > insert overwrite table t1 (part_a='a', part_b) select * from t2 > {code} > fails with errors > {code:xml} > Cannot create partition spec from hdfs:/// ; missing keys [part_a] > Ignoring invalid DP directory > {code} > On the other hand, if I remove the static value of part_a to make the insert > fully dynamic, the following query will succeed. > {code:sql} > insert overwrite table t1 (part_a, part_b) select * from t2 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31605) Unable to insert data with partial dynamic partition with Spark & Hive 3
Amit Ashish created SPARK-31605: --- Summary: Unable to insert data with partial dynamic partition with Spark & Hive 3 Key: SPARK-31605 URL: https://issues.apache.org/jira/browse/SPARK-31605 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2 Environment: Hortonworks HDP 3.1.0 Spark 2.3.2 Hive 3 Reporter: Amit Ashish When inserting data with dynamic partitions, the operation fails if not all partitions are dynamic. For example: The query {code:sql} insert overwrite table t1 (part_a='a', part_b) select * from t2 {code} fails with errors {code:xml} Cannot create partition spec from hdfs:/// ; missing keys [part_a] Ignoring invalid DP directory {code} On the other hand, if I remove the static value of part_a to make the insert fully dynamic, the following query will succeed. {code:sql} insert overwrite table t1 (part_a, part_b) select * from t2 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org