[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843620#comment-17843620 ] Sandeep Katta commented on SPARK-35531: --- Bug is tracked here https://issues.apache.org/jira/browse/SPARK-48140 > Can not insert into hive bucket table if create table with upper case schema > > > Key: SPARK-35531 > URL: https://issues.apache.org/jira/browse/SPARK-35531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.1, 3.2.0 >Reporter: Hongyi Zhang >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0, 3.1.4 > > > > > create table TEST1( > V1 BIGINT, > S1 INT) > partitioned by (PK BIGINT) > clustered by (V1) > sorted by (S1) > into 200 buckets > STORED AS PARQUET; > > insert into test1 > select > * from values(1,1,1); > > > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48140) Can not alter bucketed table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-48140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-48140: -- Description: Running below SQL command throws exception CREATE TABLE TEST1( V1 BIGINT, S1 INT) PARTITIONED BY (PK BIGINT) CLUSTERED BY (V1) SORTED BY (S1) INTO 200 BUCKETS STORED AS PARQUET; ALTER TABLE test1 SET TBLPROPERTIES ('comment' = 'This is a new comment.'); *Exception:* {code:java} Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), FieldSchema(name:s1, type:int, comment:null)] at org.apache.hadoop.hive.ql.metadata.Table.setBucketCols(Table.java:552) at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1145) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$alterTable$1(HiveClientImpl.scala:594) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:587) at org.apache.spark.sql.hive.client.HiveClient.alterTable(HiveClient.scala:124) at org.apache.spark.sql.hive.client.HiveClient.alterTable$(HiveClient.scala:123) at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:93) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$alterTable$1(HiveExternalCatalog.scala:687) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) ... 
62 more {code} was: Running below SQL command throws exception CREATE TABLE TEST1( V1 BIGINT, S1 INT) PARTITIONED BY (PK BIGINT) CLUSTERED BY (V1) SORTED BY (S1) INTO 200 BUCKETS STORED AS PARQUET; ALTER TABLE test1 SET TBLPROPERTIES ('comment' = 'This is a new comment.');
[jira] [Created] (SPARK-48140) Can not alter bucketed table if create table with upper case schema
Sandeep Katta created SPARK-48140: - Summary: Can not alter bucketed table if create table with upper case schema Key: SPARK-48140 URL: https://issues.apache.org/jira/browse/SPARK-48140 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Sandeep Katta Running below SQL command throws exception CREATE TABLE TEST1( V1 BIGINT, S1 INT) PARTITIONED BY (PK BIGINT) CLUSTERED BY (V1) SORTED BY (S1) INTO 200 BUCKETS STORED AS PARQUET; ALTER TABLE test1 SET TBLPROPERTIES ('comment' = 'This is a new comment.');
[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843619#comment-17843619 ] Sandeep Katta commented on SPARK-35531: --- [~angerszhuuu] , I do see same issue in alter table command, I tested in SPARK-3.5.0 and issue still exists {code:java} CREATE TABLE TEST1( V1 BIGINT, S1 INT) PARTITIONED BY (PK BIGINT) CLUSTERED BY (V1) SORTED BY (S1) INTO 200 BUCKETS STORED AS PARQUET; ALTER TABLE test1 SET TBLPROPERTIES ('comment' = 'This is a new comment.'); {code} {code:java} Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), FieldSchema(name:s1, type:int, comment:null)] at org.apache.hadoop.hive.ql.metadata.Table.setBucketCols(Table.java:552) at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1145) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$alterTable$1(HiveClientImpl.scala:594) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:587) at org.apache.spark.sql.hive.client.HiveClient.alterTable(HiveClient.scala:124) at org.apache.spark.sql.hive.client.HiveClient.alterTable$(HiveClient.scala:123) at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:93) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$alterTable$1(HiveExternalCatalog.scala:687) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) ... 62 more {code}
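The failure in both SPARK-35531 and SPARK-48140 comes down to a case-sensitive column lookup: the Hive metastore stores field names lowercased (as the FieldSchema(name:v1, ...) entries in the error show), while the bucket spec keeps the case written at CREATE TABLE time (V1). A minimal sketch of that mismatch — Python for illustration only; the helper name is hypothetical, not Spark or Hive code:

```python
# Hypothetical sketch of the mismatch behind the HiveException (not Spark's
# actual code): the metastore lowercases field names, but the bucket spec
# keeps the case used at CREATE TABLE time.
def bucket_cols_valid(bucket_cols, table_cols, case_sensitive=True):
    """Return True if every bucket column is found among the table columns."""
    if case_sensitive:
        cols = set(table_cols)
        return all(c in cols for c in bucket_cols)
    cols = {c.lower() for c in table_cols}
    return all(c.lower() in cols for c in bucket_cols)

table_cols = ["v1", "s1"]   # as stored by the Hive metastore
bucket_cols = ["V1"]        # as written in CLUSTERED BY (V1)

print(bucket_cols_valid(bucket_cols, table_cols))                        # False: "V1" not in {v1, s1}
print(bucket_cols_valid(bucket_cols, table_cols, case_sensitive=False))  # True: matches after lowercasing
```

The case-sensitive check is what rejects V1, which is why the same table created with lowercase column names does not hit the exception.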
[jira] [Commented] (SPARK-44265) Built-in XML data source support
[ https://issues.apache.org/jira/browse/SPARK-44265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744535#comment-17744535 ] Sandeep Katta commented on SPARK-44265: --- [~sandip.agarwala] could you also update the SPIP link here, please? > Built-in XML data source support > > > Key: SPARK-44265 > URL: https://issues.apache.org/jira/browse/SPARK-44265 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.5.0 >Reporter: Sandip Agarwala >Priority: Major > > XML is a widely used data format. An external spark-xml package > ([https://github.com/databricks/spark-xml]) is available to read and write > XML data in spark. Making spark-xml built-in will provide a better user > experience for Spark SQL and structured streaming. The proposal is to inline > code from the spark-xml package.
[jira] [Resolved] (SPARK-39257) use spark.read.jdbc() to read data from SQL databse into dataframe, it fails silently, when the session is killed from SQL server side
[ https://issues.apache.org/jira/browse/SPARK-39257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta resolved SPARK-39257. --- Resolution: Not A Problem Issue is caused by the mssql-jdbc driver and is fixed in version 12.1.0 via PR [1942|https://github.com/microsoft/mssql-jdbc/pull/1942] > use spark.read.jdbc() to read data from SQL databse into dataframe, it fails > silently, when the session is killed from SQL server side > -- > > Key: SPARK-39257 > URL: https://issues.apache.org/jira/browse/SPARK-39257 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.2, 3.2.1 > Environment: {*}Spark version{*}: spark 3.0.1/3.1.2/3.2.1 > *Microsoft JDBC Driver* *for SQL server:* > mssql-jdbc-8.2.1.jre8/mssql-jdbc-10.2.1.jre8.jar >Reporter: Xinran Tao >Priority: Major > > I'm using *spark.read.jdbc()* to read from SQL database into a dataframe, > which utilizes *Microsoft JDBC Driver* *for SQL server* to get data from the > SQL server. 
> *codes:* > > {code:java} > %scala > val token = "xxx" > val jdbcHostname = "xinrandatabseserver.database.windows.net" > val jdbcDatabase = "xinranSQLDatabase" > val jdbcPort = 1433 > val jdbcUrl = > "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname, > jdbcPort, jdbcDatabase)+ ";accessToken=" > import java.util.Properties > val connectionProperties = new Properties() > val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver" > connectionProperties.setProperty("Driver", driverClass) > connectionProperties.setProperty("accesstoken", token) > val sql_pushdown = "(select UNITS from payment_balance_new) emp_alias" > val df_stripe_dispute = spark.read.option("connectRetryCount", > 200).option("numPartitions",1).jdbc(url=jdbcUrl, table=sql_pushdown, > properties=connectionProperties) > df_stripe_dispute.count() > {code} > > > The session was accidentally killed by some automatic scripts from SQL server > side, but no errors show up from the spark side, and no failure was observed. > But from the count() result, the records read are far fewer than they should be. > > If I'm directly using *Microsoft JDBC Driver* *for SQL server* to run the > query and print the data out, which doesn't involve spark, a > connection reset error is thrown. 
> *codes:* > > {code:java} > %scala > import java.sql.DriverManager > import java.sql.Connection > import java.util.Properties; > val jdbcHostname = "xinrandatabseserver.database.windows.net" > val jdbcDatabase = "xinranSQLDatabase" > val jdbcPort = "1433" > val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver" > val token = "" > val jdbcUrl = > "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname, > jdbcPort, jdbcDatabase)+ ";accessToken="+token > > var connection:Connection = null > val info:Properties = new Properties(); > info.setProperty("accesstoken", token); > > // make the connection > Class.forName(driver) > connection = DriverManager.getConnection(jdbcUrl,info ) > // create the statement, and run the select query > val statement = connection.createStatement() > val resultSet = statement.executeQuery("select UNITS from > payment_balance_new") > while ( resultSet.next() ) { > println("__"+resultSet.getString(1)) > } > {code} > > *errors:* > > {code:java} > com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:2998) > at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2034) at > com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6446) at > com.microsoft.sqlserver.jdbc.TDSReader.nextPacket(IOBuffer.java:6396) at > com.microsoft.sqlserver.jdbc.TDSReader.ensurePayload(IOBuffer.java:6374) at > com.microsoft.sqlserver.jdbc.TDSReader.readBytes(IOBuffer.java:6675) at > com.microsoft.sqlserver.jdbc.TDSReader.readWrappedBytes(IOBuffer.java:6696) > at com.microsoft.sqlserver.jdbc.TDSReader.readInt(IOBuffer.java:6645) at > com.microsoft.sqlserver.jdbc.TDSReader.readUnsignedInt(IOBuffer.java:6659) at > com.microsoft.sqlserver.jdbc.PLPInputStream.readBytesInternal(PLPInputStream.java:309) > at > 
com.microsoft.sqlserver.jdbc.PLPInputStream.getBytes(PLPInputStream.java:105) > at com.microsoft.sqlserver.jdbc.DDC.convertStreamToObject(DDC.java:757) at > com.microsoft.sqlserver.jdbc.ServerDTVImpl.getValue(dtv.java:3748) at > com.microsoft.sqlserver.jdbc.DTV.getValue(dtv.java:247) at > com.microsoft.sqlserver.jdbc.Column.ge
[jira] [Commented] (SPARK-39257) use spark.read.jdbc() to read data from SQL databse into dataframe, it fails silently, when the session is killed from SQL server side
[ https://issues.apache.org/jira/browse/SPARK-39257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643501#comment-17643501 ] Sandeep Katta commented on SPARK-39257: --- Closing this jira as this is fixed by *mssql-jdbc* PR [1942|https://github.com/microsoft/mssql-jdbc/pull/1942]
[jira] [Comment Edited] (SPARK-39257) use spark.read.jdbc() to read data from SQL databse into dataframe, it fails silently, when the session is killed from SQL server side
[ https://issues.apache.org/jira/browse/SPARK-39257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641736#comment-17641736 ] Sandeep Katta edited comment on SPARK-39257 at 12/1/22 8:02 AM: [~xinrantao] I believe you are facing the same issue as [https://github.com/microsoft/mssql-jdbc/issues/1846]. This was fixed in version [12.1.0|https://github.com/microsoft/mssql-jdbc/releases/tag/v12.1.0] by the *mssql-jdbc* team via PR [1942|https://github.com/microsoft/mssql-jdbc/pull/1942], so you can use the mssql-jdbc jar at version 12.1.0 to fix this issue. was (Author: sandeep.katta2007): [~xinrantao] I believe you are facing the same issue as [https://github.com/microsoft/mssql-jdbc/issues/1846]. And this is fixed by the *mssql-jdbc* team using PR [1942|https://github.com/microsoft/mssql-jdbc/pull/1942], which is available in the release [12.1.0|https://github.com/microsoft/mssql-jdbc/releases/tag/v12.1.0]. You can use this upgraded jar to solve this issue.
[jira] [Commented] (SPARK-39257) use spark.read.jdbc() to read data from SQL databse into dataframe, it fails silently, when the session is killed from SQL server side
[ https://issues.apache.org/jira/browse/SPARK-39257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641736#comment-17641736 ] Sandeep Katta commented on SPARK-39257: --- [~xinrantao] I believe you are facing the same issue as [https://github.com/microsoft/mssql-jdbc/issues/1846]. And this is fixed by the *mssql-jdbc* team using PR [1942|https://github.com/microsoft/mssql-jdbc/pull/1942], which is available in the release [12.1.0|https://github.com/microsoft/mssql-jdbc/releases/tag/v12.1.0]. You can use this upgraded jar to solve this issue.
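Until the fixed driver is in place, one defensive pattern (a sketch under assumptions — not part of Spark or the mssql-jdbc fix) is to cross-check the number of rows Spark read against an independent server-side COUNT(*) and fail loudly instead of silently:

```python
def verify_complete_read(spark_count: int, server_count: int) -> None:
    """Fail loudly when a JDBC read silently returned fewer rows than the
    server holds (e.g. because the session was killed mid-read)."""
    if spark_count != server_count:
        raise RuntimeError(
            f"Incomplete JDBC read: got {spark_count} rows, "
            f"server reports {server_count}"
        )

# Hypothetical usage: spark_count would come from df.count(), server_count
# from a separate "select count(*) from payment_balance_new" pushdown query.
verify_complete_read(100, 100)  # equal counts: passes silently
```

This does not prevent the truncation, but it converts the silent data loss described above into an explicit error.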
[jira] [Commented] (SPARK-41235) High-order function: array_compact
[ https://issues.apache.org/jira/browse/SPARK-41235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641315#comment-17641315 ] Sandeep Katta commented on SPARK-41235: --- I will be working on this > High-order function: array_compact > -- > > Key: SPARK-41235 > URL: https://issues.apache.org/jira/browse/SPARK-41235 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_compact.html
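For context, the Snowflake function referenced above removes NULL elements from an array. A minimal Python sketch of those semantics (illustration only, assuming the usual NULL-in/NULL-out convention — not the eventual Spark implementation):

```python
def array_compact(arr):
    """Drop SQL NULL (None) elements; a NULL input is assumed to yield NULL."""
    if arr is None:
        return None
    return [x for x in arr if x is not None]

print(array_compact([1, None, 3, None]))  # [1, 3]
```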
[jira] [Commented] (SPARK-38474) Use error classes in org.apache.spark.security
[ https://issues.apache.org/jira/browse/SPARK-38474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542473#comment-17542473 ] Sandeep Katta commented on SPARK-38474: --- I will start working on this > Use error classes in org.apache.spark.security > -- > > Key: SPARK-38474 > URL: https://issues.apache.org/jira/browse/SPARK-38474 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Priority: Major >
[jira] [Created] (SPARK-37585) DSV2 InputMetrics are not getting updated in a corner case
Sandeep Katta created SPARK-37585: - Summary: DSV2 InputMetrics are not getting updated in a corner case Key: SPARK-37585 URL: https://issues.apache.org/jira/browse/SPARK-37585 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2, 3.0.3 Reporter: Sandeep Katta In some corner cases, DSV2 does not update the input metrics. This is a very special case in which the number of records read is less than 1000 and *hasNext* is not called for the last element (because input.hasNext returns false, so MetricsIterator.hasNext is not called). The hasNext implementation of MetricsIterator: {code:java} override def hasNext: Boolean = { if (iter.hasNext) { true } else { metricsHandler.updateMetrics(0, force = true) false } } {code} You can reproduce this issue easily in spark-shell by running the code below {code:java} import scala.collection.mutable import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd} spark.conf.set("spark.sql.sources.useV1SourceList", "") val dir = "Users/tmp1" spark.range(0, 100).write.format("parquet").mode("overwrite").save(dir) val df = spark.read.format("parquet").load(dir) val bytesReads = new mutable.ArrayBuffer[Long]() val recordsRead = new mutable.ArrayBuffer[Long]() val bytesReadListener = new SparkListener() { override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = { bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead recordsRead += taskEnd.taskMetrics.inputMetrics.recordsRead } } spark.sparkContext.addSparkListener(bytesReadListener) try { df.limit(10).collect() assert(recordsRead.sum > 0) assert(bytesReads.sum > 0) } finally { spark.sparkContext.removeSparkListener(bytesReadListener) } {code} This code generally fails at *assert(bytesReads.sum > 0)*, which confirms that the updateMetrics API is not called. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
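The corner case above can be distilled outside Spark. Below is a minimal Python sketch (hypothetical class and method names, not Spark's actual implementation) of the same pattern: the final metrics flush lives only inside hasNext, so a limit-style consumer that stops early and never probes the exhausted iterator skips the flush entirely.

```python
# Sketch of the MetricsIterator pattern quoted above: metrics are
# force-flushed only when has_next() observes exhaustion of the
# underlying iterator.
class MetricsIterator:
    def __init__(self, items):
        self._it = iter(items)
        self._peeked = None
        self.flushed = False  # stand-in for metricsHandler.updateMetrics(0, force = true)

    def has_next(self):
        if self._peeked is not None:
            return True
        try:
            self._peeked = next(self._it)
            return True
        except StopIteration:
            self.flushed = True  # the only place the final flush happens
            return False

    def next(self):
        if self._peeked is not None:
            value, self._peeked = self._peeked, None
            return value
        return next(self._it)

# A limit-style consumer takes 2 of 5 elements and stops: it never calls
# has_next() on the exhausted iterator, so the flush never runs.
it = MetricsIterator([1, 2, 3, 4, 5])
taken = [it.next() for _ in range(2) if it.has_next()]
print(taken, it.flushed)  # [1, 2] False
```

In the repro above, df.limit(10).collect() plays the role of the early-stopping consumer: it reads 10 of the 100 records and never calls hasNext past the last element it needs, so the forced metrics update is skipped.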
[jira] [Created] (SPARK-37578) DSV2 is not updating Output Metrics
Sandeep Katta created SPARK-37578: - Summary: DSV2 is not updating Output Metrics Key: SPARK-37578 URL: https://issues.apache.org/jira/browse/SPARK-37578 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2, 3.0.3 Reporter: Sandeep Katta Repro code ./bin/spark-shell --master local --jars /Users/jars/iceberg-spark3-runtime-0.12.1.jar {code:java} import scala.collection.mutable import org.apache.spark.scheduler._ val bytesWritten = new mutable.ArrayBuffer[Long]() val recordsWritten = new mutable.ArrayBuffer[Long]() val bytesWrittenListener = new SparkListener() { override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = { bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten } } spark.sparkContext.addSparkListener(bytesWrittenListener) try { val df = spark.range(1000).toDF("id") df.write.format("iceberg").save("Users/data/dsv2_test") assert(bytesWritten.sum > 0) assert(recordsWritten.sum > 0) } finally { spark.sparkContext.removeSparkListener(bytesWrittenListener) } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-35272) org.apache.spark.SparkException: Task not serializable
[ https://issues.apache.org/jira/browse/SPARK-35272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338301#comment-17338301 ] Sandeep Katta edited comment on SPARK-35272 at 5/3/21, 10:41 AM: - That is correct if I am broadcasting *NonSerializable,* but in this case I am broadcasting *Student* which is serialized was (Author: sandeep.katta2007): That is correct if I am broadcasting *NonSerializable,* but in this case I am broadcasting Student which is serialized > org.apache.spark.SparkException: Task not serializable > -- > > Key: SPARK-35272 > URL: https://issues.apache.org/jira/browse/SPARK-35272 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Sandeep Katta >Priority: Major > Attachments: ExceptionStack.txt > > > I am getting NotSerializableException when broadcasting Serialized class. > > Minimal code with which you can reproduce this issue > > {code:java} > case class Student(name: String) > class NonSerializable() { > def getText() : String = { > """ > |[ > |{ > | "name": "test1" > |}, > |{ > | "name": "test2" > |} > |] > |""".stripMargin > } > } > val obj = new NonSerializable() > val descriptors_string = obj.getText() > import com.github.plokhotnyuk.jsoniter_scala.macros._ > import com.github.plokhotnyuk.jsoniter_scala.core._ > val parsed_descriptors: Array[Student] = > readFromString[Array[Student]](descriptors_string)(JsonCodecMaker.make) > val broadcast_descriptors = spark.sparkContext.broadcast(parsed_descriptors) > def foo(data: String): Seq[Any] = { > import scala.collection.mutable.ArrayBuffer > val res = new ArrayBuffer[String]() > for (desc <- broadcast_descriptors.value) { > res += desc.name > } > res > } > val data = spark.sparkContext.parallelize(Array("1", "2", "3")).map(x => > foo(x)).collect() > {code} > Command used to start spark-shell > {code:java} > ./bin/spark-shell --master local --jars > 
/Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-macros_2.12/2.7.1/jsoniter-scala-macros_2.12-2.7.1.jar,/Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-core_2.12/2.7.1/jsoniter-scala-core_2.12-2.7.1.jar > {code} > *Exception Details* > {code:java} > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2362) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:396) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) > at org.apache.spark.rdd.RDD.map(RDD.scala:395) > ... 51 elided > Caused by: java.io.NotSerializableException: NonSerializable > Serialization stack: > - object not serializable (class: NonSerializable, value: > NonSerializable@7c95440d) > - field (class: $iw, name: obj, type: class NonSerializable) > - object (class $iw, $iw@3ed476a8) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@381f7b8e) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@23635e39) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@1a8b3791) > - field (class: $iw, name: $iw, type: class $iw) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35272) org.apache.spark.SparkException: Task not serializable
[ https://issues.apache.org/jira/browse/SPARK-35272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338301#comment-17338301 ] Sandeep Katta commented on SPARK-35272: --- That is correct if I am broadcasting *NonSerializable,* but in this case I am broadcasting Student which is serialized > org.apache.spark.SparkException: Task not serializable > -- > > Key: SPARK-35272 > URL: https://issues.apache.org/jira/browse/SPARK-35272 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Sandeep Katta >Priority: Major > Attachments: ExceptionStack.txt > > > I am getting NotSerializableException when broadcasting Serialized class. > > Minimal code with which you can reproduce this issue > > {code:java} > case class Student(name: String) > class NonSerializable() { > def getText() : String = { > """ > |[ > |{ > | "name": "test1" > |}, > |{ > | "name": "test2" > |} > |] > |""".stripMargin > } > } > val obj = new NonSerializable() > val descriptors_string = obj.getText() > import com.github.plokhotnyuk.jsoniter_scala.macros._ > import com.github.plokhotnyuk.jsoniter_scala.core._ > val parsed_descriptors: Array[Student] = > readFromString[Array[Student]](descriptors_string)(JsonCodecMaker.make) > val broadcast_descriptors = spark.sparkContext.broadcast(parsed_descriptors) > def foo(data: String): Seq[Any] = { > import scala.collection.mutable.ArrayBuffer > val res = new ArrayBuffer[String]() > for (desc <- broadcast_descriptors.value) { > res += desc.name > } > res > } > val data = spark.sparkContext.parallelize(Array("1", "2", "3")).map(x => > foo(x)).collect() > {code} > Command used to start spark-shell > {code:java} > ./bin/spark-shell --master local --jars > 
/Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-macros_2.12/2.7.1/jsoniter-scala-macros_2.12-2.7.1.jar,/Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-core_2.12/2.7.1/jsoniter-scala-core_2.12-2.7.1.jar > {code} > *Exception Details* > {code:java} > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2362) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:396) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) > at org.apache.spark.rdd.RDD.map(RDD.scala:395) > ... 51 elided > Caused by: java.io.NotSerializableException: NonSerializable > Serialization stack: > - object not serializable (class: NonSerializable, value: > NonSerializable@7c95440d) > - field (class: $iw, name: obj, type: class NonSerializable) > - object (class $iw, $iw@3ed476a8) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@381f7b8e) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@23635e39) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@1a8b3791) > - field (class: $iw, name: $iw, type: class $iw) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-35272) org.apache.spark.SparkException: Task not serializable
[ https://issues.apache.org/jira/browse/SPARK-35272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338297#comment-17338297 ] Sandeep Katta edited comment on SPARK-35272 at 5/3/21, 10:30 AM: - This is the minimal code with which we can reproduce this issue; in reality this *NonSerializable* class contains objects from a 3rd-party library which cannot be *serialized*. This issue can also be solved by using the *transient* keyword like below, {code:java} @transient val obj = new NonSerializable() val descriptors_string = obj.getText() {code} But this does not make sense, as I am broadcasting the *Seq[Student]*, not *NonSerializable* was (Author: sandeep.katta2007): This is the minimal code with which we can reproduce this issue, in reality this *NonSerializable* class contains objects to 3rd party library which cannot be *serialized* > org.apache.spark.SparkException: Task not serializable > -- > > Key: SPARK-35272 > URL: https://issues.apache.org/jira/browse/SPARK-35272 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Sandeep Katta >Priority: Major > Attachments: ExceptionStack.txt > > > I am getting NotSerializableException when broadcasting Serialized class. 
> > Minimal code with which you can reproduce this issue > > {code:java} > case class Student(name: String) > class NonSerializable() { > def getText() : String = { > """ > |[ > |{ > | "name": "test1" > |}, > |{ > | "name": "test2" > |} > |] > |""".stripMargin > } > } > val obj = new NonSerializable() > val descriptors_string = obj.getText() > import com.github.plokhotnyuk.jsoniter_scala.macros._ > import com.github.plokhotnyuk.jsoniter_scala.core._ > val parsed_descriptors: Array[Student] = > readFromString[Array[Student]](descriptors_string)(JsonCodecMaker.make) > val broadcast_descriptors = spark.sparkContext.broadcast(parsed_descriptors) > def foo(data: String): Seq[Any] = { > import scala.collection.mutable.ArrayBuffer > val res = new ArrayBuffer[String]() > for (desc <- broadcast_descriptors.value) { > res += desc.name > } > res > } > val data = spark.sparkContext.parallelize(Array("1", "2", "3")).map(x => > foo(x)).collect() > {code} > Command used to start spark-shell > {code:java} > ./bin/spark-shell --master local --jars > /Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-macros_2.12/2.7.1/jsoniter-scala-macros_2.12-2.7.1.jar,/Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-core_2.12/2.7.1/jsoniter-scala-core_2.12-2.7.1.jar > {code} > *Exception Details* > {code:java} > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2362) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:396) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at 
org.apache.spark.rdd.RDD.withScope(RDD.scala:388) > at org.apache.spark.rdd.RDD.map(RDD.scala:395) > ... 51 elided > Caused by: java.io.NotSerializableException: NonSerializable > Serialization stack: > - object not serializable (class: NonSerializable, value: > NonSerializable@7c95440d) > - field (class: $iw, name: obj, type: class NonSerializable) > - object (class $iw, $iw@3ed476a8) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@381f7b8e) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@23635e39) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@1a8b3791) > - field (class: $iw, name: $iw, type: class $iw) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
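The behaviour discussed above (the broadcast value is serializable, yet the job fails on an unrelated REPL val) is characteristic of closure capture: the shell wraps every top-level val in a wrapper object ($iw in the stack trace), and the task closure drags the whole wrapper, including *obj*, into serialization. A minimal Python sketch of the same mechanism, with pickle standing in for Spark's closure serializer (class names are hypothetical):

```python
import pickle

class NonSerializable:
    # Simulate a 3rd-party handle that refuses serialization.
    def __reduce__(self):
        raise TypeError("NonSerializable cannot be pickled")

class ReplWrapper:
    """Stand-in for the REPL's $iw wrapper holding every top-level val."""
    def __init__(self):
        self.obj = NonSerializable()        # the offending field
        self.students = ["test1", "test2"]  # the value we actually broadcast

wrapper = ReplWrapper()

# Serializing the wrapper fails because of a field the closure never uses...
try:
    pickle.dumps(wrapper)
    wrapper_ok = True
except TypeError:
    wrapper_ok = False

# ...while the extracted value alone (the effect of marking obj @transient,
# or of copying the value into a local) serializes fine.
students_ok = pickle.loads(pickle.dumps(wrapper.students)) == ["test1", "test2"]

print(wrapper_ok, students_ok)  # False True
```

This is why the @transient workaround succeeds: it keeps *obj* out of the serialized wrapper even though the broadcast only ever referenced the parsed Student array.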
[jira] [Commented] (SPARK-35272) org.apache.spark.SparkException: Task not serializable
[ https://issues.apache.org/jira/browse/SPARK-35272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338297#comment-17338297 ] Sandeep Katta commented on SPARK-35272: --- This is the minimal code with which we can reproduce this issue, in reality this *NonSerializable* class contains objects to 3rd party library which cannot be *serialized* > org.apache.spark.SparkException: Task not serializable > -- > > Key: SPARK-35272 > URL: https://issues.apache.org/jira/browse/SPARK-35272 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Sandeep Katta >Priority: Major > Attachments: ExceptionStack.txt > > > I am getting NotSerializableException when broadcasting Serialized class. > > Minimal code with which you can reproduce this issue > > {code:java} > case class Student(name: String) > class NonSerializable() { > def getText() : String = { > """ > |[ > |{ > | "name": "test1" > |}, > |{ > | "name": "test2" > |} > |] > |""".stripMargin > } > } > val obj = new NonSerializable() > val descriptors_string = obj.getText() > import com.github.plokhotnyuk.jsoniter_scala.macros._ > import com.github.plokhotnyuk.jsoniter_scala.core._ > val parsed_descriptors: Array[Student] = > readFromString[Array[Student]](descriptors_string)(JsonCodecMaker.make) > val broadcast_descriptors = spark.sparkContext.broadcast(parsed_descriptors) > def foo(data: String): Seq[Any] = { > import scala.collection.mutable.ArrayBuffer > val res = new ArrayBuffer[String]() > for (desc <- broadcast_descriptors.value) { > res += desc.name > } > res > } > val data = spark.sparkContext.parallelize(Array("1", "2", "3")).map(x => > foo(x)).collect() > {code} > Command used to start spark-shell > {code:java} > ./bin/spark-shell --master local --jars > 
/Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-macros_2.12/2.7.1/jsoniter-scala-macros_2.12-2.7.1.jar,/Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-core_2.12/2.7.1/jsoniter-scala-core_2.12-2.7.1.jar > {code} > *Exception Details* > {code:java} > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2362) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:396) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) > at org.apache.spark.rdd.RDD.map(RDD.scala:395) > ... 51 elided > Caused by: java.io.NotSerializableException: NonSerializable > Serialization stack: > - object not serializable (class: NonSerializable, value: > NonSerializable@7c95440d) > - field (class: $iw, name: obj, type: class NonSerializable) > - object (class $iw, $iw@3ed476a8) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@381f7b8e) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@23635e39) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@1a8b3791) > - field (class: $iw, name: $iw, type: class $iw) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35272) org.apache.spark.SparkException: Task not serializable
Sandeep Katta created SPARK-35272: - Summary: org.apache.spark.SparkException: Task not serializable Key: SPARK-35272 URL: https://issues.apache.org/jira/browse/SPARK-35272 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1 Reporter: Sandeep Katta Attachments: ExceptionStack.txt I am getting NotSerializableException when broadcasting Serialized class. Minimal code with which you can reproduce this issue {code:java} case class Student(name: String) class NonSerializable() { def getText() : String = { """ |[ |{ | "name": "test1" |}, |{ | "name": "test2" |} |] |""".stripMargin } } val obj = new NonSerializable() val descriptors_string = obj.getText() import com.github.plokhotnyuk.jsoniter_scala.macros._ import com.github.plokhotnyuk.jsoniter_scala.core._ val parsed_descriptors: Array[Student] = readFromString[Array[Student]](descriptors_string)(JsonCodecMaker.make) val broadcast_descriptors = spark.sparkContext.broadcast(parsed_descriptors) def foo(data: String): Seq[Any] = { import scala.collection.mutable.ArrayBuffer val res = new ArrayBuffer[String]() for (desc <- broadcast_descriptors.value) { res += desc.name } res } val data = spark.sparkContext.parallelize(Array("1", "2", "3")).map(x => foo(x)).collect() {code} Command used to start spark-shell {code:java} ./bin/spark-shell --master local --jars /Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-macros_2.12/2.7.1/jsoniter-scala-macros_2.12-2.7.1.jar,/Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-core_2.12/2.7.1/jsoniter-scala-core_2.12-2.7.1.jar {code} *Exception Details* {code:java} org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
org.apache.spark.SparkContext.clean(SparkContext.scala:2362) at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:396) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.RDD.map(RDD.scala:395) ... 51 elided Caused by: java.io.NotSerializableException: NonSerializable Serialization stack: - object not serializable (class: NonSerializable, value: NonSerializable@7c95440d) - field (class: $iw, name: obj, type: class NonSerializable) - object (class $iw, $iw@3ed476a8) - field (class: $iw, name: $iw, type: class $iw) - object (class $iw, $iw@381f7b8e) - field (class: $iw, name: $iw, type: class $iw) - object (class $iw, $iw@23635e39) - field (class: $iw, name: $iw, type: class $iw) - object (class $iw, $iw@1a8b3791) - field (class: $iw, name: $iw, type: class $iw) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35272) org.apache.spark.SparkException: Task not serializable
[ https://issues.apache.org/jira/browse/SPARK-35272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-35272: -- Attachment: ExceptionStack.txt > org.apache.spark.SparkException: Task not serializable > -- > > Key: SPARK-35272 > URL: https://issues.apache.org/jira/browse/SPARK-35272 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Sandeep Katta >Priority: Major > Attachments: ExceptionStack.txt > > > I am getting NotSerializableException when broadcasting Serialized class. > > Minimal code with which you can reproduce this issue > > {code:java} > case class Student(name: String) > class NonSerializable() { > def getText() : String = { > """ > |[ > |{ > | "name": "test1" > |}, > |{ > | "name": "test2" > |} > |] > |""".stripMargin > } > } > val obj = new NonSerializable() > val descriptors_string = obj.getText() > import com.github.plokhotnyuk.jsoniter_scala.macros._ > import com.github.plokhotnyuk.jsoniter_scala.core._ > val parsed_descriptors: Array[Student] = > readFromString[Array[Student]](descriptors_string)(JsonCodecMaker.make) > val broadcast_descriptors = spark.sparkContext.broadcast(parsed_descriptors) > def foo(data: String): Seq[Any] = { > import scala.collection.mutable.ArrayBuffer > val res = new ArrayBuffer[String]() > for (desc <- broadcast_descriptors.value) { > res += desc.name > } > res > } > val data = spark.sparkContext.parallelize(Array("1", "2", "3")).map(x => > foo(x)).collect() > {code} > Command used to start spark-shell > {code:java} > ./bin/spark-shell --master local --jars > /Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-macros_2.12/2.7.1/jsoniter-scala-macros_2.12-2.7.1.jar,/Users/sandeep.katta/.m2/repository/com/github/plokhotnyuk/jsoniter-scala/jsoniter-scala-core_2.12/2.7.1/jsoniter-scala-core_2.12-2.7.1.jar > {code} > *Exception Details* > {code:java} > org.apache.spark.SparkException: Task not 
serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2362) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:396) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) > at org.apache.spark.rdd.RDD.map(RDD.scala:395) > ... 51 elided > Caused by: java.io.NotSerializableException: NonSerializable > Serialization stack: > - object not serializable (class: NonSerializable, value: > NonSerializable@7c95440d) > - field (class: $iw, name: obj, type: class NonSerializable) > - object (class $iw, $iw@3ed476a8) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@381f7b8e) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@23635e39) > - field (class: $iw, name: $iw, type: class $iw) > - object (class $iw, $iw@1a8b3791) > - field (class: $iw, name: $iw, type: class $iw) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case-insensitive
[ https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322360#comment-17322360 ] Sandeep Katta commented on SPARK-35096: --- Working on fix, soon raise PR for this > foreachBatch throws ArrayIndexOutOfBoundsException if schema is case > Insensitive > > > Key: SPARK-35096 > URL: https://issues.apache.org/jira/browse/SPARK-35096 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > > Below code works fine before spark3, running on spark3 throws > java.lang.ArrayIndexOutOfBoundsException > {code:java} > val inputPath = "/Users/xyz/data/testcaseInsensitivity" > val output_path = "/Users/xyz/output" > spark.range(10).write.format("parquet").save(inputPath) > def process_row(microBatch: DataFrame, batchId: Long): Unit = { > val df = microBatch.select($"ID".alias("other")) // Doesn't work > df.write.format("parquet").mode("append").save(output_path) > } > val schema = new StructType().add("id", LongType) > val stream_df = > spark.readStream.schema(schema).format("parquet").load(inputPath) > stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _) > .start().awaitTermination() > {code} > Stack Trace: > {code:java} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 > at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149) > at > scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) > at > scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) > at scala.collection.mutable.Wrap
[jira] [Created] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case-insensitive
Sandeep Katta created SPARK-35096: - Summary: foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive Key: SPARK-35096 URL: https://issues.apache.org/jira/browse/SPARK-35096 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Sandeep Katta Below code works fine before spark3, running on spark3 throws java.lang.ArrayIndexOutOfBoundsException {code:java} val inputPath = "/Users/xyz/data/testcaseInsensitivity" val output_path = "/Users/xyz/output" spark.range(10).write.format("parquet").save(inputPath) def process_row(microBatch: DataFrame, batchId: Long): Unit = { val df = microBatch.select($"ID".alias("other")) // Doesn't work df.write.format("parquet").mode("append").save(output_path) } val schema = new StructType().add("id", LongType) val stream_df = spark.readStream.schema(schema).format("parquet").load(inputPath) stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _) .start().awaitTermination() {code} Stack Trace: {code:java} Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.immutable.List.map(List.scala:298) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203) at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149) at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:38) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:146) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:138) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:138) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.sca
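For context, the failing select resolves $"ID" case-insensitively during analysis, but the serializer-pruning step appears to match field names exactly, so the lookup against the user-supplied schema (which declares "id") comes up empty and indexing into the result throws. A minimal, Spark-free Java sketch of that general failure mode (the helper below is illustrative, not Spark's actual code):

```java
public class CaseSensitiveLookup {
    // Exact-case field lookup, analogous to resolving a selected column
    // name against the user-supplied schema's field names.
    static int fieldIndex(String[] fields, String name) {
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(name)) return i;   // exact-case match only
        }
        return -1;                                  // no field matched
    }

    public static void main(String[] args) {
        String[] schema = {"id"};
        long[] row = {42L};
        // Matching case works as expected.
        System.out.println(row[fieldIndex(schema, "id")]);
        // Mismatched case yields -1, and indexing with it throws,
        // mirroring the ArrayIndexOutOfBoundsException above.
        try {
            System.out.println(row[fieldIndex(schema, "ID")]);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("threw " + e.getClass().getSimpleName());
        }
    }
}
```

A workaround along these lines is to reference the column with the same case used in the supplied schema (e.g. `$"id"` instead of `$"ID"`).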
[jira] [Comment Edited] (SPARK-21564) TaskDescription decoding failure should fail the task
[ https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294347#comment-17294347 ] Sandeep Katta edited comment on SPARK-21564 at 3/3/21, 7:47 AM: I recently hit this decode error. The irony is that all the tasks in the same task set were able to deserialize and only one task failed, which suggests the data was corrupted. Most of the time it is very difficult to analyze why the data got corrupted, so for this kind of intermittent issue exception handling should be in place to achieve fault tolerance. {code:java} 21/02/11 07:53:39 ERROR Inbox: Ignoring errorjava.io.UTFDataFormatException: malformed input around byte 5 at java.io.DataInputStream.readUTF(DataInputStream.java:656) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.spark.scheduler.TaskDescription$$anonfun$deserializeStringLongMap$1.apply$mcVI$sp(TaskDescription.scala:110) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) at org.apache.spark.scheduler.TaskDescription$.deserializeStringLongMap(TaskDescription.scala:109) at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:125) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:100) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:226) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} !image-2021-03-03-13-02-31-744.png!
So it's better to have fault tolerance in place: the Spark driver has no idea this exception occurred, so it keeps waiting for the task to complete and the job sits in a zombie state. CC [~dongjoon] [~hyukjin.kwon] [~cloud_fan] tagging you guys for more traction was (Author: sandeep.katta2007): Recently I have hit with decode error, irony is all the tasks in the same taskset were able to deserialize but only 1 task is failed . Which says that the data is corrupted, most of the times it will be very difficult to analyze why the data is corrupted , so for these kind of intermittent issue exception handling should be in place to achieve fault tolerant *21/02/11 07:53:39 ERROR Inbox: Ignoring errorjava.io.UTFDataFormatException: malformed input around byte 5 at* java.io.DataInputStream.readUTF(DataInputStream.java:656) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.spark.scheduler.TaskDescription$$anonfun$deserializeStringLongMap$1.apply$mcVI$sp(TaskDescription.scala:110) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) at org.apache.spark.scheduler.TaskDescription$.deserializeStringLongMap(TaskDescription.scala:109) at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:125) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:100) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:226) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) !image-2021-03-03-13-02-31-744.png!
So it's better to have fault tolerant in place, Spark Driver does not have any idea about this exception so it still waits for this task to complete, thus the job is in zombie stage CC [~dongjoon] [~hyukjin.kwon] [~cloud_fan] tagging you guys for more traction > TaskDescription decoding failure should fail the task > - > > Key: SPARK-21564 > URL: https://issues.apache.org/jira/browse/SPARK-21564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.4.7, 3.1.1 >Reporter: Andrew Ash >Priority: Major > Labels: bulk-closed > Attachments: image-2021-03-03-13-02-06-669.png, > image-2021-03-03-13-02-31-744.png > > > cc [~robert3005] > I was seeing an issue where Spark was throwing this exception: > {noformat} > 16:16:28.294 [dispatcher-event-loop-14] ERROR > org.apache.spark.rpc.netty.Inbox - Ignoring error > java.io.EOFException: null > at
[jira] [Updated] (SPARK-21564) TaskDescription decoding failure should fail the task
[ https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-21564: -- Affects Version/s: 2.4.7 3.1.1 > TaskDescription decoding failure should fail the task > - > > Key: SPARK-21564 > URL: https://issues.apache.org/jira/browse/SPARK-21564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.4.7, 3.1.1 >Reporter: Andrew Ash >Priority: Major > Labels: bulk-closed > Attachments: image-2021-03-03-13-02-06-669.png, > image-2021-03-03-13-02-31-744.png > > > cc [~robert3005] > I was seeing an issue where Spark was throwing this exception: > {noformat} > 16:16:28.294 [dispatcher-event-loop-14] ERROR > org.apache.spark.rpc.netty.Inbox - Ignoring error > java.io.EOFException: null > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at java.io.DataInputStream.readUTF(DataInputStream.java:609) > at java.io.DataInputStream.readUTF(DataInputStream.java:564) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at > org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > 
{noformat} > For details on the cause of that exception, see SPARK-21563 > We've since changed the application and have a proposed fix in Spark at the > ticket above, but it was troubling that decoding the TaskDescription wasn't > failing the tasks. So the Spark job ended up hanging and making no progress, > permanently stuck, because the driver thinks the task is running but the > thread has died in the executor. > We should make a change around > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96 > so that when that decode throws an exception, the task is marked as failed. > cc [~kayousterhout] [~irashid] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
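The proposal above can be illustrated with a small, self-contained sketch. This is hypothetical (the method names, status strings, and reporting mechanism are illustrative, not Spark's actual executor API): the point is that a payload that fails to decode should surface as an explicit task failure rather than dying silently in the message loop while the driver keeps waiting.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class DecodeGuard {
    // Stand-in for TaskDescription.decode: readUTF throws
    // UTFDataFormatException or EOFException on corrupted payloads.
    static String decode(byte[] payload) throws IOException {
        return new DataInputStream(new ByteArrayInputStream(payload)).readUTF();
    }

    // Instead of letting the exception escape into the message loop and be
    // logged away, translate it into an explicit task-failure status that
    // could be reported back to the driver.
    static String launchTask(byte[] payload) {
        try {
            return "RUNNING:" + decode(payload);
        } catch (IOException e) {
            return "FAILED:" + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] ok  = {0, 2, 'o', 'k'};     // well-formed: length 2, then "ok"
        byte[] bad = {0, 1, (byte) 0xC0};  // truncated multi-byte UTF sequence
        System.out.println(launchTask(ok));   // RUNNING:ok
        System.out.println(launchTask(bad));  // FAILED:UTFDataFormatException
    }
}
```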
[jira] [Commented] (SPARK-21564) TaskDescription decoding failure should fail the task
[ https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294347#comment-17294347 ] Sandeep Katta commented on SPARK-21564: --- I recently hit this decode error. The irony is that all the tasks in the same task set were able to deserialize and only one task failed, which suggests the data was corrupted. Most of the time it is very difficult to analyze why the data got corrupted, so for this kind of intermittent issue exception handling should be in place to achieve fault tolerance. *21/02/11 07:53:39 ERROR Inbox: Ignoring errorjava.io.UTFDataFormatException: malformed input around byte 5 at* java.io.DataInputStream.readUTF(DataInputStream.java:656) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.spark.scheduler.TaskDescription$$anonfun$deserializeStringLongMap$1.apply$mcVI$sp(TaskDescription.scala:110) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) at org.apache.spark.scheduler.TaskDescription$.deserializeStringLongMap(TaskDescription.scala:109) at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:125) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:100) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:226) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) !image-2021-03-03-13-02-31-744.png!
So it's better to have fault tolerance in place: the Spark driver has no idea this exception occurred, so it keeps waiting for the task to complete and the job sits in a zombie state. CC [~dongjoon] [~hyukjin.kwon] [~cloud_fan] tagging you guys for more traction > TaskDescription decoding failure should fail the task > - > > Key: SPARK-21564 > URL: https://issues.apache.org/jira/browse/SPARK-21564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Major > Labels: bulk-closed > Attachments: image-2021-03-03-13-02-06-669.png, > image-2021-03-03-13-02-31-744.png > > > cc [~robert3005] > I was seeing an issue where Spark was throwing this exception: > {noformat} > 16:16:28.294 [dispatcher-event-loop-14] ERROR > org.apache.spark.rpc.netty.Inbox - Ignoring error > java.io.EOFException: null > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at java.io.DataInputStream.readUTF(DataInputStream.java:609) > at java.io.DataInputStream.readUTF(DataInputStream.java:564) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at > org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at >
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > For details on the cause of that exception, see SPARK-21563 > We've since changed the application and have a proposed fix in Spark at the > ticket above, but it was troubling that decoding the TaskDescription wasn't > failing the tasks. So the Spark job ended up hanging and making no progress, > permanently stuck, because the driver thinks the task is running but the > thread has died in the executor. > We should make a change around > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96 > so that when that decode throws an exception, the task is marked as failed. > cc [~kayousterhout] [~irashid] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (SPARK-21564) TaskDescription decoding failure should fail the task
[ https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-21564: -- Attachment: image-2021-03-03-13-02-31-744.png > TaskDescription decoding failure should fail the task > - > > Key: SPARK-21564 > URL: https://issues.apache.org/jira/browse/SPARK-21564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Major > Labels: bulk-closed > Attachments: image-2021-03-03-13-02-06-669.png, > image-2021-03-03-13-02-31-744.png > > > cc [~robert3005] > I was seeing an issue where Spark was throwing this exception: > {noformat} > 16:16:28.294 [dispatcher-event-loop-14] ERROR > org.apache.spark.rpc.netty.Inbox - Ignoring error > java.io.EOFException: null > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at java.io.DataInputStream.readUTF(DataInputStream.java:609) > at java.io.DataInputStream.readUTF(DataInputStream.java:564) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at > org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > 
{noformat} > For details on the cause of that exception, see SPARK-21563 > We've since changed the application and have a proposed fix in Spark at the > ticket above, but it was troubling that decoding the TaskDescription wasn't > failing the tasks. So the Spark job ended up hanging and making no progress, > permanently stuck, because the driver thinks the task is running but the > thread has died in the executor. > We should make a change around > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96 > so that when that decode throws an exception, the task is marked as failed. > cc [~kayousterhout] [~irashid] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21564) TaskDescription decoding failure should fail the task
[ https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-21564: -- Attachment: image-2021-03-03-13-02-06-669.png > TaskDescription decoding failure should fail the task > - > > Key: SPARK-21564 > URL: https://issues.apache.org/jira/browse/SPARK-21564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Major > Labels: bulk-closed > Attachments: image-2021-03-03-13-02-06-669.png, > image-2021-03-03-13-02-31-744.png > > > cc [~robert3005] > I was seeing an issue where Spark was throwing this exception: > {noformat} > 16:16:28.294 [dispatcher-event-loop-14] ERROR > org.apache.spark.rpc.netty.Inbox - Ignoring error > java.io.EOFException: null > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at java.io.DataInputStream.readUTF(DataInputStream.java:609) > at java.io.DataInputStream.readUTF(DataInputStream.java:564) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at > org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > 
{noformat} > For details on the cause of that exception, see SPARK-21563 > We've since changed the application and have a proposed fix in Spark at the > ticket above, but it was troubling that decoding the TaskDescription wasn't > failing the tasks. So the Spark job ended up hanging and making no progress, > permanently stuck, because the driver thinks the task is running but the > thread has died in the executor. > We should make a change around > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96 > so that when that decode throws an exception, the task is marked as failed. > cc [~kayousterhout] [~irashid] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock
[ https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190626#comment-17190626 ] Sandeep Katta commented on SPARK-32779: --- [~bersprockets] thanks for raising this issue, I am working on this. I will raise the patch soon > Spark/Hive3 interaction potentially causes deadlock > --- > > Key: SPARK-32779 > URL: https://issues.apache.org/jira/browse/SPARK-32779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > This is an issue for applications that share a Spark Session across multiple > threads. > sessionCatalog.loadPartition (after checking that the table exists) grabs > locks in this order: > - HiveExternalCatalog > - HiveSessionCatalog (in Shim_v3_0) > Other operations (e.g., sessionCatalog.tableExists), grab locks in this order: > - HiveSessionCatalog > - HiveExternalCatalog > [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332] > appears to be the culprit. Maybe db name should be defaulted _before_ the > call to HiveClient so that Shim_v3_0 doesn't have to call back into > SessionCatalog. Or possibly this is not needed at all, since loadPartition in > Shim_v2_1 doesn't worry about the default db name, but that might be because > of differences between Hive client libraries. 
> Reproduction case: > - You need to have a running Hive 3.x HMS instance and the appropriate > hive-site.xml for your Spark instance > - Adjust your spark.sql.hive.metastore.version accordingly > - It might take more than one try to hit the deadlock > Launch Spark: > {noformat} > bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" > --conf spark.sql.hive.metastore.version=3.1 > {noformat} > Then use the following code: > {noformat} > import scala.collection.mutable.ArrayBuffer > import scala.util.Random > val tableCount = 4 > for (i <- 0 until tableCount) { > val tableName = s"partitioned${i+1}" > sql(s"drop table if exists $tableName") > sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored > as orc") > } > val threads = new ArrayBuffer[Thread] > for (i <- 0 until tableCount) { > threads.append(new Thread( new Runnable { > override def run: Unit = { > val tableName = s"partitioned${i + 1}" > val rand = Random > val df = spark.range(0, 2).toDF("a") > val location = s"/tmp/${rand.nextLong.abs}" > df.write.mode("overwrite").orc(location) > sql( > s""" > LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition > (b=$i)""") > } > }, s"worker$i")) > threads(i).start() > } > for (i <- 0 until tableCount) { > println(s"Joining with thread $i") > threads(i).join() > } > println("All done") > {noformat} > The job often gets stuck after one or two "Joining..." lines. 
> {{kill -3}} shows something like this: > {noformat} > Found one Java-level deadlock: > = > "worker3": > waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a > org.apache.spark.sql.hive.HiveSessionCatalog), > which is held by "worker0" > "worker0": > waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a > org.apache.spark.sql.hive.HiveExternalCatalog), > which is held by "worker3" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
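Whatever form the eventual patch takes, the classic remedy for this shape of deadlock is a single global lock-acquisition order. A minimal sketch (the object names mirror the jstack report above but are otherwise illustrative, not Spark's actual classes): when every path that needs both monitors takes them in the same order, no wait cycle, and hence no deadlock, is possible.

```java
public class LockOrdering {
    // Stand-ins for the two monitors named in the jstack report.
    static final Object externalCatalog = new Object();
    static final Object sessionCatalog = new Object();

    // Both operations acquire externalCatalog first, then sessionCatalog.
    static String loadPartition() {
        synchronized (externalCatalog) {
            synchronized (sessionCatalog) {
                return "partition loaded";
            }
        }
    }

    // Same global order here: no path ever holds sessionCatalog while
    // waiting for externalCatalog, so the reported cycle cannot form.
    static String tableExists() {
        synchronized (externalCatalog) {
            synchronized (sessionCatalog) {
                return "table exists";
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> System.out.println(loadPartition()));
        Thread t2 = new Thread(() -> System.out.println(tableExists()));
        t1.start(); t2.start();
        t1.join(); t2.join();   // completes; with inverted orders it could hang
    }
}
```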
[jira] [Commented] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190333#comment-17190333 ] Sandeep Katta commented on SPARK-32764: --- So it's an edge-case scenario; should we fix it, or can we leave it as a limitation? > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Priority: Major > Labels: correctness > Attachments: 2.4_codegen.txt, 3.0_Codegen.txt > > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. > **Apache Spark 2.4.6 RESULT** > {code} > scala> spark.version > res0: String = 2.4.6 > scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < > col("pos")).show > ++---+-+ > | neg|pos| comp| > ++---+-+ > |-0.0|0.0|false| > ++---+-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190331#comment-17190331 ] Sandeep Katta edited comment on SPARK-32764 at 9/3/20, 5:47 PM: [~cloud_fan] and [~smilegator] I did some analysis on this: it is because, as part of SPARK-30009, org.apache.spark.util.Utils.nanSafeCompareDoubles was replaced with java.lang.Double.compare. The same can be confirmed from the codegen code; I have attached it for reference. If you look at the implementation of java.lang.Double.compare: *java.lang.Double.compare(-0.0, 0.0) < 0 evaluates to true* *java.lang.Double.compare(0.0, -0.0) < 0 evaluates to false* was (Author: sandeep.katta2007): [~cloud_fan] and [~smilegator] I did some analysis on this, it is because as a part of SPARK-30009 org.apache.spark.util.Utils.nanSafeCompareDoubles is replaced with java.lang.Double.compare. Same can be confirmed from the codegen code, I have attached those for referrence. If you see the implementation of java.lang.Double.compare *java.lang.Double.compare(-0.0, 0.0) < 0 evaluates to true* *java.lang.Double.compare(0.0, -0.0) < 0 evaluates to false* > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Priority: Major > Labels: correctness > Attachments: 2.4_codegen.txt, 3.0_Codegen.txt > > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. 
> **Apache Spark 2.4.6 RESULT** > {code} > scala> spark.version > res0: String = 2.4.6 > scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < > col("pos")).show > ++---+-+ > | neg|pos| comp| > ++---+-+ > |-0.0|0.0|false| > ++---+-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
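The comparison semantics described in the analysis above can be checked directly on any JVM: java.lang.Double.compare imposes a total order that places -0.0 strictly below 0.0, while the primitive < and == operators follow IEEE 754 arithmetic, under which the two zeros are equal.

```java
public class NegativeZero {
    public static void main(String[] args) {
        // Total ordering: Double.compare distinguishes the two zeros.
        System.out.println(Double.compare(-0.0, 0.0) < 0); // true
        System.out.println(Double.compare(0.0, -0.0) > 0); // true
        // IEEE 754 primitive comparison: -0.0 and 0.0 are equal.
        System.out.println(-0.0 < 0.0);                    // false
        System.out.println(-0.0 == 0.0);                   // true
    }
}
```

This is why swapping the comparison implementation changed the result of `-0.0 < 0.0` from false (2.4.x) to true (3.0.x) in the reported query.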
[jira] [Comment Edited] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190331#comment-17190331 ] Sandeep Katta edited comment on SPARK-32764 at 9/3/20, 5:45 PM: [~cloud_fan] and [~smilegator] I did some analysis on this: it is because, as part of SPARK-30009, org.apache.spark.util.Utils.nanSafeCompareDoubles was replaced with java.lang.Double.compare. The same can be confirmed from the codegen code; I have attached it for reference. If you look at the implementation of java.lang.Double.compare: *java.lang.Double.compare(-0.0, 0.0) < 0 evaluates to true* *java.lang.Double.compare(0.0, -0.0) < 0 evaluates to false* was (Author: sandeep.katta2007): [~cloud_fan] and [~smilegator] I did some analysis on this, it is because as a part of SPARK-30009 org.apache.spark.util.Utils.nanSafeCompareDoubles is replaced with java.lang.Double.compare. Same can be confirmed from the codegen code > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Priority: Major > Labels: correctness > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. 
> **Apache Spark 2.4.6 RESULT** > {code} > scala> spark.version > res0: String = 2.4.6 > scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < > col("pos")).show > ++---+-+ > | neg|pos| comp| > ++---+-+ > |-0.0|0.0|false| > ++---+-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190331#comment-17190331 ] Sandeep Katta commented on SPARK-32764: --- [~cloud_fan] and [~smilegator] I did some analysis on this: it is because, as part of SPARK-30009, org.apache.spark.util.Utils.nanSafeCompareDoubles was replaced with java.lang.Double.compare. The same can be confirmed from the codegen code. > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Priority: Major > Labels: correctness > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. > **Apache Spark 2.4.6 RESULT** > {code} > scala> spark.version > res0: String = 2.4.6 > scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < > col("pos")).show > ++---+-+ > | neg|pos| comp| > ++---+-+ > |-0.0|0.0|false| > ++---+-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29314) ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data
[ https://issues.apache.org/jira/browse/SPARK-29314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167644#comment-17167644 ] Sandeep Katta commented on SPARK-29314: --- ya my bad, not required to backport > ProgressReporter.extractStateOperatorMetrics should not overwrite updated as > 0 when it actually runs a batch even with no data > -- > > Key: SPARK-29314 > URL: https://issues.apache.org/jira/browse/SPARK-29314 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > SPARK-24156 brought the ability to run a batch without actual data to enable > fast state cleanup as well as emit evicted outputs without waiting actual > data to come. > This breaks some assumption on > `ProgressReporter.extractStateOperatorMetrics`. See comment in source code: > {code:java} > // lastExecution could belong to one of the previous triggers if > `!hasNewData`. > // Walking the plan again should be inexpensive. > {code} > and newNumRowsUpdated is replaced to 0 if hasNewData is false. It makes sense > if we copy progress from previous execution (which means no batch is run for > this time), but after SPARK-24156 the precondition is broken. > Spark should still replace the value of newNumRowsUpdated with 0 if there's > no batch being run and it needs to copy the old value from previous > execution, but it shouldn't touch the value if it runs a batch for no data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
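The rule the issue description asks for can be summarized as a tiny decision function — an illustrative sketch, not ProgressReporter's actual code: keep the value extracted from the executed plan whenever a batch actually ran (even a no-data batch), and reset to 0 only when progress is copied from a previous execution.

```java
public class StateOperatorMetricRule {
    // ranBatch: a batch (possibly with no data) was executed this trigger.
    // fromExecutedPlan: numRowsUpdated extracted from lastExecution's plan.
    static long numRowsUpdated(boolean ranBatch, long fromExecutedPlan) {
        return ranBatch ? fromExecutedPlan : 0L;
    }

    public static void main(String[] args) {
        System.out.println(numRowsUpdated(true, 7));  // a no-data batch ran: keep 7
        System.out.println(numRowsUpdated(false, 7)); // progress copied from previous: reset to 0
    }
}
```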
[jira] [Commented] (SPARK-29314) ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data
[ https://issues.apache.org/jira/browse/SPARK-29314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167243#comment-17167243 ] Sandeep Katta commented on SPARK-29314: --- [~kabhwan] [~brkyvz] this is required to backport to 2.4 branch > ProgressReporter.extractStateOperatorMetrics should not overwrite updated as > 0 when it actually runs a batch even with no data > -- > > Key: SPARK-29314 > URL: https://issues.apache.org/jira/browse/SPARK-29314 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > SPARK-24156 brought the ability to run a batch without actual data to enable > fast state cleanup as well as emit evicted outputs without waiting actual > data to come. > This breaks some assumption on > `ProgressReporter.extractStateOperatorMetrics`. See comment in source code: > {code:java} > // lastExecution could belong to one of the previous triggers if > `!hasNewData`. > // Walking the plan again should be inexpensive. > {code} > and newNumRowsUpdated is replaced to 0 if hasNewData is false. It makes sense > if we copy progress from previous execution (which means no batch is run for > this time), but after SPARK-24156 the precondition is broken. > Spark should still replace the value of newNumRowsUpdated with 0 if there's > no batch being run and it needs to copy the old value from previous > execution, but it shouldn't touch the value if it runs a batch for no data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28054) Unable to insert partitioned table dynamically when partition name is upper case
[ https://issues.apache.org/jira/browse/SPARK-28054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117759#comment-17117759 ] Sandeep Katta commented on SPARK-28054: --- [~hyukjin.kwon] is there any reason why this PR is not backported to branch2.4 ? > Unable to insert partitioned table dynamically when partition name is upper > case > > > Key: SPARK-28054 > URL: https://issues.apache.org/jira/browse/SPARK-28054 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: ChenKai >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > {code:java} > -- create sql and column name is upper case > CREATE TABLE src (KEY STRING, VALUE STRING) PARTITIONED BY (DS STRING) > -- insert sql > INSERT INTO TABLE src PARTITION(ds) SELECT 'k' key, 'v' value, '1' ds > {code} > The error is: > {code:java} > Error in query: > org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: > Partition spec {ds=, DS=1} contains non-partition columns; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
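The error in this issue stems from a case-sensitive lookup: the user-supplied partition spec key DS never matches the metastore's lower-cased column ds, so the spec appears to contain a non-partition column. A hedged sketch of the normalization such a fix needs — the names and structure here are illustrative, not the actual SPARK-28054 patch:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionSpecNormalize {
    // Resolve each spec key against the table's partition columns case-insensitively,
    // storing the value under the metastore's own casing.
    static Map<String, String> normalize(Map<String, String> spec, List<String> partCols) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : spec.entrySet()) {
            for (String col : partCols) {
                if (col.equalsIgnoreCase(e.getKey())) {
                    out.put(col, e.getValue());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // User wrote PARTITION(DS), but the metastore stores the column as "ds"
        System.out.println(normalize(Map.of("DS", "1"), List.of("ds")));
    }
}
```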
[jira] [Commented] (SPARK-31761) Sql Div operator can result in incorrect output for int_min
[ https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113525#comment-17113525 ] Sandeep Katta commented on SPARK-31761: --- I have raised the Pull request [https://github.com/apache/spark/pull/28600] > Sql Div operator can result in incorrect output for int_min > --- > > Key: SPARK-31761 > URL: https://issues.apache.org/jira/browse/SPARK-31761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Priority: Major > > Input in csv : -2147483648,-1 --> (_c0, _c1) > {code} > val res = df.selectExpr("_c0 div _c1") > res.collect > res1: Array[org.apache.spark.sql.Row] = Array([-2147483648]) > {code} > The result should be 2147483648 instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31761) Sql Div operator can result in incorrect output for int_min
[ https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113453#comment-17113453 ] Sandeep Katta commented on SPARK-31761: --- okay I will try to fix it > Sql Div operator can result in incorrect output for int_min > --- > > Key: SPARK-31761 > URL: https://issues.apache.org/jira/browse/SPARK-31761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Priority: Major > > Input in csv : -2147483648,-1 --> (_c0, _c1) > {code} > val res = df.selectExpr("_c0 div _c1") > res.collect > res1: Array[org.apache.spark.sql.Row] = Array([-2147483648]) > {code} > The result should be 2147483648 instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31761) Sql Div operator can result in incorrect output for int_min
[ https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113344#comment-17113344 ] Sandeep Katta commented on SPARK-31761: --- but to fix this do we need to revert https://issues.apache.org/jira/browse/SPARK-16323 or we just cast input to long and divide ? > Sql Div operator can result in incorrect output for int_min > --- > > Key: SPARK-31761 > URL: https://issues.apache.org/jira/browse/SPARK-31761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Priority: Major > > Input in csv : -2147483648,-1 --> (_c0, _c1) > {code} > val res = df.selectExpr("_c0 div _c1") > res.collect > res1: Array[org.apache.spark.sql.Row] = Array([-2147483648]) > {code} > The result should be 2147483648 instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
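The arithmetic behind the question above is easy to check in isolation: 32-bit integer division of Integer.MIN_VALUE by -1 overflows, because the true quotient 2147483648 does not fit in an int, while widening the dividend to long before dividing avoids the overflow — which is what "cast input to long and divide" would buy without reverting SPARK-16323.

```java
public class IntMinDiv {
    public static void main(String[] args) {
        int a = Integer.MIN_VALUE, b = -1;
        // 32-bit division wraps: the true quotient 2147483648 is not an int
        System.out.println(a / b);        // -2147483648
        // Widening to 64 bits before dividing gives the exact result
        System.out.println((long) a / b); // 2147483648
    }
}
```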
[jira] [Commented] (SPARK-31761) Sql Div operator can result in incorrect output for int_min
[ https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113337#comment-17113337 ] Sandeep Katta commented on SPARK-31761: --- I can fix this, I will raise PR > Sql Div operator can result in incorrect output for int_min > --- > > Key: SPARK-31761 > URL: https://issues.apache.org/jira/browse/SPARK-31761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Priority: Major > > Input in csv : -2147483648,-1 --> (_c0, _c1) > {code} > val res = df.selectExpr("_c0 div _c1") > res.collect > res1: Array[org.apache.spark.sql.Row] = Array([-2147483648]) > {code} > The result should be 2147483648 instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31761) Sql Div operator can result in incorrect output for int_min
[ https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112368#comment-17112368 ] Sandeep Katta edited comment on SPARK-31761 at 5/20/20, 3:46 PM: - [~sowen] [~hyukjin.kwon] [~dongjoon] I have executed the same query in spark-2.4.4 and it works as expected. As you can see from the 2.4.4 plan, the columns are cast to double, so there is no *Integer overflow*.

*Spark 2.4.4 Plan*

== Parsed Logical Plan ==
'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS BIGINT)#4]
+- Relation[col0#0,col1#1] csv

== Analyzed Logical Plan ==
CAST((col0 / col1) AS BIGINT): bigint
Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
+- Relation[col0#0,col1#1] csv

== Optimized Logical Plan ==
Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
+- Relation[col0#0,col1#1] csv

== Physical Plan ==
*(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
+- *(1) FileScan csv [col0#0,col1#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col0:int,col1:int>

*Spark-3.0 Plan*

== Parsed Logical Plan ==
'Project [('col0 div 'col1) AS (col0 div col1)#4]
+- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv

== Analyzed Logical Plan ==
(col0 div col1): int
Project [(col0#0 div col1#1) AS (col0 div col1)#4]
+- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv

== Optimized Logical Plan ==
Project [(col0#0 div col1#1) AS (col0 div col1)#4]
+- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv

== Physical Plan ==
*(1) Project [(col0#0 div col1#1) AS (col0 div col1)#4]
+- BatchScan[col0#0, col1#1] CSVScan Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv], ReadSchema: struct<col0:int,col1:int>

In Spark 3, do I need to cast the columns as in spark-2.4, or should the user manually add a cast to the query, as per the example below?

val schema = "col0 int,col1 int"
val df = spark.read.schema(schema).csv("file:/opt/fordebug/divTest.csv")
val res = df.selectExpr("col0 div col1")
val res = df.selectExpr("Cast(col0 as Decimal) div col1")
res.collect

Please let us know your opinion.

was (Author: sandeep.katta2007): [~sowen] [~hyukjin.kwon] I have executed the same query in spark-2.4.4 it works as per expectation.
[jira] [Commented] (SPARK-31761) Sql Div operator can result in incorrect output for int_min
[ https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112368#comment-17112368 ] Sandeep Katta commented on SPARK-31761: --- [~sowen] [~hyukjin.kwon] I have executed the same query in spark-2.4.4 and it works as expected. As you can see from the 2.4.4 plan, the columns are cast to double, so there is no *Integer overflow*.

*Spark 2.4.4 Plan*

== Parsed Logical Plan ==
'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS BIGINT)#4]
+- Relation[col0#0,col1#1] csv

== Analyzed Logical Plan ==
CAST((col0 / col1) AS BIGINT): bigint
Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
+- Relation[col0#0,col1#1] csv

== Optimized Logical Plan ==
Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
+- Relation[col0#0,col1#1] csv

== Physical Plan ==
*(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
+- *(1) FileScan csv [col0#0,col1#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col0:int,col1:int>

*Spark-3.0 Plan*

== Parsed Logical Plan ==
'Project [('col0 div 'col1) AS (col0 div col1)#4]
+- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv

== Analyzed Logical Plan ==
(col0 div col1): int
Project [(col0#0 div col1#1) AS (col0 div col1)#4]
+- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv

== Optimized Logical Plan ==
Project [(col0#0 div col1#1) AS (col0 div col1)#4]
+- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv

== Physical Plan ==
*(1) Project [(col0#0 div col1#1) AS (col0 div col1)#4]
+- BatchScan[col0#0, col1#1] CSVScan Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv], ReadSchema: struct<col0:int,col1:int>

In Spark 3, do I need to cast the columns as in spark-2.4, or should the user manually add a cast to the query, as per the example below?

val schema = "col0 int,col1 int"
val df = spark.read.schema(schema).csv("file:/opt/fordebug/divTest.csv")
val res = df.selectExpr("col0 div col1")
val res = df.selectExpr("Cast(col0 as Decimal) div col1")
res.collect

Please let us know your opinion.
[jira] [Commented] (SPARK-31761) Sql Div operator can result in incorrect output for int_min
[ https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111294#comment-17111294 ] Sandeep Katta commented on SPARK-31761: --- [~kshukla] thanks for raising this, I cross checked this in 2.4.4 it seems to work fine. I will analyse w.r.t 3.0.0 and raise PR soon . > Sql Div operator can result in incorrect output for int_min > --- > > Key: SPARK-31761 > URL: https://issues.apache.org/jira/browse/SPARK-31761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Priority: Major > > Input in csv : -2147483648,-1 --> (_c0, _c1) > {code} > val res = df.selectExpr("_c0 div _c1") > res.collect > res1: Array[org.apache.spark.sql.Row] = Array([-2147483648]) > {code} > The result should be 2147483648 instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31617) support drop multiple functions
[ https://issues.apache.org/jira/browse/SPARK-31617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100451#comment-17100451 ] Sandeep Katta commented on SPARK-31617: --- [~hyukjin.kwon] if this is required by the Spark community I can work on this; what's your suggestion? > support drop multiple functions > --- > > Key: SPARK-31617 > URL: https://issues.apache.org/jira/browse/SPARK-31617 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jobit mathew >Priority: Minor > > PostgreSQL supports dropping multiple functions in one command. It would be better if Spark > SQL could also support this. > > [https://www.postgresql.org/docs/12/sql-dropfunction.html] > Drop multiple functions in one command: > DROP FUNCTION sqrt(integer), sqrt(bigint);
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077939#comment-17077939 ] Sandeep Katta commented on SPARK-23128: --- [~cloud_fan] [~carsonwang] any updates on dynamic parallelism and skew Handling. Is it fixed in 3.0.0 > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Assignee: Carson Wang >Priority: Major > Fix For: 3.0.0 > > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. 
> We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30096) drop function does not delete the jars file from tmp folder- session is not getting clear
[ https://issues.apache.org/jira/browse/SPARK-30096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044353#comment-17044353 ] Sandeep Katta commented on SPARK-30096: --- [~abhishek.akg] I have cross-checked the behaviour: those temp files will be deleted once the session is closed. I double-checked the code and found the same. Can you please recheck after exiting the session? The code which cleans the temp directory: !screenshot-1.png! > drop function does not delete the jars file from tmp folder- session is not > getting clear > - > > Key: SPARK-30096 > URL: https://issues.apache.org/jira/browse/SPARK-30096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: screenshot-1.png > > > Steps: > 1. {{spark-master --master yarn}} > 2. {{spark-sql> create function addDoubles AS > 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar > 'hdfs://hacluster/user/user1/AddDoublesUDF.jar';}} > 3. {{spark-sql> select addDoubles (1,2);}} > In the console log user can see > Added > [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to > class path > {code} > 19/12/02 13:49:33 INFO SessionState: Added > [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to > class path > {code} > 4. {{spark-sql>drop function AddDoubles;}} > Check the tmp folder still session is not clear > {code:java} > vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # ll > total 11696 > -rw-r--r-- 1 root root92660 Dec 2 13:49 .AddDoublesUDF.jar.crc > -rwxr-xr-x 1 root root 11859263 Dec 2 13:49 AddDoublesUDF.jar > vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # > {code}
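The lifecycle described in the comment — DROP FUNCTION leaves the jar in the session's temp resource directory, and cleanup happens only when the session exits — can be sketched as follows. This is an illustrative model, not Hive SessionState's actual code; the directory and jar names are assumptions taken from the report.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class SessionResources implements AutoCloseable {
    private final File dir;
    private final File jar;

    SessionResources() throws IOException {
        // Session-scoped dir, like /tmp/<sessionId>_resources in the report
        dir = Files.createTempDirectory("session_resources").toFile();
        jar = new File(dir, "AddDoublesUDF.jar");
        jar.createNewFile();
    }

    boolean jarExists() { return jar.exists(); }

    // Cleanup runs on session close, not on DROP FUNCTION
    @Override public void close() { jar.delete(); dir.delete(); }

    public static void main(String[] args) throws IOException {
        try (SessionResources session = new SessionResources()) {
            // DROP FUNCTION alone leaves the jar on disk
            System.out.println(session.jarExists()); // true
        }
        // after close() (i.e. session exit), the resource dir is removed
    }
}
```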
[jira] [Updated] (SPARK-30096) drop function does not delete the jars file from tmp folder- session is not getting clear
[ https://issues.apache.org/jira/browse/SPARK-30096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-30096: -- Attachment: screenshot-1.png > drop function does not delete the jars file from tmp folder- session is not > getting clear > - > > Key: SPARK-30096 > URL: https://issues.apache.org/jira/browse/SPARK-30096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: screenshot-1.png > > > Steps: > 1. {{spark-master --master yarn}} > 2. {{spark-sql> create function addDoubles AS > 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar > 'hdfs://hacluster/user/user1/AddDoublesUDF.jar';}} > 3. {{spark-sql> select addDoubles (1,2);}} > In the console log user can see > Added > [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to > class path > {code} > 19/12/02 13:49:33 INFO SessionState: Added > [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to > class path > {code} > 4. {{spark-sql>drop function AddDoubles;}} > Check the tmp folder still session is not clear > {code:java} > vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # ll > total 11696 > -rw-r--r-- 1 root root92660 Dec 2 13:49 .AddDoublesUDF.jar.crc > -rwxr-xr-x 1 root root 11859263 Dec 2 13:49 AddDoublesUDF.jar > vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30133) Support DELETE Jar and DELETE File functionality in spark
[ https://issues.apache.org/jira/browse/SPARK-30133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014328#comment-17014328 ] Sandeep Katta commented on SPARK-30133: --- [~hyukjin.kwon] I have raised the PR for these, Please help me to review it. > Support DELETE Jar and DELETE File functionality in spark > - > > Key: SPARK-30133 > URL: https://issues.apache.org/jira/browse/SPARK-30133 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > Labels: Umbrella > > SPARK should support delete jar feature > This feature aims at solving below use case. > Currently in spark add jar API supports to add the jar to executor and Driver > ClassPath at runtime, if there is any change in this jar definition there is > no way user can update the jar to executor and Driver classPath. User needs > to restart the application to solve this problem which is costly operation. > After this JIRA fix user can use delete jar API to remove the jar from Driver > and Executor ClassPath without the need of restarting the any spark > application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
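One technical wrinkle behind this feature: a live URLClassLoader can gain URLs but never lose them, so removing a jar from the classpath implies building a new loader from the remaining URLs. A hedged sketch of that idea — purely illustrative, not the code in the actual PR:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;

public class JarRemoval {
    // Drop the named jar from a classpath list; the caller then rebuilds the loader
    static String[] withoutJar(String[] classpath, String jarName) {
        return Arrays.stream(classpath)
                .filter(p -> !p.endsWith(jarName))
                .toArray(String[]::new);
    }

    public static void main(String[] args) throws Exception {
        String[] remaining = withoutJar(
                new String[] { "file:/tmp/a.jar", "file:/tmp/AddDoublesUDF.jar" },
                "AddDoublesUDF.jar");
        // A URLClassLoader cannot shed URLs, so "delete jar" means a fresh loader
        URL[] urls = new URL[remaining.length];
        for (int i = 0; i < remaining.length; i++) urls[i] = new URL(remaining[i]);
        try (URLClassLoader rebuilt = new URLClassLoader(urls)) {
            System.out.println(rebuilt.getURLs().length); // 1
        }
    }
}
```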
[jira] [Updated] (SPARK-30133) Support DELETE Jar and DELETE File functionality in spark
[ https://issues.apache.org/jira/browse/SPARK-30133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-30133: -- Description: Spark should support a DELETE JAR feature. This feature aims at solving the following use case: the existing ADD JAR API adds a jar to the executor and driver classpath at runtime, but if that jar changes there is no way for the user to update it on the executor and driver classpath; the only workaround is to restart the application, which is a costly operation. Once this JIRA is fixed, the user can use the DELETE JAR API to remove a jar from the driver and executor classpath without restarting the Spark application. was:SPARK should support delete jar feature > Support DELETE Jar and DELETE File functionality in spark > - > > Key: SPARK-30133 > URL: https://issues.apache.org/jira/browse/SPARK-30133 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > Labels: Umbrella > > Spark should support a DELETE JAR feature. This feature aims at solving the > following use case: the existing ADD JAR API adds a jar to the executor and > driver classpath at runtime, but if that jar changes there is no way for the > user to update it on the executor and driver classpath; the only workaround is > to restart the application, which is a costly operation. Once this JIRA is > fixed, the user can use the DELETE JAR API to remove a jar from the driver and > executor classpath without restarting the Spark application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
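A minimal sketch of the bookkeeping the umbrella issue proposes. The JarRegistry type and its methods are invented here for illustration and are not the proposed Spark API: ADD JAR appends to the runtime classpath list, and DELETE JAR removes the entry again without any restart.

```scala
// Toy model of add-jar / delete-jar classpath bookkeeping.
import scala.collection.mutable

final class JarRegistry {
  // LinkedHashSet keeps insertion order, like a classpath would
  private val added = mutable.LinkedHashSet.empty[String]

  def addJar(path: String): Unit = added += path

  // Returns true when the jar was present and has been removed.
  def deleteJar(path: String): Boolean = added.remove(path)

  def classpath: Seq[String] = added.toSeq
}

val reg = new JarRegistry
reg.addJar("hdfs://ns/jars/udf-v1.jar")
reg.addJar("hdfs://ns/jars/other.jar")
// Delete removes only the named jar; the rest of the classpath survives
val removed = reg.deleteJar("hdfs://ns/jars/udf-v1.jar")
```

The real work the sub-tasks track is propagating such a removal to the driver and executor classloaders at runtime, which this sketch deliberately leaves out.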
[jira] [Commented] (SPARK-30425) FileScan of Data Source V2 doesn't implement Partition Pruning
[ https://issues.apache.org/jira/browse/SPARK-30425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010408#comment-17010408 ] Sandeep Katta commented on SPARK-30425: --- [~gengliang] > FileScan of Data Source V2 doesn't implement Partition Pruning > -- > > Key: SPARK-30425 > URL: https://issues.apache.org/jira/browse/SPARK-30425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Haifeng Chen >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > I was trying to understand how Data Source V2 handles partition pruning, but I > didn't find code anywhere in the current Data Source V2 implementation that > filters out the unnecessary files. For a file data source, the base class > FileScan of Data Source V2 should presumably handle this in the "partitions" > method, but the current implementation looks like the following: > protected def partitions: Seq[FilePartition] = { > val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty) > > listFiles is passed empty sequences, so no files are filtered out by the > partition filters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30425) FileScan of Data Source V2 doesn't implement Partition Pruning
[ https://issues.apache.org/jira/browse/SPARK-30425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010407#comment-17010407 ] Sandeep Katta commented on SPARK-30425: --- Is this a duplicate of [SPARK-30428|https://issues.apache.org/jira/browse/SPARK-30428]? > FileScan of Data Source V2 doesn't implement Partition Pruning > -- > > Key: SPARK-30425 > URL: https://issues.apache.org/jira/browse/SPARK-30425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Haifeng Chen >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > I was trying to understand how Data Source V2 handles partition pruning, but I > didn't find code anywhere in the current Data Source V2 implementation that > filters out the unnecessary files. For a file data source, the base class > FileScan of Data Source V2 should presumably handle this in the "partitions" > method, but the current implementation looks like the following: > protected def partitions: Seq[FilePartition] = { > val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty) > > listFiles is passed empty sequences, so no files are filtered out by the > partition filters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
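The quoted partitions implementation calls fileIndex.listFiles(Seq.empty, Seq.empty), so no partition filters ever reach the file listing. The effect can be illustrated with a toy model; PartitionDir and listFiles below are simplified stand-ins for Spark's FileIndex/FilePartition machinery, not the real API:

```scala
// One entry per partition directory, with its partition values and files.
final case class PartitionDir(values: Map[String, Int], files: Seq[String])

// Analogue of fileIndex.listFiles: keep only partitions that satisfy
// every supplied partition filter, then return their files.
def listFiles(index: Seq[PartitionDir],
              partitionFilters: Seq[Map[String, Int] => Boolean]): Seq[String] =
  index.filter(p => partitionFilters.forall(f => f(p.values))).flatMap(_.files)

val index = Seq(
  PartitionDir(Map("pk" -> 1), Seq("pk=1/part-0.parquet")),
  PartitionDir(Map("pk" -> 2), Seq("pk=2/part-0.parquet"))
)

// Seq.empty (the quoted code): every partition's files come back.
val unpruned = listFiles(index, Seq.empty)
// A real predicate: only the matching partition survives.
val pruned = listFiles(index, Seq((v: Map[String, Int]) => v("pk") == 1))
```

With empty filters both partitions are read; with the predicate only pk=1 is, which is exactly the pruning the reporter is asking about.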
[jira] [Updated] (SPARK-30362) InputMetrics are not updated for DataSourceRDD V2
[ https://issues.apache.org/jira/browse/SPARK-30362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-30362: -- Attachment: inputMetrics.png > InputMetrics are not updated for DataSourceRDD V2 > -- > > Key: SPARK-30362 > URL: https://issues.apache.org/jira/browse/SPARK-30362 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > Attachments: inputMetrics.png > > > InputMetrics is not updated for DataSourceRDD -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30362) InputMetrics are not updated for DataSourceRDD V2
Sandeep Katta created SPARK-30362: - Summary: InputMetrics are not updated for DataSourceRDD V2 Key: SPARK-30362 URL: https://issues.apache.org/jira/browse/SPARK-30362 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Sandeep Katta InputMetrics is not updated for DataSourceRDD -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30333) Bump jackson-databind to 2.6.7.3
Sandeep Katta created SPARK-30333: - Summary: Bump jackson-databind to 2.6.7.3 Key: SPARK-30333 URL: https://issues.apache.org/jira/browse/SPARK-30333 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: Sandeep Katta To fix below CVE CVE-2018-14718 CVE-2018-14719 CVE-2018-14720 CVE-2018-14721 CVE-2018-19360, CVE-2018-19361 CVE-2018-19362 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30318) Bump jetty to 9.3.27.v20190418
Sandeep Katta created SPARK-30318: - Summary: Bump jetty to 9.3.27.v20190418 Key: SPARK-30318 URL: https://issues.apache.org/jira/browse/SPARK-30318 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: Sandeep Katta Upgrade jetty to 9.3.27.v20190418 to fix CVE-2019-10241 and CVE-2019-10247 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000782#comment-17000782 ] Sandeep Katta commented on SPARK-27021: --- [~hyukjin.kwon] [~dongjoon] this issue has not been backported to branch-2.4. Please check whether the backport is required. > Leaking Netty event loop group for shuffle chunk fetch requests > --- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-12-14-23-23-50-384.png > > > The extra event loop group created for handling shuffle chunk fetch requests > is never closed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30137) Support DELETE file
[ https://issues.apache.org/jira/browse/SPARK-30137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994263#comment-16994263 ] Sandeep Katta commented on SPARK-30137: --- Yes, I am working on this; I will raise the PR by tomorrow. > Support DELETE file > > > Key: SPARK-30137 > URL: https://issues.apache.org/jira/browse/SPARK-30137 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-30175) Eliminate warnings: part 5
[ https://issues.apache.org/jira/browse/SPARK-30175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-30175: -- Comment: was deleted (was: thanks for raising, will raise PR soon) > Eliminate warnings: part 5 > -- > > Key: SPARK-30175 > URL: https://issues.apache.org/jira/browse/SPARK-30175 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/WriteToMicroBatchDataSource.scala > {code:java} > Warning:Warning:line (36)class WriteToDataSourceV2 in package v2 is > deprecated (since 2.4.0): Use specific logical plans like AppendData instead > def createPlan(batchId: Long): WriteToDataSourceV2 = { > Warning:Warning:line (37)class WriteToDataSourceV2 in package v2 is > deprecated (since 2.4.0): Use specific logical plans like AppendData instead > WriteToDataSourceV2(new MicroBatchWrite(batchId, write), query) > {code} > sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala > {code:java} > Warning:Warning:line (703)a pure expression does nothing in statement > position; multiline expressions might require enclosing parentheses > q1 > {code} > sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingAggregationSuite.scala > {code:java} > Warning:Warning:line (285)object typed in package scalalang is deprecated > (since 3.0.0): please use untyped builtin aggregate functions. > val aggregated = > inputData.toDS().groupByKey(_._1).agg(typed.sumLong(_._2)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30175) Eliminate warnings: part 5
[ https://issues.apache.org/jira/browse/SPARK-30175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991146#comment-16991146 ] Sandeep Katta commented on SPARK-30175: --- thanks for raising, will raise PR soon > Eliminate warnings: part 5 > -- > > Key: SPARK-30175 > URL: https://issues.apache.org/jira/browse/SPARK-30175 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/WriteToMicroBatchDataSource.scala > sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala > sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingAggregationSuite.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30135) Add documentation for DELETE JAR and DELETE File command
[ https://issues.apache.org/jira/browse/SPARK-30135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-30135: -- Summary: Add documentation for DELETE JAR and DELETE File command (was: Add documentation for DELETE JAR command) > Add documentation for DELETE JAR and DELETE File command > > > Key: SPARK-30135 > URL: https://issues.apache.org/jira/browse/SPARK-30135 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30137) Support DELETE file
[ https://issues.apache.org/jira/browse/SPARK-30137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988508#comment-16988508 ] Sandeep Katta commented on SPARK-30137: --- I have started working on this feature and will raise the PR soon. > Support DELETE file > > > Key: SPARK-30137 > URL: https://issues.apache.org/jira/browse/SPARK-30137 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30137) Support DELETE file
Sandeep Katta created SPARK-30137: - Summary: Support DELETE file Key: SPARK-30137 URL: https://issues.apache.org/jira/browse/SPARK-30137 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Sandeep Katta -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30133) Support DELETE Jar and DELETE File functionality in spark
[ https://issues.apache.org/jira/browse/SPARK-30133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-30133: -- Summary: Support DELETE Jar and DELETE File functionality in spark (was: Support DELETE Jar functionality in spark) > Support DELETE Jar and DELETE File functionality in spark > - > > Key: SPARK-30133 > URL: https://issues.apache.org/jira/browse/SPARK-30133 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > Labels: Umbrella > > SPARK should support delete jar feature -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30136) DELETE JAR should also remove the jar from executor classPath
Sandeep Katta created SPARK-30136: - Summary: DELETE JAR should also remove the jar from executor classPath Key: SPARK-30136 URL: https://issues.apache.org/jira/browse/SPARK-30136 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Sandeep Katta -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30136) DELETE JAR should also remove the jar from executor classPath
[ https://issues.apache.org/jira/browse/SPARK-30136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988504#comment-16988504 ] Sandeep Katta commented on SPARK-30136: --- I have started working on this feature and will raise the PR soon. > DELETE JAR should also remove the jar from executor classPath > - > > Key: SPARK-30136 > URL: https://issues.apache.org/jira/browse/SPARK-30136 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30134) DELETE JAR should remove from addedJars list and from classpath
[ https://issues.apache.org/jira/browse/SPARK-30134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988502#comment-16988502 ] Sandeep Katta commented on SPARK-30134: --- I have started working on this feature and will raise the PR soon. > DELETE JAR should remove from addedJars list and from classpath > > > Key: SPARK-30134 > URL: https://issues.apache.org/jira/browse/SPARK-30134 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30135) Add documentation for DELETE JAR command
[ https://issues.apache.org/jira/browse/SPARK-30135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988503#comment-16988503 ] Sandeep Katta commented on SPARK-30135: --- I have started working on this feature and will raise the PR soon. > Add documentation for DELETE JAR command > > > Key: SPARK-30135 > URL: https://issues.apache.org/jira/browse/SPARK-30135 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30135) Add documentation for DELETE JAR command
Sandeep Katta created SPARK-30135: - Summary: Add documentation for DELETE JAR command Key: SPARK-30135 URL: https://issues.apache.org/jira/browse/SPARK-30135 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.0.0 Reporter: Sandeep Katta -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30134) DELETE JAR should remove from addedJars list and from classpath
Sandeep Katta created SPARK-30134: - Summary: DELETE JAR should remove from addedJars list and from classpath Key: SPARK-30134 URL: https://issues.apache.org/jira/browse/SPARK-30134 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Sandeep Katta -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30133) Support DELETE Jar functionality in spark
Sandeep Katta created SPARK-30133: - Summary: Support DELETE Jar functionality in spark Key: SPARK-30133 URL: https://issues.apache.org/jira/browse/SPARK-30133 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Sandeep Katta SPARK should support delete jar feature -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30096) drop function does not delete the jars file from tmp folder- session is not getting clear
[ https://issues.apache.org/jira/browse/SPARK-30096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985862#comment-16985862 ] Sandeep Katta commented on SPARK-30096: --- [~abhishek.akg] Thanks for raising this issue; I will analyze it and raise a PR if required. > drop function does not delete the jars file from tmp folder- session is not > getting clear > - > > Key: SPARK-30096 > URL: https://issues.apache.org/jira/browse/SPARK-30096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Steps: > 1.spark-master --master yarn > 2. spark-sql> create function addDoubles AS > 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar > 'hdfs://hacluster/user/user1/AddDoublesUDF.jar'; > 3. spark-sql> select addDoubles (1,2); > In the console log user can see > Added [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] > to class path > 19/12/02 13:49:33 INFO SessionState: Added > [/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources/AddDoublesUDF.jar] to > class path > 4. spark-sql>drop function AddDoubles; > Check the tmp folder still session is not clear > vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # ll > total 11696 > -rw-r--r-- 1 root root92660 Dec 2 13:49 .AddDoublesUDF.jar.crc > -rwxr-xr-x 1 root root 11859263 Dec 2 13:49 AddDoublesUDF.jar > vm1:/tmp/23090937-7314-43d8-a859-b35f42d37bdf_resources # -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29504) Tooltip not display for Job Description even it shows ellipsed
[ https://issues.apache.org/jira/browse/SPARK-29504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955677#comment-16955677 ] Sandeep Katta commented on SPARK-29504: --- This is an invalid issue; Spark already supports this feature. If you double-click on the column, it shows the full description. Refer to [SPARK-27135|https://issues.apache.org/jira/browse/SPARK-27135]. > Tooltip not display for Job Description even it shows ellipsed > --- > > Key: SPARK-29504 > URL: https://issues.apache.org/jira/browse/SPARK-29504 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: ToolTip JIRA.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952537#comment-16952537 ] Sandeep Katta commented on SPARK-29465: --- [~dongjoon] thank you for the review. I will raise PR soon > Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. > - > > Key: SPARK-29465 > URL: https://issues.apache.org/jira/browse/SPARK-29465 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, YARN >Affects Versions: 3.0.0 >Reporter: Vishwas Nalka >Priority: Major > > I'm trying to restrict the ports used by spark app which is launched in yarn > cluster mode. All ports (viz. driver, executor, blockmanager) could be > specified using the respective properties except the ui port. The spark app > is launched using JAVA code and setting the property spark.ui.port in > sparkConf doesn't seem to help. Even setting a JVM option > -Dspark.ui.port="some_port" does not spawn the UI is required port. > From the logs of the spark app, *_the property spark.ui.port is overridden > and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set > to 0. 
> _(Run in Spark 1.6.2) From the logs ->_ > _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH" > {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m > -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' > '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' > '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' > '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_ > _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port > 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ > [_http://10.65.170.98:35167_|http://10.65.170.98:35167/] > Even tried using a *spark-submit command with --conf spark.ui.port* does > spawn UI in required port > {color:#172b4d}_(Run in Spark 2.4.4)_{color} > {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi > --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g > --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 > --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color} > _From the logs::_ > _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at > [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_ > _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m > -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' 'Dspark.driver.port=12340' > -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 > --executor-id --hostname --cores 1 --app-id > application_1570992022035_0089 --user-class-path > [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_ > > Looks like the application master override this and set a JVM property before > launch resulting in random UI port even though spark.ui.port is set by the > user. 
> In these links > # > [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 214) > # > [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 75) > I can see that the method _*run() in above files sets a system property > UI_PORT*_ and _*spark.ui.port respectively.*_ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951652#comment-16951652 ] Sandeep Katta commented on SPARK-29465: --- agreed, I will raise the PR soon > Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. > - > > Key: SPARK-29465 > URL: https://issues.apache.org/jira/browse/SPARK-29465 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 1.6.2, 2.4.4 >Reporter: Vishwas Nalka >Priority: Major > > I'm trying to restrict the ports used by spark app which is launched in yarn > cluster mode. All ports (viz. driver, executor, blockmanager) could be > specified using the respective properties except the ui port. The spark app > is launched using JAVA code and setting the property spark.ui.port in > sparkConf doesn't seem to help. Even setting a JVM option > -Dspark.ui.port="some_port" does not spawn the UI is required port. > From the logs of the spark app, *_the property spark.ui.port is overridden > and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set > to 0. 
> _(Run in Spark 1.6.2) From the logs ->_ > _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH" > {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m > -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' > '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' > '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' > '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_ > _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port > 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ > [_http://10.65.170.98:35167_|http://10.65.170.98:35167/] > Even tried using a *spark-submit command with --conf spark.ui.port* does > spawn UI in required port > {color:#172b4d}_(Run in Spark 2.4.4)_{color} > {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi > --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g > --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 > --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color} > _From the logs::_ > _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at > [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_ > _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m > -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' 'Dspark.driver.port=12340' > -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 > --executor-id --hostname --cores 1 --app-id > application_1570992022035_0089 --user-class-path > [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_ > > Looks like the application master override this and set a JVM property before > launch resulting in random UI port even though spark.ui.port is set by the > user. 
> In these links > # > [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 214) > # > [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 75) > I can see that the method _*run() in above files sets a system property > UI_PORT*_ and _*spark.ui.port respectively.*_ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29465) Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951581#comment-16951581 ] Sandeep Katta commented on SPARK-29465: --- Per the comments in the [code|https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L212], the behavior is correct: a user can submit multiple Spark applications on the same box. Ping [~hyukjin.kwon] [~dongjoon], what is your suggestion? > Unable to configure SPARK UI (spark.ui.port) in spark yarn cluster mode. > - > > Key: SPARK-29465 > URL: https://issues.apache.org/jira/browse/SPARK-29465 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 1.6.2, 2.4.4 >Reporter: Vishwas Nalka >Priority: Major > > I'm trying to restrict the ports used by spark app which is launched in yarn > cluster mode. All ports (viz. driver, executor, blockmanager) could be > specified using the respective properties except the ui port. The spark app > is launched using JAVA code and setting the property spark.ui.port in > sparkConf doesn't seem to help. Even setting a JVM option > -Dspark.ui.port="some_port" does not spawn the UI is required port. > From the logs of the spark app, *_the property spark.ui.port is overridden > and the JVM property '-Dspark.ui.port=0' is set_* even though it is never set > to 0. 
> _(Run in Spark 1.6.2) From the logs ->_ > _command:LD_LIBRARY_PATH="/usr/hdp/2.6.4.0-91/hadoop/lib/native:$LD_LIBRARY_PATH" > {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms4096m > -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.blockManager.port=9900' > '-Dspark.driver.port=9902' '-Dspark.fileserver.port=9903' > '-Dspark.broadcast.port=9904' '-Dspark.port.maxRetries=20' > '-Dspark.ui.port=0' '-Dspark.executor.port=9905'_ > _19/10/14 16:39:59 INFO Utils: Successfully started service 'SparkUI' on port > 35167.19/10/14 16:39:59 INFO SparkUI: Started SparkUI at_ > [_http://10.65.170.98:35167_|http://10.65.170.98:35167/] > Even tried using a *spark-submit command with --conf spark.ui.port* does > spawn UI in required port > {color:#172b4d}_(Run in Spark 2.4.4)_{color} > {color:#172b4d}_./bin/spark-submit --class org.apache.spark.examples.SparkPi > --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g > --executor-cores 1 --conf spark.ui.port=12345 --conf spark.driver.port=12340 > --queue default examples/jars/spark-examples_2.11-2.4.4.jar 10_{color} > _From the logs::_ > _19/10/15 00:04:05 INFO ui.SparkUI: Stopped Spark web UI at > [http://invrh74ace005.informatica.com:46622|http://invrh74ace005.informatica.com:46622/]_ > _command:{{JAVA_HOME}}/bin/java -server -Xmx2048m > -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.ui.port=0' 'Dspark.driver.port=12340' > -Dspark.yarn.app.container.log.dir= -XX:OnOutOfMemoryError='kill %p' > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@invrh74ace005.informatica.com:12340 > --executor-id --hostname --cores 1 --app-id > application_1570992022035_0089 --user-class-path > [file:$PWD/__app__.jar1|file://%24pwd/__app__.jar1]>/stdout2>/stderr_ > > Looks like the application master override this and set a JVM property before > launch resulting in random UI port even though spark.ui.port is set by the > user. 
> In these links > # > [https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 214) > # > [https://github.com/cloudera/spark/blob/master/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] > (line 75) > I can see that the _*run()*_ method in the above files sets the system > properties _*UI_PORT*_ and _*spark.ui.port*_, respectively. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
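The override is easier to see in light of how ephemeral ports work: binding a listen socket to port 0 delegates the port choice to the OS, which is the effect of the `-Dspark.ui.port=0` the ApplicationMaster injects and why the UI comes up on a random port (e.g. 35167) regardless of the user's setting. A minimal sketch using plain sockets (illustration only, not Spark code; `pick_ephemeral_port` is a hypothetical helper name):

```python
import socket

def pick_ephemeral_port(host: str = "127.0.0.1") -> int:
    """Bind to port 0 so the OS assigns any free ephemeral port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))           # port 0 -> OS picks a free port
        return s.getsockname()[1]   # the port actually bound

if __name__ == "__main__":
    print(pick_ephemeral_port())    # some OS-assigned port, differs per run
```

This is why forcing port 0 lets several drivers share one node without bind conflicts, at the cost of making the port unpredictable.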
[jira] [Issue Comment Deleted] (SPARK-29453) Improve tooltip information for SQL tab
[ https://issues.apache.org/jira/browse/SPARK-29453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-29453: -- Comment: was deleted (was: working on this will raise PR soon) > Improve tooltip information for SQL tab > --- > > Key: SPARK-29453 > URL: https://issues.apache.org/jira/browse/SPARK-29453 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29453) Improve tooltip information for SQL tab
[ https://issues.apache.org/jira/browse/SPARK-29453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950364#comment-16950364 ] Sandeep Katta commented on SPARK-29453: --- Working on this; will raise a PR soon > Improve tooltip information for SQL tab > --- > > Key: SPARK-29453 > URL: https://issues.apache.org/jira/browse/SPARK-29453 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29453) Improve tooltip information for SQL tab
Sandeep Katta created SPARK-29453: - Summary: Improve tooltip information for SQL tab Key: SPARK-29453 URL: https://issues.apache.org/jira/browse/SPARK-29453 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.0.0 Reporter: Sandeep Katta -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29452) Improve tooltip information for storage tab
Sandeep Katta created SPARK-29452: - Summary: Improve tooltip information for storage tab Key: SPARK-29452 URL: https://issues.apache.org/jira/browse/SPARK-29452 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.0.0 Reporter: Sandeep Katta -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29452) Improve tooltip information for storage tab
[ https://issues.apache.org/jira/browse/SPARK-29452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950344#comment-16950344 ] Sandeep Katta commented on SPARK-29452: --- I will raise a PR for this soon > Improve tooltip information for storage tab > -- > > Key: SPARK-29452 > URL: https://issues.apache.org/jira/browse/SPARK-29452 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29435) Spark 3 doesn't work with older shuffle service
[ https://issues.apache.org/jira/browse/SPARK-29435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950261#comment-16950261 ] Sandeep Katta commented on SPARK-29435: --- [~XuanYuan] thank you for your comments; I will update the PR as per these changes > Spark 3 doesn't work with older shuffle service > -- > > Key: SPARK-29435 > URL: https://issues.apache.org/jira/browse/SPARK-29435 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0 > Environment: Spark 3 from Sept 26, commit > 8beb736a00b004f97de7fcdf9ff09388d80fc548 > Spark 2.4.1 shuffle service in yarn >Reporter: koert kuipers >Priority: Major > > SPARK-27665 introduced a change to the shuffle protocol. It also introduced a > setting spark.shuffle.useOldFetchProtocol, which should allow Spark 3 to run > with an old shuffle service. > However, I have not gotten that to work. I have been testing with Spark 3 > master (from Sept 26) and the shuffle service from Spark 2.4.1 in yarn. 
> The errors i see are for example on EMR: > {code} > Error occurred while fetching local blocks > java.nio.file.NoSuchFileException: > /mnt1/yarn/usercache/hadoop/appcache/application_1570697024032_0058/blockmgr-d1d009b1-1c95-4e2a-9a71-0ff20078b9a8/38/shuffle_0_0_0.index > {code} > And on CDH5: > {code} > org.apache.spark.shuffle.FetchFailedException: > /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:596) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:511) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:67) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:850) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:850) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at 
org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.nio.file.NoSuchFileException: > /data/9/hadoop/nm/usercache/koert/appcache/application_1568061697664_8250/blockmgr-57f28014-cdf2-431e-8e11-447ba5c2b2f2/0b/shuffle_0_0_0.index > at > sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) > at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) > at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) > at > sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) > at java.nio.file.Files.newByteChannel(Files.java:361) > at java.nio.file.Files.newByteChannel(Files.java:407) > at > org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:204) > at > org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:551) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:349) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:391) > at > org.apache.spark.storage.ShuffleBloc
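SPARK-27665 replaced the old OpenBlocks-based shuffle fetch with a new FetchShuffleBlocks message, and spark.shuffle.useOldFetchProtocol is meant to gate the fallback for pre-3.0 shuffle services. A hedged, illustrative sketch of what that compatibility switch has to decide (hypothetical function and flag names; this is not Spark's actual implementation):

```python
def choose_fetch_message(use_old_fetch_protocol: bool,
                         server_supports_new_protocol: bool) -> str:
    """Pick the shuffle-fetch message a Spark 3 client should send.

    An old (pre-3.0) external shuffle service only understands OpenBlocks,
    so the client must fall back to it either when the user sets
    spark.shuffle.useOldFetchProtocol=true or when the server is old.
    """
    if use_old_fetch_protocol or not server_supports_new_protocol:
        return "OpenBlocks"            # legacy message the 2.4 service understands
    return "FetchShuffleBlocks"        # batched message introduced by SPARK-27665
```

The bug reported here is, in effect, that flipping the flag alone did not make the fallback path work against a 2.4.1 shuffle service.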
[jira] [Commented] (SPARK-29435) Spark 3 doesn't work with older shuffle service
[ https://issues.apache.org/jira/browse/SPARK-29435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950212#comment-16950212 ] Sandeep Katta commented on SPARK-29435: --- cc [~cloud_fan] [~XuanYuan] this patch has been tested by [~koert]; please help review it > Spark 3 doesn't work with older shuffle service > -- > > Key: SPARK-29435 > URL: https://issues.apache.org/jira/browse/SPARK-29435 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0 > Environment: Spark 3 from Sept 26, commit > 8beb736a00b004f97de7fcdf9ff09388d80fc548 > Spark 2.4.1 shuffle service in yarn >Reporter: koert kuipers >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29435) Spark 3 doesn't work with older shuffle service
[ https://issues.apache.org/jira/browse/SPARK-29435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949881#comment-16949881 ] Sandeep Katta commented on SPARK-29435: --- [~koert] can you please apply this [patch|https://github.com/apache/spark/pull/26095] and check? > Spark 3 doesn't work with older shuffle service > -- > > Key: SPARK-29435 > URL: https://issues.apache.org/jira/browse/SPARK-29435 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0 > Environment: Spark 3 from Sept 26, commit > 8beb736a00b004f97de7fcdf9ff09388d80fc548 > Spark 2.4.1 shuffle service in yarn >Reporter: koert kuipers >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29435) Spark 3 doesn't work with older shuffle service
[ https://issues.apache.org/jira/browse/SPARK-29435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949522#comment-16949522 ] Sandeep Katta commented on SPARK-29435: --- Yes, I am even able to reproduce it; I am looking into this issue. > Spark 3 doesn't work with older shuffle service > -- > > Key: SPARK-29435 > URL: https://issues.apache.org/jira/browse/SPARK-29435 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0 > Environment: Spark 3 from Sept 26, commit > 8beb736a00b004f97de7fcdf9ff09388d80fc548 > Spark 2.4.1 shuffle service in yarn >Reporter: koert kuipers >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-29435) Spark 3 doesn't work with older shuffle service
[ https://issues.apache.org/jira/browse/SPARK-29435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-29435: -- Comment: was deleted (was: [~koert] I will investigate this further, If I found root cause will raise PR soon) > Spark 3 doesn't work with older shuffle service > -- > > Key: SPARK-29435 > URL: https://issues.apache.org/jira/browse/SPARK-29435 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0 > Environment: Spark 3 from Sept 26, commit > 8beb736a00b004f97de7fcdf9ff09388d80fc548 > Spark 2.4.1 shuffle service in yarn >Reporter: koert kuipers >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29435) Spark 3 doesn't work with older shuffle service
[ https://issues.apache.org/jira/browse/SPARK-29435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949508#comment-16949508 ] Sandeep Katta commented on SPARK-29435: --- [~koert] I will investigate this further; if I find the root cause, I will raise a PR soon > Spark 3 doesn't work with older shuffle service > -- > > Key: SPARK-29435 > URL: https://issues.apache.org/jira/browse/SPARK-29435 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0 > Environment: Spark 3 from Sept 26, commit > 8beb736a00b004f97de7fcdf9ff09388d80fc548 > Spark 2.4.1 shuffle service in yarn >Reporter: koert kuipers >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27318) Join operation on bucketing table fails with base adaptive enabled
[ https://issues.apache.org/jira/browse/SPARK-27318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947580#comment-16947580 ] Sandeep Katta commented on SPARK-27318: --- Can you share your bucket2.txt and bucket3.txt? I tried with sample data and it works for me, so please attach your data as well. > Join operation on bucketing table fails with base adaptive enabled > -- > > Key: SPARK-27318 > URL: https://issues.apache.org/jira/browse/SPARK-27318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Supritha >Priority: Major > > Join operation on a bucketed table is failing. > Steps to reproduce the issue. > {code} > spark.sql("set spark.sql.adaptive.enabled=true") > {code} > 1. Create tables bucket3 and bucket4 as below and load the data. > {code} > sql("create table bucket3(id3 int,country3 String, sports3 String) row format > delimited fields terminated by ','").show() > sql("create table bucket4(id4 int,country4 String) row format delimited > fields terminated by ','").show() > sql("load data local inpath '/opt/abhidata/bucket2.txt' into table > bucket3").show() > sql("load data local inpath '/opt/abhidata/bucket3.txt' into table > bucket4").show() > {code} > 2. Create the bucketed tables as below > {code} > spark.sqlContext.table("bucket3").write.bucketBy(3, > "id3").saveAsTable("bucketed_table_3"); > spark.sqlContext.table("bucket4").write.bucketBy(4, > "id4").saveAsTable("bucketed_table_4"); > {code} > 3. Execute the join query on the bucketed tables > {code} > sql("select * from bucketed_table_3 join bucketed_table_4 on > bucketed_table_3.id3 = bucketed_table_4.id4").show() > {code} > > {code:java} > java.lang.IllegalArgumentException: requirement failed: > PartitioningCollection requires all of its partitionings have the same > numPartitions. 
at scala.Predef$.require(Predef.scala:224) at > org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:291) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering$1.apply(EnsureRequirements.scala:150) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering$1.apply(EnsureRequirements.scala:149) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:392) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.immutable.List.map(List.scala:296) at > org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:149) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:296) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:282) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:282) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:281) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:296) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:38) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124
[jira] [Comment Edited] (SPARK-27318) Join operation on bucketing table fails with base adaptive enabled
[ https://issues.apache.org/jira/browse/SPARK-27318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947580#comment-16947580 ] Sandeep Katta edited comment on SPARK-27318 at 10/9/19 11:27 AM: - [~Supritha] Can you share your bucket2.txt and bucket3.txt? I tried with sample data and it works for me, so please attach your data as well. was (Author: sandeep.katta2007): Can you share your bucket2.txt and bucket3.txt? I tried with sample data and it works for me, so please attach your data as well. > Join operation on bucketing table fails with base adaptive enabled > -- > > Key: SPARK-27318 > URL: https://issues.apache.org/jira/browse/SPARK-27318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Supritha >Priority: Major > > Join operation on a bucketed table is failing. > Steps to reproduce the issue: > {code} > spark.sql("set spark.sql.adaptive.enabled=true") > {code} > 1. Create tables bucket3 and bucket4 as below and load the data. > {code} > sql("create table bucket3(id3 int,country3 String, sports3 String) row format > delimited fields terminated by ','").show() > sql("create table bucket4(id4 int,country4 String) row format delimited > fields terminated by ','").show() > sql("load data local inpath '/opt/abhidata/bucket2.txt' into table > bucket3").show() > sql("load data local inpath '/opt/abhidata/bucket3.txt' into table > bucket4").show() > {code} > 2. Create the bucketed tables as below. > {code} > spark.sqlContext.table("bucket3").write.bucketBy(3, > "id3").saveAsTable("bucketed_table_3"); > spark.sqlContext.table("bucket4").write.bucketBy(4, > "id4").saveAsTable("bucketed_table_4"); > {code} > 3. 
Execute the join query on the bucketed table > {code} > sql("select * from bucketed_table_3 join bucketed_table_4 on > bucketed_table_3.id3 = bucketed_table_4.id4").show() > {code} > > {code:java} > java.lang.IllegalArgumentException: requirement failed: > PartitioningCollection requires all of its partitionings have the same > numPartitions. at scala.Predef$.require(Predef.scala:224) at > org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:291) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering$1.apply(EnsureRequirements.scala:150) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering$1.apply(EnsureRequirements.scala:149) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:392) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.immutable.List.map(List.scala:296) at > org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:149) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:296) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:282) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:282) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:281) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:296) > at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:38) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(Q
[jira] [Commented] (SPARK-27695) SELECT * returns null column when reading from Hive / ORC and spark.sql.hive.convertMetastoreOrc=true
[ https://issues.apache.org/jira/browse/SPARK-27695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947385#comment-16947385 ] Sandeep Katta commented on SPARK-27695: --- Does this JIRA apply to version 2.4 as well? > SELECT * returns null column when reading from Hive / ORC and > spark.sql.hive.convertMetastoreOrc=true > - > > Key: SPARK-27695 > URL: https://issues.apache.org/jira/browse/SPARK-27695 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 2.3.1, 2.3.2, 2.3.3 >Reporter: Oscar Cassetti >Priority: Major > > If you do > {code:java} > select * from hive.some_table{code} > and the underlying data does not exactly match the schema, the last column is > returned as null. > Example > {code:java} > from pyspark import SparkConf > from pyspark.sql import SparkSession > from pyspark.sql.types import * > conf = SparkConf().set('spark.sql.hive.convertMetastoreOrc', 'true') > spark = > SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate() > data = [{'a':i, 'b':i+10, 'd':{'a':i, 'b':i+10}} for i in range(1, 100)] > data_schema = StructType([StructField('a', LongType(), True), > StructField('b', LongType(), True), > StructField('d', MapType(StringType(), LongType(), True), True) > ]) > rdd = spark.sparkContext.parallelize(data) > df = rdd.toDF(data_schema) > df.write.format("orc").save("./sample_data/") > spark.sql("""create external table tmp( > a bigint, > b bigint, > d map<string,bigint>) > stored as orc > location 'sample_data/' > """) > spark.sql("select * from tmp").show() > {code} > This returns correctly: > {noformat} > +---+---+---+ > | a| b| d| > +---+---+---+ > | 85| 95| [a -> 85, b -> 95]| > | 86| 96| [a -> 86, b -> 96]| > | 87| 97| [a -> 87, b -> 97]| > | 88| 98| [a -> 88, b -> 98]| > | 89| 99| [a -> 89, b -> 99]| > | 90|100|[a -> 90, b -> 100]| > {noformat} > However, if we add a new column to the underlying data without altering the Hive > schema, > the last column of the Hive 
schema is set to null > {code} > from pyspark import SparkConf > from pyspark.sql import SparkSession > from pyspark.sql.types import * > conf = SparkConf().set('spark.sql.hive.convertMetastoreOrc', 'true') > spark = > SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate() > data = [{'a':i, 'b':i+10, 'c':i+5, 'd':{'a':i, 'b':i+10, 'c':i+5}} for i in > range(1, 100)] > data_schema = StructType([StructField('a', LongType(), True), > StructField('b', LongType(), True), > StructField('c', LongType(), True), > StructField('d', MapType(StringType(), > LongType(), True), True) > ]) > rdd = spark.sparkContext.parallelize(data) > df = rdd.toDF(data_schema) > df.write.format("orc").mode("overwrite").save("./sample_data/") > spark.sql("select * from tmp").show() > spark.read.orc("./sample_data/").show() > {code} > The first show() returns > {noformat} > +---+---++ > | a| b| d| > +---+---++ > | 85| 95|null| > | 86| 96|null| > | 87| 97|null| > | 88| 98|null| > | 89| 99|null| > | 90|100|null| > | 91|101|null| > | 92|102|null| > | 93|103|null| > | 94|104|null| > {noformat} > But the data on disk is correct > {noformat} > +---+---+---++ > | a| b| c| d| > +---+---+---++ > | 85| 95| 90|[a -> 85, b -> 95...| > | 86| 96| 91|[a -> 86, b -> 96...| > | 87| 97| 92|[a -> 87, b -> 97...| > | 88| 98| 93|[a -> 88, b -> 98...| > | 89| 99| 94|[a -> 89, b -> 99...| > | 90|100| 95|[a -> 90, b -> 10...| > | 91|101| 96|[a -> 91, b -> 10...| > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
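One way to picture the behaviour reported above: when the reader resolves the Hive table schema (a, b, d) against the file schema (a, b, c, d) by position rather than by name, table column `d` lines up with file column `c`, the types disagree, and the value degrades to null. The self-contained Python sketch below is illustrative only, not Spark's actual schema-resolution code; all names in it are made up for the example.

```python
# Illustrative sketch (assumption: not Spark source). The table schema knows
# columns (a, b, d); the file on disk now has (a, b, c, d). Positional
# resolution mismatches 'd', by-name resolution does not.

FILE_SCHEMA = ["a", "b", "c", "d"]            # file columns after adding 'c'
FILE_TYPES = {"a": "long", "b": "long", "c": "long", "d": "map"}
TABLE_SCHEMA = [("a", "long"), ("b", "long"), ("d", "map")]

def read_positional(row):
    out = {}
    for idx, (name, expected_type) in enumerate(TABLE_SCHEMA):
        file_col = FILE_SCHEMA[idx]           # 'd' lands on file column 'c'
        value = row[idx]
        # A type clash (map expected, long found) surfaces as null:
        out[name] = value if FILE_TYPES[file_col] == expected_type else None
    return out

def read_by_name(row):
    return {name: row[FILE_SCHEMA.index(name)] for name, _ in TABLE_SCHEMA}

row = (85, 95, 90, {"a": 85, "b": 95, "c": 90})
print(read_positional(row))  # {'a': 85, 'b': 95, 'd': None}
print(read_by_name(row))     # {'a': 85, 'b': 95, 'd': {'a': 85, 'b': 95, 'c': 90}}
```

This matches the symptom in the report: the data on disk is intact, but `select * from tmp` shows the last declared Hive column as null.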
[jira] [Commented] (SPARK-29268) Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939911#comment-16939911 ] Sandeep Katta commented on SPARK-29268: --- I found the root cause; I will raise a PR soon. > Failed to start spark-sql when using Derby metastore and isolatedLoader is > enabled > -- > > Key: SPARK-29268 > URL: https://issues.apache.org/jira/browse/SPARK-29268 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to start spark-sql when using Derby metastore and isolatedLoader is > enabled ({{spark.sql.hive.metastore.jars != builtin}}). > How to reproduce: > {code:sh} > bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf > spark.sql.hive.metastore.jars=maven > {code} > Logs: > {noformat} > ... > Caused by: java.sql.SQLException: Failed to start database 'metastore_db' > with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see > the next exception for details. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown > Source) > ... 108 more > Caused by: java.sql.SQLException: Another instance of Derby may have already > booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown > Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown > Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > ... 105 more > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. > ... 
> {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29268) Failed to start spark-sql when using Derby metastore and isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939073#comment-16939073 ] Sandeep Katta commented on SPARK-29268: --- I will work on this > Failed to start spark-sql when using Derby metastore and isolatedLoader is > enabled > -- > > Key: SPARK-29268 > URL: https://issues.apache.org/jira/browse/SPARK-29268 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to start spark-sql when using Derby metastore and isolatedLoader is > enabled({{spark.sql.hive.metastore.jars != builtin}}). > How to reproduce: > {code:sh} > bin/spark-sql --conf spark.sql.hive.metastore.version=2.1 --conf > spark.sql.hive.metastore.jars=maven > {code} > Logs: > {noformat} > ... > Caused by: java.sql.SQLException: Failed to start database 'metastore_db' > with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@15a591d9, see > the next exception for details. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown > Source) > ... 108 more > Caused by: java.sql.SQLException: Another instance of Derby may have already > booted the database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown > Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown > Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > ... 105 more > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database /root/spark-2.3.4-bin-hadoop2.7/metastore_db. > ... 
> {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
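The XSDB6 error above means a second JVM tried to boot the same embedded Derby database directory, which Derby guards with a lock file (`db.lck`). The Python sketch below is a rough analogy of that single-owner check, not Derby's implementation; the class and file names are illustrative.

```python
# Illustrative analogy (assumption: not Derby source): two "boots" of the
# same database directory fail the way two acquisitions of an exclusive
# lock file do.

import os
import tempfile

class EmbeddedDbLock:
    def __init__(self, db_dir):
        self.db_dir = db_dir
        self.lock_path = os.path.join(db_dir, "db.lck")

    def boot(self):
        # O_EXCL makes creation atomic: a second boot sees the existing
        # lock file and fails instead of corrupting shared state.
        try:
            fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
        except FileExistsError:
            raise RuntimeError(
                "Another instance of Derby may have already booted "
                "the database " + self.db_dir)

db_dir = tempfile.mkdtemp()
EmbeddedDbLock(db_dir).boot()        # first boot succeeds
try:
    EmbeddedDbLock(db_dir).boot()    # second boot fails, like ERROR XSDB6
except RuntimeError as e:
    print(e)
```

In the Spark case there is only one process, so the error suggests the isolated classloader ends up booting the embedded Derby database twice within that process.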
[jira] [Issue Comment Deleted] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-29254: -- Comment: was deleted (was: [~yumwang] I would like to work on it if you have not started ) > Failed to include jars passed in through --jars when isolatedLoader is enabled > -- > > Key: SPARK-29254 > URL: https://issues.apache.org/jira/browse/SPARK-29254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to include jars passed in through --jars when {{isolatedLoader}} is > enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce: > {code:scala} > test("SPARK-29254: include jars passed in through --jars when > isolatedLoader is enabled") { > val unusedJar = TestUtils.createJarWithClasses(Seq.empty) > val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA")) > val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB")) > val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath > val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath > val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => > j.toString).mkString(",") > val args = Seq( > "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"), > "--name", "SparkSubmitClassLoaderTest", > "--master", "local-cluster[2,1,1024]", > "--conf", "spark.ui.enabled=false", > "--conf", "spark.master.rest.enabled=false", > "--conf", "spark.sql.hive.metastore.version=3.1.2", > "--conf", "spark.sql.hive.metastore.jars=maven", > "--driver-java-options", "-Dderby.system.durability=test", > "--jars", jarsString, > unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB") > runSparkSubmit(args) > } > {code} > Logs: > {noformat} > 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in > initSerDe: java.lang.ClassNotFoundException Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> 
java.lang.ClassNotFoundException: Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294) > 2019-09-25 22:11:42.854 - stderr> at > 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:242) > 2019-09-25 22
[jira] [Commented] (SPARK-29254) Failed to include jars passed in through --jars when isolatedLoader is enabled
[ https://issues.apache.org/jira/browse/SPARK-29254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938434#comment-16938434 ] Sandeep Katta commented on SPARK-29254: --- [~yumwang] I would like to work on it if you have not started > Failed to include jars passed in through --jars when isolatedLoader is enabled > -- > > Key: SPARK-29254 > URL: https://issues.apache.org/jira/browse/SPARK-29254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Failed to include jars passed in through --jars when {{isolatedLoader}} is > enabled({{spark.sql.hive.metastore.jars != builtin}}). How to reproduce: > {code:scala} > test("SPARK-29254: include jars passed in through --jars when > isolatedLoader is enabled") { > val unusedJar = TestUtils.createJarWithClasses(Seq.empty) > val jar1 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassA")) > val jar2 = TestUtils.createJarWithClasses(Seq("SparkSubmitClassB")) > val jar3 = HiveTestJars.getHiveContribJar.getCanonicalPath > val jar4 = HiveTestJars.getHiveHcatalogCoreJar.getCanonicalPath > val jarsString = Seq(jar1, jar2, jar3, jar4).map(j => > j.toString).mkString(",") > val args = Seq( > "--class", SparkSubmitClassLoaderTest.getClass.getName.stripSuffix("$"), > "--name", "SparkSubmitClassLoaderTest", > "--master", "local-cluster[2,1,1024]", > "--conf", "spark.ui.enabled=false", > "--conf", "spark.master.rest.enabled=false", > "--conf", "spark.sql.hive.metastore.version=3.1.2", > "--conf", "spark.sql.hive.metastore.jars=maven", > "--driver-java-options", "-Dderby.system.durability=test", > "--jars", jarsString, > unusedJar.toString, "SparkSubmitClassA", "SparkSubmitClassB") > runSparkSubmit(args) > } > {code} > Logs: > {noformat} > 2019-09-25 22:11:42.854 - stderr> 19/09/25 22:11:42 ERROR log: error in > initSerDe: java.lang.ClassNotFoundException Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 
22:11:42.854 - stderr> java.lang.ClassNotFoundException: Class > org.apache.hive.hcatalog.data.JsonSerDe not found > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:84) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreUtils.getDeserializer(HiveMetaStoreUtils.java:77) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:289) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:271) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:663) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:646) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:898) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:539) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:311) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:294) > 2019-09-25 22:11:42.854 - stderr> at > 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:537) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:284) > 2019-09-25 22:11:42.854 - stderr> at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > 2019-09-25 22:11:42.854 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.s
[jira] [Commented] (SPARK-29207) Document LIST JAR in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-29207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935574#comment-16935574 ] Sandeep Katta commented on SPARK-29207: --- LIST JAR and LIST FILES are both part of the LIST command, so either JIRA can be used to fix this. > Document LIST JAR in SQL Reference > -- > > Key: SPARK-29207 > URL: https://issues.apache.org/jira/browse/SPARK-29207 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29202) --driver-java-options are not passed to driver process in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-29202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-29202: -- Description: Run the command below: ./bin/spark-sql --master yarn --driver-java-options="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=" In Spark 2.3.3 /opt/softwares/Java/jdk1.8.0_211/bin/java -cp /opt/BigdataTools/spark-2.3.3-bin-hadoop2.7/conf/:/opt/BigdataTools/spark-2.3.3-bin-hadoop2.7/jars/*:/opt/BigdataTools/hadoop-3.2.0/etc/hadoop/ -Xmx1g -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address= org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address= --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver spark-internal In Spark 3.0 /opt/softwares/Java/jdk1.8.0_211/bin/java -cp /opt/apache/git/sparkSourceCode/spark/conf/:/opt/apache/git/sparkSourceCode/spark/assembly/target/scala-2.12/jars/*:/opt/BigdataTools/hadoop-3.2.0/etc/hadoop/ org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5556 --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver spark-internal We can see that the Java options are not passed to the driver process in Spark 3. > --driver-java-options are not passed to driver process in yarn client mode > -- > > Key: SPARK-29202 > URL: https://issues.apache.org/jira/browse/SPARK-29202 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > > Run the command below: > ./bin/spark-sql --master yarn > --driver-java-options="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=" > > In Spark 2.3.3 > /opt/softwares/Java/jdk1.8.0_211/bin/java -cp > /opt/BigdataTools/spark-2.3.3-bin-hadoop2.7/conf/:/opt/BigdataTools/spark-2.3.3-bin-hadoop2.7/jars/*:/opt/BigdataTools/hadoop-3.2.0/etc/hadoop/ > -Xmx1g 
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address= > org.apache.spark.deploy.SparkSubmit --master yarn --conf > spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address= > --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver > spark-internal > > In Spark 3.0 > /opt/softwares/Java/jdk1.8.0_211/bin/java -cp > /opt/apache/git/sparkSourceCode/spark/conf/:/opt/apache/git/sparkSourceCode/spark/assembly/target/scala-2.12/jars/*:/opt/BigdataTools/hadoop-3.2.0/etc/hadoop/ > org.apache.spark.deploy.SparkSubmit --master yarn --conf > spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5556 > --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver > spark-internal > We can see that the Java options are not passed to the driver process in Spark 3. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
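The difference between the two launch commands is where the flag lands: in Spark 2.3.3 the `-agentlib:jdwp=...` option appears among the JVM arguments (before the main class), while in Spark 3.0 it survives only inside `spark.driver.extraJavaOptions`, which cannot take effect in yarn client mode because the SparkSubmit JVM, which becomes the driver, is already running. The small hypothetical check below makes that comparison concrete; the command strings are abbreviated stand-ins for the ones quoted above.

```python
# Illustrative check (not Spark code): a JVM option only takes effect if it
# appears before the main class on the java command line.

def jvm_args(cmd):
    # Everything between 'java' and the main class is a JVM argument.
    toks = cmd.split()
    main = toks.index("org.apache.spark.deploy.SparkSubmit")
    return toks[1:main]

# Abbreviated versions of the launch commands from the description:
spark233 = ("java -cp conf -Xmx1g -agentlib:jdwp=transport=dt_socket "
            "org.apache.spark.deploy.SparkSubmit --master yarn")
spark30 = ("java -cp conf "
           "org.apache.spark.deploy.SparkSubmit --master yarn "
           "--conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket")

print(any(a.startswith("-agentlib") for a in jvm_args(spark233)))  # True
print(any(a.startswith("-agentlib") for a in jvm_args(spark30)))   # False
```

Only the 2.3.3 command actually starts the debug agent; in the 3.0 command the flag never reaches the JVM arguments of the process that becomes the driver.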
[jira] [Created] (SPARK-29202) --driver-java-options are not passed to driver process in yarn client mode
Sandeep Katta created SPARK-29202: - Summary: --driver-java-options are not passed to driver process in yarn client mode Key: SPARK-29202 URL: https://issues.apache.org/jira/browse/SPARK-29202 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 3.0.0 Reporter: Sandeep Katta -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29180) drop database throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta resolved SPARK-29180. --- Resolution: Invalid The user is required to run the hive-txn-schema-2.3.0.mysql.sql script after upgrading to Hive 2.3.6 > drop database throws Exception > --- > > Key: SPARK-29180 > URL: https://issues.apache.org/jira/browse/SPARK-29180 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > Attachments: DROP_DATABASE_Exception.png, > image-2019-09-22-09-32-57-532.png > > > drop database throws an Exception, but the operation actually succeeds. > > : jdbc:hive2://10.18.19.208:23040/default> show databases; > +-+ > | databaseName | > +-+ > | db1 | > | db2 | > | default | > | func | > | gloablelimit | > | jointesthll | > | sparkdb__ | > | temp_func_test | > | test1 | > +-+ > 9 rows selected (0.131 seconds) > 0: jdbc:hive2://10.18.19.208:23040/default> drop database test1 cascade; > *Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:Unable to clean up > com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table > 'sparksql.TXN_COMPONENTS' doesn' exist* > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at com.mysql.jdbc.Util.handleNewInstance(Util.java:408) > at com.mysql.jdbc.Util.getInstance(Util.java:383) > at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1062) > at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4208) > at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4140) > at 
com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2597) > at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2758) > at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2820) > at com.mysql.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1759) > at com.mysql.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1679) > at > com.jolbox.bonecp.StatementHandle.executeUpdate(StatementHandle.java:497) > at > org.apache.hadoop.hive.metastore.txn.TxnHandler.cleanupRecords(TxnHandler.java:1888) > at > org.apache.hadoop.hive.metastore.AcidEventListener.onDropDatabase(AcidEventListener.java:51) > at > org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier$13.notify(MetaStoreListenerNotifier.java:69) > at > org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:167) > at > org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:197) > at > org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:235) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_database_core(HiveMetaStore.java:1139) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_database(HiveMetaStore.java:1175) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) > at com.sun.proxy.$Proxy31.drop_database(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropDatabase(HiveMetaStoreClient.java:868) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) > at com.sun.proxy.$Proxy32.dropDatabase(Unknown Source) > at org.apache.hadoop.hive.ql.metadata
[jira] [Commented] (SPARK-29180) drop database throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935201#comment-16935201 ] Sandeep Katta commented on SPARK-29180: --- [~abhishek.akg] I have verified this after running the above mentioned scripts !image-2019-09-22-09-32-57-532.png! Can we close this as invalid?
[jira] [Updated] (SPARK-29180) drop database throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-29180: --- Attachment: image-2019-09-22-09-32-57-532.png
[jira] [Commented] (SPARK-29180) drop database throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935199#comment-16935199 ] Sandeep Katta commented on SPARK-29180: --- [~abhishek.akg] after you upgrade to Hive 2.3.5, you need to run the following scripts: [https://github.com/apache/hive/blob/rel/release-2.3.6/metastore/scripts/upgrade/mysql/hive-schema-2.3.0.mysql.sql] and [https://github.com/apache/hive/blob/rel/release-2.3.6/metastore/scripts/upgrade/mysql/hive-txn-schema-2.3.0.mysql.sql] Please run these scripts and check again.
[jira] [Comment Edited] (SPARK-29180) drop database throws Exception
[ https://issues.apache.org/jira/browse/SPARK-29180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935194#comment-16935194 ] Sandeep Katta edited comment on SPARK-29180 at 9/22/19 3:12 AM: [~dongjoon] I found the reason, but I am not sure whether it needs to be fixed in Spark or in Hive. As per the jira description, the database used is MySQL. Hive code: https://github.com/apache/hive/blob/rel/release-2.3.6/metastore/src/java/org/apache/hadoop/hive/metastore/txn/TxnHandler.java#L1897 Hive checks for *"does not exist"* but MySQL throws an exception with *"doesn't exist"*
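The root cause described in the comment above can be sketched as a plain string-match mismatch. The class and method names below are hypothetical illustrations, not the actual Hive TxnHandler code; only the two message phrasings ("does not exist" vs. "doesn't exist") come from the report.

```java
// Minimal sketch of the mismatch reported above. treatedAsBenign() is a
// hypothetical stand-in for the check in Hive's TxnHandler.cleanupRecords,
// which (per the linked source line) tolerates a missing table only when
// the driver's message contains the exact phrase "does not exist".
public class ErrorMatchSketch {
    static boolean treatedAsBenign(String driverMessage) {
        return driverMessage.contains("does not exist");
    }

    public static void main(String[] args) {
        // MySQL Connector/J phrases the same condition as "doesn't exist",
        // so the check misses it and the cleanup surfaces the exception.
        String mysqlMessage = "Table 'sparksql.TXN_COMPONENTS' doesn't exist";
        System.out.println(treatedAsBenign(mysqlMessage)); // prints "false"
    }
}
```

Running the two metastore upgrade scripts creates TXN_COMPONENTS, so the cleanup never reaches this error path in the first place, which is why the issue was resolved as Invalid.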