[jira] [Resolved] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-30896.
----------------------------------
    Resolution: Later

> The behavior of JsonToStructs should not depend on SQLConf.get
> --------------------------------------------------------------
>
>                 Key: SPARK-30896
>                 URL: https://issues.apache.org/jira/browse/SPARK-30896
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056672#comment-17056672 ]

Hyukjin Kwon commented on SPARK-30896:
--------------------------------------

Yeah, let's not fix it for now.
[jira] [Commented] (SPARK-31113) Support DDL "SHOW VIEWS"
[ https://issues.apache.org/jira/browse/SPARK-31113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056669#comment-17056669 ]

Xin Wu commented on SPARK-31113:
--------------------------------

Sure, I'm working on this! Thanks [~smilegator]

> Support DDL "SHOW VIEWS"
> ------------------------
>
>                 Key: SPARK-31113
>                 URL: https://issues.apache.org/jira/browse/SPARK-31113
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiao Li
>            Priority: Major
>
> It would be nice to have a `SHOW VIEWS` command similar to Hive's
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews).
>
[jira] [Resolved] (SPARK-31070) make skew join split skewed partitions more evenly
[ https://issues.apache.org/jira/browse/SPARK-31070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-31070.
-----------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

> make skew join split skewed partitions more evenly
> --------------------------------------------------
>
>                 Key: SPARK-31070
>                 URL: https://issues.apache.org/jira/browse/SPARK-31070
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 3.0.0
>
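For context on what "split skewed partitions more evenly" means, here is a minimal sketch of the underlying idea: a skewed shuffle partition is made of contiguous map-output blocks, and the splitter greedily packs those blocks into sub-partitions of roughly a target size. This is an illustration only, not Spark's actual implementation; the function name and signature are hypothetical.

```python
def split_skewed_partition(block_sizes, target_size):
    """Greedily pack contiguous map-output blocks into sub-partitions,
    closing a sub-partition once it reaches the target size.

    block_sizes: per-map-task output sizes (bytes) for one shuffle partition.
    Returns a list of sub-partitions, each a list of block indices.
    """
    splits, current, current_size = [], [], 0
    for i, size in enumerate(block_sizes):
        current.append(i)
        current_size += size
        if current_size >= target_size:
            # Close this sub-partition and start a new one.
            splits.append(current)
            current, current_size = [], 0
    if current:  # leftover blocks form the last sub-partition
        splits.append(current)
    return splits

# Five 64-byte blocks with a 128-byte target split into sizes 128/128/64:
print(split_skewed_partition([64, 64, 64, 64, 64], 128))
# [[0, 1], [2, 3], [4]]
```

A greedy pass like this keeps sub-partitions contiguous (so each reader fetches a range of map outputs) while bounding how far any sub-partition exceeds the target.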
[jira] [Commented] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056636#comment-17056636 ]

Wenchen Fan commented on SPARK-30896:
-------------------------------------

More importantly, what should be the official way when we add new configs that can affect expression behavior? Shall we just store the config value in a `val`, or put it in the expression constructor? Also cc [~Gengliang.Wang]
[jira] [Commented] (SPARK-30895) The behavior of CsvToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056634#comment-17056634 ]

Wenchen Fan commented on SPARK-30895:
-------------------------------------

See https://issues.apache.org/jira/browse/SPARK-30896

> The behavior of CsvToStructs should not depend on SQLConf.get
> -------------------------------------------------------------
>
>                 Key: SPARK-30895
>                 URL: https://issues.apache.org/jira/browse/SPARK-30895
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Minor
>
[jira] [Updated] (SPARK-30895) The behavior of CsvToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-30895:
--------------------------------
    Priority: Minor  (was: Major)
[jira] [Commented] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056633#comment-17056633 ]

Wenchen Fan commented on SPARK-31099:
-------------------------------------

Is this only a problem for local hive metastore setup?

> Create migration script for metastore_db
> ----------------------------------------
>
>                 Key: SPARK-31099
>                 URL: https://issues.apache.org/jira/browse/SPARK-31099
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Gengliang Wang
>            Priority: Major
>
> When an existing Derby database exists (in ./metastore_db) created by the Hive 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile.
> Repro steps:
> 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive -Phive-thriftserver. Make sure there's no existing ./metastore_db directory in the repo.
> 2. Run bin/spark-shell, and then spark.sql("show databases"). This will populate the ./metastore_db directory, where the Derby-based metastore database is hosted. This database is populated from Hive 1.2.x.
> 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops the Hive 1.2 profile, which makes it use the default Hive 2.3 profile).
> 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby database created in Step (2), which triggers an upgrade step, and that's where the following error will be reported.
> 5. Delete the ./metastore_db and re-run Step (4). The error is no longer reported.
> {code:java}
> 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT clause was not specified or was specified as DEFAULT NULL.
> java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT clause was not specified or was specified as DEFAULT NULL.
> 	at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> 	at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
> 	at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
> 	at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
> 	at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
> 	at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
> 	at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
> 	at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
> 	at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254)
> 	at org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879)
> 	at org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830)
> 	at org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257)
> 	at org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398)
> 	at org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896)
> 	at org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119)
> 	at org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627)
> 	at org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672)
> 	at org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425)
> 	at org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865)
> 	at org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347)
> 	at org.datanucleus.store.query.Query.executeQuery(Query.java:1816)
> 	at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744)
> 	at org.datanucleus.store.query.Query.execute(Query.java:1726)
> 	at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374)
> 	at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216)
> 	at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184)
> 	at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.<init>(MetaStoreDirectSql.java:144)
> 	at org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:410)
> 	at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342)
> 	at
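Until a migration script exists, one defensive step before switching Hive profiles is to copy the local Derby directory aside so it can be restored if the in-place schema upgrade fails. A minimal sketch, assuming a ./metastore_db in the repo root as in the repro steps above; the function name and backup path are hypothetical:

```python
import shutil
from pathlib import Path

def backup_metastore(repo_dir="."):
    """Copy ./metastore_db aside before switching Hive profiles, so the
    Derby database can be restored if the schema upgrade fails."""
    src = Path(repo_dir) / "metastore_db"
    if not src.is_dir():
        return None  # nothing to back up
    dst = Path(repo_dir) / "metastore_db.hive12.bak"
    shutil.copytree(src, dst)  # raises if the backup already exists
    return dst
```

Restoring is the reverse copy after deleting the failed ./metastore_db; this only addresses the local-Derby case, not MySQL/Postgres metastores.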
[jira] [Reopened] (SPARK-30895) The behavior of CsvToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reopened SPARK-30895:
---------------------------------
[jira] [Commented] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056632#comment-17056632 ]

Wenchen Fan commented on SPARK-30896:
-------------------------------------

`JsonToStructs` already stores the config value in a `val`, so the behavior won't change after the expression is created. There are some corner cases when we transform the expression tree with a different config, but it's not a critical bug. I've updated the priority to minor. [~viirya] [~maropu] [~hyukjin.kwon] do you think it's worth fixing?
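The `val` pattern described in the comment above can be sketched outside Spark in a few lines. This is an illustration with hypothetical names, not Spark's actual `JsonToStructs` code: the point is that reading the config once at construction freezes the expression's behavior even if the global config mutates afterwards.

```python
# Hypothetical stand-in for Spark's global, mutable session config (SQLConf).
SQL_CONF = {"spark.sql.session.timeZone": "UTC"}

class JsonToStructsLike:
    """Mimics the `val` pattern: the config value is read once when the
    expression is constructed, not on every evaluation."""
    def __init__(self):
        # Frozen at creation time, like a Scala `val` initialized from SQLConf.get.
        self.time_zone = SQL_CONF["spark.sql.session.timeZone"]

    def eval(self):
        return self.time_zone

expr = JsonToStructsLike()
SQL_CONF["spark.sql.session.timeZone"] = "America/Los_Angeles"  # changed later...
print(expr.eval())  # ...but the expression still uses the captured "UTC"
```

The corner case mentioned above is visible here too: constructing a *new* expression after the config change (as a tree transformation would) captures the new value, so two copies of "the same" expression can disagree.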
[jira] [Reopened] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reopened SPARK-30896:
---------------------------------
[jira] [Updated] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-30896:
--------------------------------
    Priority: Minor  (was: Major)
[jira] [Updated] (SPARK-30893) Expressions should not change its data type/nullability after it's created
[ https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-30893:
--------------------------------
    Summary: Expressions should not change its data type/nullability after it's created  (was: Expressions should not change its data type/behavior after it's created)

> Expressions should not change its data type/nullability after it's created
> --------------------------------------------------------------------------
>
>                 Key: SPARK-30893
>                 URL: https://issues.apache.org/jira/browse/SPARK-30893
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Critical
>             Fix For: 3.0.0
>
> This is a problem because the configuration can change between different phases of planning, and this can silently break a query plan, which can lead to crashes or data corruption if the data type/nullability gets changed.
>
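The failure mode in the issue description above can be sketched concretely: an expression whose data type is re-derived from a mutable global config on every access can report one type during analysis and a different one during execution. All names and the int/double choice here are illustrative, not Spark's actual semantics.

```python
# Hypothetical mutable session config, standing in for SQLConf.get.
CONF = {"ansi.enabled": False}

class DivideExpr:
    """Buggy expression: its data type is derived from the *current*
    config on every access instead of being fixed at creation."""
    @property
    def data_type(self):
        return "int" if CONF["ansi.enabled"] else "double"

expr = DivideExpr()
analyzed_type = expr.data_type        # the planner records "double"
CONF["ansi.enabled"] = True           # config flips between planning phases
executed_type = expr.data_type        # execution now sees "int"
print(analyzed_type, executed_type)   # the plan and the runtime disagree
```

The fix direction discussed in this thread is to make such derived properties capture the config value at construction, so `data_type` stays stable for the expression's lifetime.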
[jira] [Updated] (SPARK-31113) Support DDL "SHOW VIEWS"
[ https://issues.apache.org/jira/browse/SPARK-31113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-31113:
----------------------------
    Issue Type: New Feature  (was: Bug)
[jira] [Comment Edited] (SPARK-31113) Support DDL "SHOW VIEWS"
[ https://issues.apache.org/jira/browse/SPARK-31113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056628#comment-17056628 ]

Xiao Li edited comment on SPARK-31113 at 3/11/20, 4:04 AM:
-----------------------------------------------------------

cc [~EricWu] Could you try this?

was (Author: smilegator):
cc [~EricWu]
[jira] [Commented] (SPARK-31113) Support DDL "SHOW VIEWS"
[ https://issues.apache.org/jira/browse/SPARK-31113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056628#comment-17056628 ]

Xiao Li commented on SPARK-31113:
---------------------------------

cc [~EricWu]
[jira] [Created] (SPARK-31113) Support DDL "SHOW VIEWS"
Xiao Li created SPARK-31113:
-------------------------------

             Summary: Support DDL "SHOW VIEWS"
                 Key: SPARK-31113
                 URL: https://issues.apache.org/jira/browse/SPARK-31113
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Xiao Li

It would be nice to have a `SHOW VIEWS` command similar to Hive's
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews).
[jira] [Created] (SPARK-31112) Use multiple external catalogs to speed up metastore access
deshanxiao created SPARK-31112:
----------------------------------

             Summary: Use multiple external catalogs to speed up metastore access
                 Key: SPARK-31112
                 URL: https://issues.apache.org/jira/browse/SPARK-31112
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: deshanxiao
[jira] [Resolved] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-30896.
-----------------------------
    Resolution: Won't Fix
[jira] [Resolved] (SPARK-30895) The behavior of CsvToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-30895.
-----------------------------
    Resolution: Won't Fix
[jira] [Resolved] (SPARK-30893) Expressions should not change its data type/behavior after it's created
[ https://issues.apache.org/jira/browse/SPARK-30893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-30893.
-----------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed
[jira] [Commented] (SPARK-31104) Add documentation for all the Json Functions
[ https://issues.apache.org/jira/browse/SPARK-31104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056584#comment-17056584 ]

Rakesh Raushan commented on SPARK-31104:
----------------------------------------

I am working on it.

> Add documentation for all the Json Functions
> --------------------------------------------
>
>                 Key: SPARK-31104
>                 URL: https://issues.apache.org/jira/browse/SPARK-31104
>             Project: Spark
>          Issue Type: Documentation
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Rakesh Raushan
>            Priority: Major
>
[jira] [Commented] (SPARK-31095) Upgrade netty-all to 4.1.47.Final
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056561#comment-17056561 ]

Dongjoon Hyun commented on SPARK-31095:
---------------------------------------

For `branch-2.4`, https://github.com/apache/spark/pull/27870 is created.

> Upgrade netty-all to 4.1.47.Final
> ---------------------------------
>
>                 Key: SPARK-31095
>                 URL: https://issues.apache.org/jira/browse/SPARK-31095
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.4.5, 3.0.0, 3.1.0
>            Reporter: Vishwas Vijaya Kumar
>            Assignee: Dongjoon Hyun
>            Priority: Major
>              Labels: security
>             Fix For: 3.0.0
>
> Upgrade version of io.netty_netty-all to 4.1.44.Final
> [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445]
>
[jira] [Resolved] (SPARK-31095) Upgrade netty-all to 4.1.47.Final
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-31095.
-----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 27869
[https://github.com/apache/spark/pull/27869]
[jira] [Assigned] (SPARK-31095) Upgrade netty-all to 4.1.47.Final
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-31095:
-------------------------------------
    Assignee: Dongjoon Hyun
[jira] [Updated] (SPARK-31095) Upgrade netty-all to 4.1.47.Final
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31095:
----------------------------------
    Summary: Upgrade netty-all to 4.1.47.Final  (was: Upgrade netty version to fix security vulnerabilities)
[jira] [Updated] (SPARK-31095) Upgrade netty version to fix security vulnerabilities
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31095:
----------------------------------
    Affects Version/s:     (was: 2.4.4)
                       3.1.0
                       3.0.0
[jira] [Commented] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056549#comment-17056549 ]

Kris Mok commented on SPARK-31099:
----------------------------------

Just documenting the fact that users may encounter migration issues when upgrading from earlier versions of Spark to Spark 3.0 due to the Hive profile upgrade sounds good to me. Derby migration is unlikely to be a production issue, and for other databases (MySQL / PG etc) they're heavy enough that folks would probably realize it's a Hive metastore migration issue, just like what'd happen in Hive. But the documentation should at the very least describe:
* the upgraded Hive profile
* what kind of error messages could occur
* links to Hive documentation on how to perform the upgrade

WDYT?
[jira] [Resolved] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]
[ https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro resolved SPARK-30962.
--------------------------------------
    Fix Version/s: 3.0.0
         Assignee: kevin yu
       Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/27779]

> Document ALTER TABLE statement in SQL Reference [Phase 2]
> ---------------------------------------------------------
>
>                 Key: SPARK-30962
>                 URL: https://issues.apache.org/jira/browse/SPARK-30962
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation, SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiao Li
>            Assignee: kevin yu
>            Priority: Major
>             Fix For: 3.0.0
>
> https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of ALTER TABLE statements. See the doc in preview-2:
> [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html]
>
> We should add all the supported ALTER TABLE syntax. See
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198]
>
[jira] [Commented] (SPARK-31102) spark-sql fails to parse when contains comment
[ https://issues.apache.org/jira/browse/SPARK-31102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056466#comment-17056466 ] Javier Fuentes commented on SPARK-31102: Hey [~yumwang] I am checking this. Thanks! > spark-sql fails to parse when contains comment > -- > > Key: SPARK-31102 > URL: https://issues.apache.org/jira/browse/SPARK-31102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > select > 1, > -- two > 2; > {code} > {noformat} > spark-sql> select > > 1, > > -- two > > 2; > Error in query: > mismatched input '' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', > 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', > 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', > 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', > 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', > 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', > 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', > 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', > DATABASES, 'DAY', 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', > 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', > 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', > 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', > 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', > 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', > 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'IF', 'IGNORE', 'IMPORT', > 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', > 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', > 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', > 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 
'LOCKS', 'LOGICAL', 'MACRO', 'MAP', > 'MATCHED', 'MERGE', 'MINUTE', 'MONTH', 'MSCK', 'NAMESPACE', 'NAMESPACES', > 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', > 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', > 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', > 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', > 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', > 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', > 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', > 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SECOND', 'SELECT', 'SEMI', 'SEPARATED', > 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', > 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', > 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', > 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', > 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', > 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', > 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', > 'VALUES', 'VIEW', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'YEAR', '+', '-', '*', > 'DIV', '~', STRING, BIGINT_LITERAL, SMALLINT_LITERAL, TINYINT_LITERAL, > INTEGER_VALUE, EXPONENT_VALUE, DECIMAL_VALUE, DOUBLE_LITERAL, > BIGDECIMAL_LITERAL, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 3, pos 2) > == SQL == > select > 1, > --^^^ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
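[Editor's note] The failure above suggests the CLI's statement handling interacts badly with a trailing `--` comment. As a hedged sketch of the kind of handling required (not Spark's actual implementation), a splitter must ignore `;` inside line comments and quoted strings:

```python
def split_statements(sql: str):
    """Split a SQL script on ';', ignoring ';' inside '--' line
    comments and single-quoted strings (illustrative only)."""
    stmts, buf = [], []
    in_comment = in_string = False
    i, n = 0, len(sql)
    while i < n:
        c = sql[i]
        if in_comment:
            if c == "\n":
                in_comment = False       # comment ends at newline
            buf.append(c)
        elif in_string:
            if c == "'":
                in_string = False
            buf.append(c)
        elif c == "-" and sql[i + 1:i + 2] == "-":
            in_comment = True            # start of '--' comment
            buf.append(c)
        elif c == "'":
            in_string = True
            buf.append(c)
        elif c == ";":
            stmt = "".join(buf).strip()  # statement terminator
            if stmt:
                stmts.append(stmt)
            buf = []
        else:
            buf.append(c)
        i += 1
    tail = "".join(buf).strip()
    if tail:
        stmts.append(tail)
    return stmts
```

With this, the reported query `select\n1,\n-- two\n2;` stays one statement, and a `;` inside a comment does not terminate it.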
[jira] [Commented] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056465#comment-17056465 ] Jungtaek Lim commented on SPARK-31099: -- [~dongjoon] Could you elaborate your comment "Apache Spark 3.0 also doesn't support restarting from the old streaming checkpoint."? For sure Spark 3.0 should support the old checkpoint, except some cases which we have to discard old checkpoint to fix correctness issues. > Create migration script for metastore_db > > > Key: SPARK-31099 > URL: https://issues.apache.org/jira/browse/SPARK-31099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > When an existing Derby database exists (in ./metastore_db) created by Hive > 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile. > Repro steps: > 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive > -Phive-thriftserver. Make sure there's no existing ./metastore_db directory > in the repo. > 2. Run bin/spark-shell, and then spark.sql("show databases"). This will > populate the ./metastore_db directory, where the Derby-based metastore > database is hosted. This database is populated from Hive 1.2.x. > 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops > the Hive 1.2 profile, which makes it use the default Hive 2.3 profile) > 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby > database created in Step (2), which triggers an upgrade step, and that's > where the following error will be reported. > 5. Delete the ./metastore_db and re-run Step (4). The error is no longer > reported. 
> {code:java} > 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS > ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN > ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has > been specified as NOT NULL and either the DEFAULT clause was not specified or > was specified as DEFAULT NULL. > java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column > 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT > clause was not specified or was specified as DEFAULT NULL. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830) > at > org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896) > at > org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119) > at > 
org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672) > at > org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1816) > at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744) > at org.datanucleus.store.query.Query.execute(Query.java:1726) > at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374) > at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:144) >
[jira] [Commented] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056461#comment-17056461 ] Gengliang Wang commented on SPARK-31099: [~dongjoon]Make sense. Let me close this one. Thank you. > Create migration script for metastore_db > > > Key: SPARK-31099 > URL: https://issues.apache.org/jira/browse/SPARK-31099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > When an existing Derby database exists (in ./metastore_db) created by Hive > 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile. > Repro steps: > 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive > -Phive-thriftserver. Make sure there's no existing ./metastore_db directory > in the repo. > 2. Run bin/spark-shell, and then spark.sql("show databases"). This will > populate the ./metastore_db directory, where the Derby-based metastore > database is hosted. This database is populated from Hive 1.2.x. > 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops > the Hive 1.2 profile, which makes it use the default Hive 2.3 profile) > 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby > database created in Step (2), which triggers an upgrade step, and that's > where the following error will be reported. > 5. Delete the ./metastore_db and re-run Step (4). The error is no longer > reported. > {code:java} > 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS > ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN > ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has > been specified as NOT NULL and either the DEFAULT clause was not specified or > was specified as DEFAULT NULL. 
> java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column > 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT > clause was not specified or was specified as DEFAULT NULL. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830) > at > org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896) > at > org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672) > at > org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865) > at > 
org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1816) > at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744) > at org.datanucleus.store.query.Query.execute(Query.java:1726) > at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374) > at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:144) > at > org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:410) > at > org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342) > at >
[jira] [Resolved] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-31099. Resolution: Won't Fix > Create migration script for metastore_db > > > Key: SPARK-31099 > URL: https://issues.apache.org/jira/browse/SPARK-31099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > When an existing Derby database exists (in ./metastore_db) created by Hive > 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile. > Repro steps: > 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive > -Phive-thriftserver. Make sure there's no existing ./metastore_db directory > in the repo. > 2. Run bin/spark-shell, and then spark.sql("show databases"). This will > populate the ./metastore_db directory, where the Derby-based metastore > database is hosted. This database is populated from Hive 1.2.x. > 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops > the Hive 1.2 profile, which makes it use the default Hive 2.3 profile) > 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby > database created in Step (2), which triggers an upgrade step, and that's > where the following error will be reported. > 5. Delete the ./metastore_db and re-run Step (4). The error is no longer > reported. > {code:java} > 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS > ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN > ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has > been specified as NOT NULL and either the DEFAULT clause was not specified or > was specified as DEFAULT NULL. > java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column > 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT > clause was not specified or was specified as DEFAULT NULL. 
> at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830) > at > org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896) > at > org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672) > at > org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1816) > at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744) > at 
org.datanucleus.store.query.Query.execute(Query.java:1726) > at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374) > at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:144) > at > org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:410) > at > org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342) > at > org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:303) > at >
[jira] [Updated] (SPARK-31095) Upgrade netty version to fix security vulnerabilities
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31095: -- Component/s: (was: Security) Build Fix Version/s: (was: 2.4.5) (was: 2.4.4) > Upgrade netty version to fix security vulnerabilities > - > > Key: SPARK-31095 > URL: https://issues.apache.org/jira/browse/SPARK-31095 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4, 2.4.5 >Reporter: Vishwas Vijaya Kumar >Priority: Major > Labels: security > > Upgrade version of io.netty_netty-all to 4.1.44.Final > [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31095) Upgrade netty version to fix security vulnerabilities
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31095: -- Priority: Major (was: Critical) > Upgrade netty version to fix security vulnerabilities > - > > Key: SPARK-31095 > URL: https://issues.apache.org/jira/browse/SPARK-31095 > Project: Spark > Issue Type: Improvement > Components: Security >Affects Versions: 2.4.4, 2.4.5 >Reporter: Vishwas Vijaya Kumar >Priority: Major > Labels: security > Fix For: 2.4.4, 2.4.5 > > > Upgrade version of io.netty_netty-all to 4.1.44.Final > [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31095) Upgrade netty version to fix security vulnerabilities
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31095: -- Issue Type: Bug (was: Improvement) > Upgrade netty version to fix security vulnerabilities > - > > Key: SPARK-31095 > URL: https://issues.apache.org/jira/browse/SPARK-31095 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.4.4, 2.4.5 >Reporter: Vishwas Vijaya Kumar >Priority: Major > Labels: security > Fix For: 2.4.4, 2.4.5 > > > Upgrade version of io.netty_netty-all to 4.1.44.Final > [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31095) Upgrade netty version to fix security vulnerabilities
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056444#comment-17056444 ] Dongjoon Hyun commented on SPARK-31095: --- Hi, [~visvijay]. You should not set `Fix Version`. Please see the contribution guide. - https://spark.apache.org/contributing.html > Upgrade netty version to fix security vulnerabilities > - > > Key: SPARK-31095 > URL: https://issues.apache.org/jira/browse/SPARK-31095 > Project: Spark > Issue Type: Improvement > Components: Security >Affects Versions: 2.4.4, 2.4.5 >Reporter: Vishwas Vijaya Kumar >Priority: Critical > Labels: security > Fix For: 2.4.4, 2.4.5 > > > Upgrade version of io.netty_netty-all to 4.1.44.Final > [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
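[Editor's note] Downstream users who cannot wait for a patched Spark release sometimes pin the transitive Netty dependency in their own build; a hedged Maven sketch (coordinates taken from the issue text, verify against your dependency tree):

```xml
<dependencyManagement>
  <dependencies>
    <!-- Force the patched Netty version across all transitive paths -->
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty-all</artifactId>
      <version>4.1.44.Final</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```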
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056439#comment-17056439 ] Dongjoon Hyun commented on SPARK-31098: --- Thank you, [~Gengliang.Wang]. > Reading ORC files throws IndexOutOfBoundsException > -- > > Key: SPARK-31098 > URL: https://issues.apache.org/jira/browse/SPARK-31098 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5 >Reporter: Gengliang Wang >Priority: Major > Attachments: files.tar > > > On reading the attached ORC file which contains null value in nested field, > there is such exception: > {code:java} > scala> spark.read.orc("/tmp/files/").show() > 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.ArrayIndexOutOfBoundsException: 4 > at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233) > at 
scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, > localhost, executor driver): 
java.lang.ArrayIndexOutOfBoundsException: 4 > at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123) > at >
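[Editor's note] The `ArrayIndexOutOfBoundsException: 4` in `OrcStruct.getFieldValue` is consistent with the reader schema expecting more struct fields than the file contains. A hypothetical Python sketch of the two lookup strategies (the helper names are illustrative, not Spark's actual fix):

```python
def get_field_by_ordinal(row, idx):
    # Positional access: raises once the reader schema expects a
    # field index beyond what the file actually wrote.
    return row[idx]

def get_field_by_name(row, file_schema, name):
    # Defensive lookup: a field missing from the file reads as NULL.
    try:
        return row[file_schema.index(name)]
    except (ValueError, IndexError):
        return None

file_row = ["a", "b", "c", "d"]           # 4 fields written to the file
file_schema = ["c0", "c1", "c2", "c3"]    # file's own schema
# Reader schema has a 5th field "c4" that the file lacks.
```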
[jira] [Commented] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056436#comment-17056436 ] Dongjoon Hyun commented on SPARK-31099: --- Hi, [~Gengliang.Wang] and [~smilegator] and [~cloud_fan]. This doesn't sound like what Apache Spark provides officially. If needed, the users can use the official scripts by `Apache Hive` project. In addition to that, Apache Spark 3.0 also doesn't support restarting from the old streaming checkpoint. > Create migration script for metastore_db > > > Key: SPARK-31099 > URL: https://issues.apache.org/jira/browse/SPARK-31099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > When an existing Derby database exists (in ./metastore_db) created by Hive > 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile. > Repro steps: > 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive > -Phive-thriftserver. Make sure there's no existing ./metastore_db directory > in the repo. > 2. Run bin/spark-shell, and then spark.sql("show databases"). This will > populate the ./metastore_db directory, where the Derby-based metastore > database is hosted. This database is populated from Hive 1.2.x. > 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops > the Hive 1.2 profile, which makes it use the default Hive 2.3 profile) > 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby > database created in Step (2), which triggers an upgrade step, and that's > where the following error will be reported. > 5. Delete the ./metastore_db and re-run Step (4). The error is no longer > reported. 
> {code:java} > 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS > ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN > ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has > been specified as NOT NULL and either the DEFAULT clause was not specified or > was specified as DEFAULT NULL. > java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column > 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT > clause was not specified or was specified as DEFAULT NULL. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830) > at > org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896) > at > org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119) > at > 
org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672) > at > org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1816) > at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744) > at org.datanucleus.store.query.Query.execute(Query.java:1726) > at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374) > at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184) > at >
[jira] [Updated] (SPARK-31102) spark-sql fails to parse when contains comment
[ https://issues.apache.org/jira/browse/SPARK-31102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31102: -- Target Version/s: 3.0.0 > spark-sql fails to parse when contains comment > -- > > Key: SPARK-31102 > URL: https://issues.apache.org/jira/browse/SPARK-31102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > select > 1, > -- two > 2; > {code} > {noformat} > spark-sql> select > > 1, > > -- two > > 2; > Error in query: > mismatched input '' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', > 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', > 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', > 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', > 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', > 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', > 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', > 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', > DATABASES, 'DAY', 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', > 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', > 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', > 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', > 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', > 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', > 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'IF', 'IGNORE', 'IMPORT', > 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', > 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', > 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', > 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', > 'MATCHED', 'MERGE', 
'MINUTE', 'MONTH', 'MSCK', 'NAMESPACE', 'NAMESPACES', > 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', > 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', > 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', > 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', > 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', > 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', > 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', > 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SECOND', 'SELECT', 'SEMI', 'SEPARATED', > 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', > 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', > 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', > 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', > 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', > 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', > 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', > 'VALUES', 'VIEW', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'YEAR', '+', '-', '*', > 'DIV', '~', STRING, BIGINT_LITERAL, SMALLINT_LITERAL, TINYINT_LITERAL, > INTEGER_VALUE, EXPONENT_VALUE, DECIMAL_VALUE, DOUBLE_LITERAL, > BIGDECIMAL_LITERAL, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 3, pos 2) > == SQL == > select > 1, > --^^^ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
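The statement above only fails because of the trailing line comment before the final `2;`. As a side illustration (not Spark's lexer), a naive comment stripper shows the statement is otherwise complete; note this toy version ignores `--` occurring inside string literals, which a real SQL lexer must handle:

```python
def strip_line_comments(sql: str) -> str:
    # Naive sketch: drop everything after "--" on each line.
    # A real lexer must also skip "--" inside string literals.
    return "\n".join(line.split("--", 1)[0] for line in sql.splitlines())

stmt = "select\n 1,\n -- two\n 2;"
stripped = strip_line_comments(stmt)
```

With the comment removed, the remaining tokens form a complete `SELECT` statement, which is why the parser error points at the comment position.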
[jira] [Resolved] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-30667. -- Resolution: Done > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Sarth Frey >Priority: Major > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all-gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all-gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code}
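To make the contract in the snippet above concrete, here is a toy, pure-Python model of the all-gather semantics (this is not the real BarrierTaskContext API, just an illustration that every task receives the same list, ordered by task ID):

```python
def simulate_all_gather(value_by_task_id):
    # Every task contributes one value; each task receives the full list,
    # ordered by task ID, mirroring the ordering promised in the description.
    gathered = [value_by_task_id[tid] for tid in sorted(value_by_task_id)]
    return {tid: list(gathered) for tid in value_by_task_id}

# Three "tasks" each announce an available port on their host.
ports = simulate_all_gather({0: "host0:5000", 2: "host2:5002", 1: "host1:5001"})
```

Each simulated task can then pick, say, `gathered[0]` as the coordinator address, which is exactly the kind of setup step a TensorFlow DistributionStrategy integration needs.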
[jira] [Updated] (SPARK-28594) Allow event logs for running streaming apps to be rolled over
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28594: -- Summary: Allow event logs for running streaming apps to be rolled over (was: Allow event logs for running streaming apps to be rolled over.) > Allow event logs for running streaming apps to be rolled over > - > > Key: SPARK-28594 > URL: https://issues.apache.org/jira/browse/SPARK-28594 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Stephen Levett >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > In all current Spark releases, when event logging is enabled for Spark streaming the event logs grow massively. The files continue to grow until the > application is stopped or killed. > The Spark history server then has difficulty processing the files. > https://issues.apache.org/jira/browse/SPARK-8617 > addresses .inprogress files but not the event log files of applications that are still running. > Identify a mechanism to set a "max file" size so that the file is rolled over > when it reaches this size?
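The size-based rollover requested above can be sketched in a few lines. This toy writer (illustrative only; the class name and the 100 MB default are made up and this is not Spark's actual implementation) closes the current file and opens a numbered successor once a byte threshold is passed:

```python
class RollingWriter:
    """Toy size-based rotation: writes to path.0, path.1, ... in turn."""

    def __init__(self, path, max_bytes=100 * 1024 * 1024):
        self.path, self.max_bytes, self.index = path, max_bytes, 0
        self._f = open(f"{path}.{self.index}", "w")

    def write(self, line):
        self._f.write(line)
        if self._f.tell() >= self.max_bytes:  # roll over past the threshold
            self._f.close()
            self.index += 1
            self._f = open(f"{self.path}.{self.index}", "w")

    def close(self):
        self._f.close()
```

A history server could then compact or delete the older numbered segments of a still-running application without touching the file currently held open, which is the pruning problem described in the reports above.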
[jira] [Updated] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications
[ https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22783: -- Parent: SPARK-28594 Issue Type: Sub-task (was: Bug) > event log directory(spark-history) filled by large .inprogress files for > spark streaming applications > - > > Key: SPARK-22783 > URL: https://issues.apache.org/jira/browse/SPARK-22783 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0, 2.1.0 > Environment: Linux(Generic) >Reporter: omkar kankalapati >Priority: Major > > When running long running streaming applications, the HDFS storage gets > filled up with large *.inprogress files in hdfs://spark-history/ directory > For example: > hadoop fs -du -h /spark-history > 234 /spark-history/.inprogress > 46.6 G /spark-history/.inprogress > Instead of continuing to write to a very large (multi GB) .inprogress file, > Spark should instead rotate the current log file when it reaches a size (for > example: 100 MB) or interval > and perhaps expose a configuration parameter for the size/interval. > This is also mentioned in SPARK-12140 as a concern. > It is very important and useful to support rotating the log files because > users may have limited HDFS quota and these large files consume the available > limited quota. 
> Also the users do not have a viable workaround: > 1) They cannot move the files to another location, because moving a file > causes the event logging to stop > 2) Trying to copy the .inprogress file to another location and truncate the > .inprogress file fails, because the file is still held open for writing by > EventLoggingListener: > hdfs dfs -truncate -w 0 /spark-history/.inprogress > truncate: Failed to TRUNCATE_FILE /spark-history/.inprogress > for DFSClient_NONMAPREDUCE_<#ID>on because this file lease is currently > owned by DFSClient_NONMAPREDUCE_<#ID> on > The only workaround available is to disable event logging for streaming > applications by setting "spark.eventLog.enabled" to false
[jira] [Closed] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications
[ https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-22783. -
[jira] [Updated] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28594: -- Environment: (was: This has been reported on 2.0.2.22 but affects all currently available versions.)
[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056417#comment-17056417 ] Dongjoon Hyun commented on SPARK-28594: --- I assigned this umbrella to [~kabhwan] since he leads this actively.
[jira] [Assigned] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28594: - Assignee: Jungtaek Lim
[jira] [Closed] (SPARK-29581) Enable cleanup old event log files
[ https://issues.apache.org/jira/browse/SPARK-29581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-29581. - > Enable cleanup old event log files > --- > > Key: SPARK-29581 > URL: https://issues.apache.org/jira/browse/SPARK-29581 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue can be started only once SPARK-29579 is addressed properly. > After SPARK-29579, Spark would guarantee strong compatibility on both live > entities and snapshots, which means a snapshot file could replace older origin > event log files. This issue tracks the effort to automatically clean up > old event logs once a snapshot file can replace them, which keeps the overall size of > the event log of a streaming query manageable.
[jira] [Updated] (SPARK-30860) Different behavior between rolling and non-rolling event log
[ https://issues.apache.org/jira/browse/SPARK-30860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30860: -- Parent: SPARK-28594 Issue Type: Sub-task (was: Bug) > Different behavior between rolling and non-rolling event log > > > Key: SPARK-30860 > URL: https://issues.apache.org/jira/browse/SPARK-30860 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Adam Binford >Priority: Major > > When creating a rolling event log, the application directory is created with > a call to FileSystem.mkdirs, with the file permission 770. The default > behavior of HDFS is to set the permission of a file created with > FileSystem.create or FileSystem.mkdirs to (P & ^umask), where P is the > permission in the API call and umask is a system value set by > fs.permissions.umask-mode and defaults to 0022. This means, with default > settings, any mkdirs call can have at most 755 permissions, which causes > rolling event log directories to be created with 750 permissions. This causes > the history server to be unable to prune old applications if they are not run > by the same user running the history server. > This is not a problem for non-rolling logs, because it uses > SparkHadoopUtils.createFile for Hadoop 2 backward compatibility, and then > calls FileSystem.setPermission with 770 after the file has been created. > setPermission doesn't have the umask applied to it, so this works fine. > Obviously this could be fixed by changing fs.permissions.umask-mode, but I'm > not sure the reason that's set in the first place or if this would hurt > anything else. The main issue is there is different behavior between rolling > and non-rolling event logs that might want to be updated in this repo to be > consistent across each. 
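The masking rule described above (effective permission = P & ~umask) is easy to check directly: with the default fs.permissions.umask-mode of 022, a requested 770 comes out as 750, and no mkdirs request can exceed 755:

```python
umask = 0o022                   # default fs.permissions.umask-mode
requested = 0o770               # permission passed to FileSystem.mkdirs
effective = requested & ~umask  # the group write bit is masked away
assert oct(effective) == "0o750"

# The "at most 755" ceiling from the report: even a 777 request is capped.
assert 0o777 & ~umask == 0o755
```

Because 750 denies group write, a history server running as a different user in the same group cannot delete the rolled log directory, which is the pruning failure described above; the setPermission call used for non-rolling logs bypasses the umask entirely.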
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056331#comment-17056331 ] Gengliang Wang commented on SPARK-31098: [~dongjoon] Thanks for the explanation! I am closing this issue for now. > Reading ORC files throws IndexOutOfBoundsException > -- > > Key: SPARK-31098 > URL: https://issues.apache.org/jira/browse/SPARK-31098 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5 >Reporter: Gengliang Wang >Priority: Major > Attachments: files.tar > > > On reading the attached ORC file, which contains a null value in a nested field, > the following exception is thrown: > {code:java} > scala> spark.read.orc("/tmp/files/").show() > 20/03/06 19:01:34 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.ArrayIndexOutOfBoundsException: 4 > at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234) > at >
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at 
java.lang.Thread.run(Thread.java:748) > 20/03/06 19:01:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, > localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 4 > at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133) > at > org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123) > at >
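The shape of the failure above can be reproduced without ORC at all. Here is a hypothetical toy model (all names are made up) of a reader whose schema expects a fifth field that a particular file's struct never carried, triggering the same out-of-bounds field access:

```python
reader_schema = ["a", "b", "c", "d", "e"]  # hypothetical 5-column read schema
row_from_smaller_file = [1, 2, 3, None]    # struct written with only 4 fields

def get_field_value(row, i):
    # Mirrors the array access pattern in OrcStruct.getFieldValue that raises
    # ArrayIndexOutOfBoundsException: 4 in the stack trace above.
    return row[i]

try:
    get_field_value(row_from_smaller_file, len(reader_schema) - 1)
except IndexError:
    print("field index 4 out of range for a 4-field struct")
```

The fix space is therefore about reconciling the read schema with each file's actual struct layout, rather than about null handling per se.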
[jira] [Resolved] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-31098. Resolution: Later
[jira] [Updated] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31098: -- Issue Type: Bug (was: Improvement)
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056326#comment-17056326 ] Dongjoon Hyun commented on SPARK-31098: --- Please note that the fixed behavior is not always desirable either, because the schema ends up being chosen, effectively at random, from the smallest ORC file.
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056324#comment-17056324 ] Dongjoon Hyun commented on SPARK-31098: --- It works only in the above cases (I also checked yesterday) but it broke the other test cases. If you want, you can make a PR with a complete patch. However, in general, we may end up upgrading the ORC dependency in `branch-2.4`. I'd like to hold off on the ORC dependency upgrade in `branch-2.4` because ORC changes a lot even in `1.5.x` (I sent a relevant email to the dev mailing list before). I prefer to revisit backporting after we release `Apache Spark 3.0.0` with the new ORC versions and it proves stable in many environments. > Reading ORC files throws IndexOutOfBoundsException > -- > > Key: SPARK-31098 > URL: https://issues.apache.org/jira/browse/SPARK-31098 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5 >Reporter: Gengliang Wang >Priority: Major > Attachments: files.tar > > > On reading the attached ORC file, which contains a null value in a nested field, the following exception is thrown: > {code:java} > scala> spark.read.orc("/tmp/files/").show() > java.lang.ArrayIndexOutOfBoundsException: 4 > ... (stack trace identical to the one quoted in the first SPARK-31098 comment above) ... > {code}
[jira] [Updated] (SPARK-31110) refine sql doc for SELECT
[ https://issues.apache.org/jira/browse/SPARK-31110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31110: -- Issue Type: Documentation (was: Improvement) > refine sql doc for SELECT > - > > Key: SPARK-31110 > URL: https://issues.apache.org/jira/browse/SPARK-31110 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31110) refine sql doc for SELECT
[ https://issues.apache.org/jira/browse/SPARK-31110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31110: -- Component/s: Documentation > refine sql doc for SELECT > - > > Key: SPARK-31110 > URL: https://issues.apache.org/jira/browse/SPARK-31110 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056286#comment-17056286 ] Gengliang Wang commented on SPARK-31098: [~dongjoon] Thank you so much for looking into it. I tried porting the changes https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L167-L169 and it works. Why do you think it is risky? > Reading ORC files throws IndexOutOfBoundsException > -- > > Key: SPARK-31098 > URL: https://issues.apache.org/jira/browse/SPARK-31098 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5 >Reporter: Gengliang Wang >Priority: Major > Attachments: files.tar > > > On reading the attached ORC file, which contains a null value in a nested field, the following exception is thrown: > {code:java} > scala> spark.read.orc("/tmp/files/").show() > java.lang.ArrayIndexOutOfBoundsException: 4 > ... (stack trace identical to the one quoted in the first SPARK-31098 comment above) ... > {code}
[jira] [Updated] (SPARK-30510) Publicly document options under spark.sql.*
[ https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-30510: Labels: release-notes (was: ) > Publicly document options under spark.sql.* > --- > > Key: SPARK-30510 > URL: https://issues.apache.org/jira/browse/SPARK-30510 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Assignee: Hyukjin Kwon >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, > but it doesn't appear to be documented in [the expected > place|http://spark.apache.org/docs/2.4.4/configuration.html]. In fact, none > of the options under {{spark.sql.*}} that are intended for users are > documented on spark.apache.org/docs. > We should add a new documentation page for these options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31037) refine AQE config names
[ https://issues.apache.org/jira/browse/SPARK-31037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056227#comment-17056227 ] Dongjoon Hyun commented on SPARK-31037: --- This is resolved via https://github.com/apache/spark/pull/27793 . > refine AQE config names > --- > > Key: SPARK-31037 > URL: https://issues.apache.org/jira/browse/SPARK-31037 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31037) refine AQE config names
[ https://issues.apache.org/jira/browse/SPARK-31037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31037. --- Fix Version/s: 3.0.0 Assignee: Wenchen Fan Resolution: Fixed > refine AQE config names > --- > > Key: SPARK-31037 > URL: https://issues.apache.org/jira/browse/SPARK-31037 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31111) Fix interval output issue in ExtractBenchmark
Kent Yao created SPARK-31111: Summary: Fix interval output issue in ExtractBenchmark Key: SPARK-31111 URL: https://issues.apache.org/jira/browse/SPARK-31111 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31110) refine sql doc for SELECT
Wenchen Fan created SPARK-31110: --- Summary: refine sql doc for SELECT Key: SPARK-31110 URL: https://issues.apache.org/jira/browse/SPARK-31110 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30189) Interval from year-month/date-time string handling whitespaces
[ https://issues.apache.org/jira/browse/SPARK-30189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30189. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26815 [https://github.com/apache/spark/pull/26815] > Interval from year-month/date-time string handling whitespaces > -- > > Key: SPARK-30189 > URL: https://issues.apache.org/jira/browse/SPARK-30189 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > # for pg feature parity > # for consistency with other types and other interval parsers > > {code:sql} > postgres=# select interval E'2-2\t' year to month; > interval > > 2 years 2 mons > (1 row) > postgres=# select interval E'2-2\t' year to month; > interval > > 2 years 2 mons > (1 row) > postgres=# select interval E'2-\t2\t' year to month; > ERROR: invalid input syntax for type interval: "2- 2 " > LINE 1: select interval E'2-\t2\t' year to month; > ^ > postgres=# select interval '2 00:00:01' day to second; > interval > - > 2 days 00:00:01 > (1 row) > postgres=# select interval '- 2 00:00:01' day to second; > interval > --- > -2 days +00:00:01 > (1 row) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30189) Interval from year-month/date-time string handling whitespaces
[ https://issues.apache.org/jira/browse/SPARK-30189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30189: --- Assignee: Kent Yao > Interval from year-month/date-time string handling whitespaces > -- > > Key: SPARK-30189 > URL: https://issues.apache.org/jira/browse/SPARK-30189 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > # for pg feature parity > # for consistency with other types and other interval parsers > > {code:sql} > postgres=# select interval E'2-2\t' year to month; > interval > > 2 years 2 mons > (1 row) > postgres=# select interval E'2-2\t' year to month; > interval > > 2 years 2 mons > (1 row) > postgres=# select interval E'2-\t2\t' year to month; > ERROR: invalid input syntax for type interval: "2- 2 " > LINE 1: select interval E'2-\t2\t' year to month; > ^ > postgres=# select interval '2 00:00:01' day to second; > interval > - > 2 days 00:00:01 > (1 row) > postgres=# select interval '- 2 00:00:01' day to second; > interval > --- > -2 days +00:00:01 > (1 row) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31078) outputOrdering should handle aliases correctly
[ https://issues.apache.org/jira/browse/SPARK-31078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31078. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27842 [https://github.com/apache/spark/pull/27842] > outputOrdering should handle aliases correctly > -- > > Key: SPARK-31078 > URL: https://issues.apache.org/jira/browse/SPARK-31078 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.1.0 > > > Currently, `outputOrdering` doesn't respect aliases. Thus, the following > would produce an unnecessary sort node: > {code:java} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") { > val df = (0 until 20).toDF("i").as("df") > df.repartition(8, df("i")).write.format("parquet") > .bucketBy(8, "i").sortBy("i").saveAsTable("t") > val t1 = spark.table("t") > val t2 = t1.selectExpr("i as ii") > t1.join(t2, t1("i") === t2("ii")).explain > } > {code} > would produce an unnecessary sort node: > {code:java} > == Physical Plan == > *(3) SortMergeJoin [i#8], [ii#10], Inner > :- *(1) Project [i#8] > : +- *(1) Filter isnotnull(i#8) > : +- *(1) ColumnarToRow > :+- FileScan parquet default.t[i#8] Batched: true, DataFilters: > [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., > PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: > struct, SelectedBucketsCount: 8 out of 8 > +- *(2) Sort [ii#10 ASC NULLS FIRST], false, 0 >+- *(2) Project [i#8 AS ii#10] > +- *(2) Filter isnotnull(i#8) > +- *(2) ColumnarToRow > +- FileScan parquet default.t[i#8] Batched: true, DataFilters: > [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., > PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: > struct, SelectedBucketsCount: 8 out of 8 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31078) outputOrdering should handle aliases correctly
[ https://issues.apache.org/jira/browse/SPARK-31078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31078: --- Assignee: Terry Kim > outputOrdering should handle aliases correctly > -- > > Key: SPARK-31078 > URL: https://issues.apache.org/jira/browse/SPARK-31078 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > Currently, `outputOrdering` doesn't respect aliases. Thus, the following > would produce an unnecessary sort node: > {code:java} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") { > val df = (0 until 20).toDF("i").as("df") > df.repartition(8, df("i")).write.format("parquet") > .bucketBy(8, "i").sortBy("i").saveAsTable("t") > val t1 = spark.table("t") > val t2 = t1.selectExpr("i as ii") > t1.join(t2, t1("i") === t2("ii")).explain > } > {code} > would produce an unnecessary sort node: > {code:java} > == Physical Plan == > *(3) SortMergeJoin [i#8], [ii#10], Inner > :- *(1) Project [i#8] > : +- *(1) Filter isnotnull(i#8) > : +- *(1) ColumnarToRow > :+- FileScan parquet default.t[i#8] Batched: true, DataFilters: > [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., > PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: > struct, SelectedBucketsCount: 8 out of 8 > +- *(2) Sort [ii#10 ASC NULLS FIRST], false, 0 >+- *(2) Project [i#8 AS ii#10] > +- *(2) Filter isnotnull(i#8) > +- *(2) ColumnarToRow > +- FileScan parquet default.t[i#8] Batched: true, DataFilters: > [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., > PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: > struct, SelectedBucketsCount: 8 out of 8 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31079) Add RuleExecutor metrics in Explain Formatted
[ https://issues.apache.org/jira/browse/SPARK-31079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31079: --- Assignee: Xin Wu > Add RuleExecutor metrics in Explain Formatted > - > > Key: SPARK-31079 > URL: https://issues.apache.org/jira/browse/SPARK-31079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xin Wu >Assignee: Xin Wu >Priority: Major > > RuleExecutor already supports metering for the analyzer/optimizer. Providing this information in the Explain command gives users a better experience when debugging a specific query. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31079) Add RuleExecutor metrics in Explain Formatted
[ https://issues.apache.org/jira/browse/SPARK-31079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31079. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27846 [https://github.com/apache/spark/pull/27846] > Add RuleExecutor metrics in Explain Formatted > - > > Key: SPARK-31079 > URL: https://issues.apache.org/jira/browse/SPARK-31079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xin Wu >Assignee: Xin Wu >Priority: Major > Fix For: 3.0.0 > > > RuleExecutor already supports metering for the analyzer/optimizer. Providing this information in the Explain command gives users a better experience when debugging a specific query. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31109) Add version information to the configuration of Mesos
jiaan.geng created SPARK-31109: -- Summary: Add version information to the configuration of Mesos Key: SPARK-31109 URL: https://issues.apache.org/jira/browse/SPARK-31109 Project: Spark Issue Type: Sub-task Components: Mesos Affects Versions: 3.1.0 Reporter: jiaan.geng resource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31108) Parameter cannot be passed to pandas udf of type map_iter
xge created SPARK-31108: --- Summary: Parameter cannot be passed to pandas udf of type map_iter Key: SPARK-31108 URL: https://issues.apache.org/jira/browse/SPARK-31108 Project: Spark Issue Type: Question Components: Examples Affects Versions: 3.0.0 Reporter: xge

Parameters can only be passed in the following way:

from pyspark.sql.functions import pandas_udf, PandasUDFType

def map_iter_pandas_udf_example(spark):
    strr = "abcd"
    df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

    @pandas_udf(df.schema, PandasUDFType.MAP_ITER)
    def filter_func(batch_iter, x=strr):
        print(x)
        for pdf in batch_iter:
            yield pdf[pdf.id == 1]

    df.mapInPandas(filter_func).show()

*** However, if the code is edited as follows, an error occurs: ***

from pyspark.sql.functions import pandas_udf, PandasUDFType

def map_iter_pandas_udf_example(spark):
    strr = "abcd"
    df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

    @pandas_udf(df.schema, PandasUDFType.MAP_ITER)
    def filter_func(batch_iter, x=strr):
        print(x)
        for pdf in batch_iter:
            yield pdf[pdf.id == 1]

    data = "dbca"
    df.mapInPandas(filter_func(data)).show()

*** ValueError: Invalid udf: the udf argument must be a pandas_udf of type MAP_ITER. ***

Does anyone know whether a pandas udf of type map_iter can take extra parameters, and if so, how to write the code? Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
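The error above arises because `filter_func(data)` calls the UDF instead of passing it. One common workaround (not taken from the ticket itself) is to bind the extra argument with a factory function, so that `mapInPandas` still receives a function of one argument. The sketch below demonstrates only the binding pattern in plain Python, with Spark left out entirely; `make_filter` and the list-of-dicts "batches" are illustrative stand-ins for the decorated UDF and its pandas DataFrame batches:

```python
# Binding an extra parameter via a factory function (closure), so the
# resulting function still takes only the batch iterator.  In Spark, the
# inner function would additionally be decorated with
# @pandas_udf(df.schema, PandasUDFType.MAP_ITER) and handed, uncalled,
# to df.mapInPandas(make_filter("dbca")).
def make_filter(keep_id):
    def filter_func(batch_iter):
        # keep_id is captured from the enclosing scope, not passed per call.
        for batch in batch_iter:
            yield [row for row in batch if row["id"] == keep_id]
    return filter_func

# Stand-in for an iterator of pandas DataFrames: a list of lists of dicts.
batches = [[{"id": 1, "age": 21}, {"id": 2, "age": 30}]]
result = list(make_filter(1)(batches))  # keeps only rows with id == 1
```

Because `make_filter(...)` returns a plain one-argument function, the framework never sees the extra parameter at all, which sidesteps the "must be a pandas_udf of type MAP_ITER" check.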
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055786#comment-17055786 ] Dongjoon Hyun commented on SPARK-31098: --- Hmm, [~Gengliang.Wang]. Unfortunately, this seems to be risky in `branch-2.4`. Shall we close this, since SPARK-27034 already supersedes it in 3.0? > Reading ORC files throws IndexOutOfBoundsException > -- > > Key: SPARK-31098 > URL: https://issues.apache.org/jira/browse/SPARK-31098 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5 >Reporter: Gengliang Wang >Priority: Major > Attachments: files.tar > > > On reading the attached ORC file, which contains a null value in a nested field, the following exception is thrown: > {code:java} > scala> spark.read.orc("/tmp/files/").show() > java.lang.ArrayIndexOutOfBoundsException: 4 > ... (stack trace identical to the one quoted in the first SPARK-31098 comment above) ... > {code}
[jira] [Created] (SPARK-31107) Extend FairScheduler to support pool level resource isolation
liupengcheng created SPARK-31107: Summary: Extend FairScheduler to support pool level resource isolation Key: SPARK-31107 URL: https://issues.apache.org/jira/browse/SPARK-31107 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: liupengcheng Currently, Spark only provides two types of scheduling: FIFO & FAIR, but in SQL high-concurrency scenarios a few drawbacks are exposed. FIFO: it can easily cause congestion when a large SQL query occupies all the resources. FAIR: the taskSets of one pool may occupy all the resources because there is no hard limit on the maximum usage for each pool; this case may be frequently met under high workloads. So we propose to add a maxShare argument to FairScheduler to control the maximum number of running tasks for each pool. One thing that needs attention is handling this so that the `ExecutorAllocationManager` can still release resources: e.g. suppose we have 100 executors; if tasks are scheduled on all executors with max concurrency 50, there are cases where the executors may never be idle and so cannot be released. One idea is to bind executors to each pool, and then only schedule tasks on executors of the pool they belong to. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
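The per-pool maxShare cap proposed above can be sketched as a toy scheduler loop. This is a hypothetical Python illustration of the semantics, not Spark's actual FairScheduler code; the `max_share` field is the proposed addition, and the field and function names are invented for the example.

```python
# Toy illustration of the proposed per-pool maxShare cap (not Spark code).
# Each pool may run at most max_share tasks at once; FIFO within a pool.
from collections import deque

class Pool:
    def __init__(self, name, max_share):
        self.name = name
        self.max_share = max_share   # proposed hard cap on concurrently running tasks
        self.pending = deque()
        self.running = 0

def schedule(pools, free_slots):
    """Assign pending tasks to free slots, never exceeding any pool's maxShare."""
    launched = []
    while free_slots > 0:
        progressed = False
        for pool in pools:
            if pool.pending and pool.running < pool.max_share and free_slots > 0:
                task = pool.pending.popleft()
                pool.running += 1
                free_slots -= 1
                launched.append((pool.name, task))
                progressed = True
        if not progressed:
            break
    return launched

# A greedy pool with 100 pending tasks cannot starve the other pool, and it
# stops at its cap even though free slots remain (leaving executors releasable).
big = Pool("big_sql", max_share=2)
small = Pool("interactive", max_share=2)
big.pending.extend(range(100))
small.pending.extend(range(3))
launched = schedule([big, small], free_slots=10)
```

With the caps at 2, only four tasks launch despite ten free slots, which is exactly the property that would let the remaining executors go idle and be released.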
[jira] [Commented] (SPARK-31106) Support IS_JSON
[ https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055757#comment-17055757 ] Rakesh Raushan commented on SPARK-31106: I am working on it. > Support IS_JSON > ---
[jira] [Created] (SPARK-31106) Support IS_JSON
Rakesh Raushan created SPARK-31106: -- Summary: Support IS_JSON Key: SPARK-31106 URL: https://issues.apache.org/jira/browse/SPARK-31106 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan Currently, NULL is returned when we come across invalid JSON. We should either throw an exception for invalid JSON or return false, as other DBMSs do. Since functions like `json_array_length` need to return NULL for a null array, returning NULL for invalid JSON as well might confuse users. DBMSs supporting this function are: * MySQL * SQL Server * SQLite * MariaDB
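The validity check the proposed `is_json` would perform can be sketched as a standalone Python function. This is a hypothetical illustration of the semantics the issue asks for (false on invalid input, NULL on NULL input), not the actual Spark expression:

```python
import json

def is_json(value):
    """Return True if `value` is syntactically valid JSON, False for
    invalid JSON (the 'return false' behavior the issue proposes, as in
    other DBMSs), and None to model SQL NULL for a NULL input."""
    if value is None:
        return None
    try:
        json.loads(value)
        return True
    except (ValueError, TypeError):
        # json.JSONDecodeError is a subclass of ValueError
        return False
```

Note that with this sketch a NULL input is distinguishable from an invalid document, which avoids the ambiguity the issue describes.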
[jira] [Created] (SPARK-31105) Respect sql execution id when scheduling taskSets
liupengcheng created SPARK-31105: Summary: Respect sql execution id when scheduling taskSets Key: SPARK-31105 URL: https://issues.apache.org/jira/browse/SPARK-31105 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: liupengcheng Currently, Spark sorts taskSets by jobId and stageId and then schedules them in order under the FIFO schedulingMode. In OLAP scenarios, especially under high concurrency, the taskSets usually come from different SQL queries, and with adaptive execution several jobs can be submitted at one time for a single query. But now we order those taskSets without considering the execution group, which may cause a query to be delayed. So I propose to consider the sql execution id when scheduling jobs.
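The proposal amounts to putting the SQL execution id first in the FIFO sort key, so the jobs belonging to one query stay adjacent. A toy sketch (the dict field names are illustrative, not Spark's actual internals):

```python
# Toy FIFO ordering: three taskSets, two of which belong to SQL execution 1.
task_sets = [
    {"execution_id": 2, "job_id": 5, "stage_id": 1},
    {"execution_id": 1, "job_id": 6, "stage_id": 0},
    {"execution_id": 1, "job_id": 4, "stage_id": 2},
]

# Current behavior: order by (jobId, stageId) only -- execution 1's jobs
# (4 and 6) end up separated by execution 2's job 5, delaying query 1.
current = sorted(task_sets, key=lambda t: (t["job_id"], t["stage_id"]))

# Proposed: order by execution id first, keeping one query's jobs together.
proposed = sorted(task_sets,
                  key=lambda t: (t["execution_id"], t["job_id"], t["stage_id"]))
```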
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055742#comment-17055742 ] Dongjoon Hyun commented on SPARK-31098: --- I'll make a small bug fix for this use case to prevent exception at least. > Reading ORC files throws IndexOutOfBoundsException > --
[jira] [Comment Edited] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055731#comment-17055731 ] Dongjoon Hyun edited comment on SPARK-31098 at 3/10/20, 9:26 AM: - -Hmm. It seems that there is more patches for this in addition to that. Let me dig more.- SPARK-27034 is correct. You need the following especially to backport what you want. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L167-L169 was (Author: dongjoon): ~Hmm. It seems that there is more patches for this in addition to that. Let me dig more.~ SPARK-27034 is correct. You need the following especially to backport what you want. - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L167-L169 > Reading ORC files throws IndexOutOfBoundsException > --
[jira] [Comment Edited] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055731#comment-17055731 ] Dongjoon Hyun edited comment on SPARK-31098 at 3/10/20, 9:26 AM: - ~Hmm. It seems that there is more patches for this in addition to that. Let me dig more.~ SPARK-27034 is correct. You need the following especially to backport what you want. - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L167-L169 was (Author: dongjoon): Hmm. It seems that there is more patches for this in addition to that. Let me dig more. > Reading ORC files throws IndexOutOfBoundsException > --
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055731#comment-17055731 ] Dongjoon Hyun commented on SPARK-31098: --- Hmm. It seems that there is more patches for this in addition to that. Let me dig more. > Reading ORC files throws IndexOutOfBoundsException > --
[jira] [Created] (SPARK-31104) Add documentation for all the Json Functions
Rakesh Raushan created SPARK-31104: -- Summary: Add documentation for all the Json Functions Key: SPARK-31104 URL: https://issues.apache.org/jira/browse/SPARK-31104 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055724#comment-17055724 ] Dongjoon Hyun commented on SPARK-31098: --- And, SPARK-27034 is the fix for this case in 3.0. As you see, this is `struct`. > Reading ORC files throws IndexOutOfBoundsException > --
[jira] [Created] (SPARK-31103) Extend Support for useful JSON Functions
Rakesh Raushan created SPARK-31103: -- Summary: Extend Support for useful JSON Functions Key: SPARK-31103 URL: https://issues.apache.org/jira/browse/SPARK-31103 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan Currently, Spark only supports a few JSON functions. There are many other common utility functions that are supported by other popular DBMSs. Supporting these functions will make it easier for prospective users. Also, functions like `json_array_length` and `json_object_keys` are more intuitive, and a naive user's life would be much simpler with them. I have listed some JSON functions that I am working on.
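Two of the functions named above can be sketched in plain Python to show the intended semantics. These are hypothetical illustrations; the eventual Spark expressions may differ, e.g. in exact NULL handling:

```python
import json

def json_array_length(s):
    """Length of the outermost JSON array, or None (SQL NULL) for
    NULL, invalid, or non-array input."""
    if s is None:
        return None
    try:
        v = json.loads(s)
    except ValueError:
        return None
    return len(v) if isinstance(v, list) else None

def json_object_keys(s):
    """Top-level keys of a JSON object, or None (SQL NULL) for
    NULL, invalid, or non-object input."""
    if s is None:
        return None
    try:
        v = json.loads(s)
    except ValueError:
        return None
    return list(v.keys()) if isinstance(v, dict) else None
```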
[jira] [Assigned] (SPARK-30992) Arrange scattered config of streaming module
[ https://issues.apache.org/jira/browse/SPARK-30992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30992: Assignee: jiaan.geng > Arrange scattered config of streaming module > > > Key: SPARK-30992 > URL: https://issues.apache.org/jira/browse/SPARK-30992 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > I found a lot of scattered configs in the Streaming module. > I think we should arrange these configs in a unified place. >
[jira] [Resolved] (SPARK-30992) Arrange scattered config of streaming module
[ https://issues.apache.org/jira/browse/SPARK-30992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30992. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27744 [https://github.com/apache/spark/pull/27744] > Arrange scattered config of streaming module
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055712#comment-17055712 ] Gengliang Wang commented on SPARK-31098: [~dongjoon] Yes, one of the files missing the column `a5` > Reading ORC files throws IndexOutOfBoundsException > --
[jira] [Closed] (SPARK-30784) Hive 2.3 profile should still use orc-nohive
[ https://issues.apache.org/jira/browse/SPARK-30784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-30784. - > Hive 2.3 profile should still use orc-nohive > > > Key: SPARK-30784 > URL: https://issues.apache.org/jira/browse/SPARK-30784 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yin Huai >Priority: Critical > > Originally reported at > [https://github.com/apache/spark/pull/26619#issuecomment-583802901] > > Right now, Hive 2.3 profile pulls in regular orc, which depends on > hive-storage-api. However, hive-storage-api and hive-common have the > following common class files > > org/apache/hadoop/hive/common/ValidReadTxnList.class > org/apache/hadoop/hive/common/ValidTxnList.class > org/apache/hadoop/hive/common/ValidTxnList$RangeResponse.class > For example, > [https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] > (pulled in by orc 1.5.8) and > [https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] > (from hive-common 2.3.6) both are in the classpath and they are different. > Having both versions in the classpath can cause unexpected behavior due to > classloading order. We should still use orc-nohive, which has > hive-storage-api shaded. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
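The duplicate-class hazard described above can be checked mechanically before it bites at runtime. Below is a minimal sketch (pure Python, hypothetical helper name, not part of Spark's build tooling) that scans a set of jars and reports the `.class` entries present in more than one of them, which is exactly the condition that makes classloading order matter:

```python
import zipfile

def duplicate_classes(jar_paths):
    """Map each .class entry found in more than one jar to the list of
    jars that contain it. Entries such as
    org/apache/hadoop/hive/common/ValidReadTxnList.class appearing in
    two different jars signal a classloading-order hazard."""
    seen = {}  # class entry -> list of jars containing it
    for path in jar_paths:
        with zipfile.ZipFile(path) as jar:
            for name in jar.namelist():
                if name.endswith(".class"):
                    seen.setdefault(name, []).append(path)
    return {name: jars for name, jars in seen.items() if len(jars) > 1}
```

Running such a check over the assembled classpath would have flagged the hive-storage-api / hive-common overlap directly.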
[jira] [Updated] (SPARK-30443) "Managed memory leak detected" even with no calls to take() or limit()
[ https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-30443: -- Affects Version/s: 3.0.0 > "Managed memory leak detected" even with no calls to take() or limit() > -- > > Key: SPARK-30443 > URL: https://issues.apache.org/jira/browse/SPARK-30443 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.4, 3.0.0 >Reporter: Luke Richter >Priority: Major > Attachments: a.csv.zip, b.csv.zip, c.csv.zip > > > Our Spark code is causing a "Managed memory leak detected" warning to appear, > even though we are not calling take() or limit(). > According to SPARK-14168 https://issues.apache.org/jira/browse/SPARK-14168 > managed memory leaks should only be caused by not reading an iterator to > completion, e.g. after calling take() or limit(). > Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed > memory leak detected; size = 2097152 bytes, TID = 118" > The size of the managed memory leak is always 2MB. 
> I have created a minimal test program that reproduces the warning:
> {code:python}
> import pyspark.sql
> import pyspark.sql.functions as fx
>
> def main():
>     builder = pyspark.sql.SparkSession.builder
>     builder = builder.appName("spark-jira")
>     spark = builder.getOrCreate()
>
>     reader = spark.read
>     reader = reader.format("csv")
>     reader = reader.option("inferSchema", "true")
>     reader = reader.option("header", "true")
>
>     table_c = reader.load("c.csv")
>     table_a = reader.load("a.csv")
>     table_b = reader.load("b.csv")
>
>     primary_filter = fx.col("some_code").isNull()
>     new_primary_data = table_a.filter(primary_filter)
>     new_ids = new_primary_data.select("some_id")
>
>     new_data = table_b.join(new_ids, "some_id")
>     new_data = new_data.select("some_id")
>
>     result = table_c.join(new_data, "some_id", "left")
>     result.repartition(1).write.json("results.json", mode="overwrite")
>     spark.stop()
>
> if __name__ == "__main__":
>     main()
> {code}
> Our code isn't anything out of the ordinary, just some filters, selects and joins.
> The input data is made up of 3 CSV files. The input data files are quite large, roughly 2.6GB in total uncompressed. I attempted to reduce the number of rows in the CSV input files but this caused the warning to no longer appear. After compressing the files I was able to attach them below.
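To make the warning's mechanics concrete, the sketch below (hypothetical names, a conceptual model rather than Spark's actual TaskMemoryManager) shows how execution memory acquired by an operator is normally released when its iterator is exhausted, and how an early exit of the kind SPARK-14168 describes leaves a page still held when the task finishes:

```python
class TaskMemoryTracker:
    """Conceptual sketch, not Spark's real TaskMemoryManager: operators
    acquire execution memory and release it when their iterator is fully
    consumed; anything still held at task end is reported as a leak."""
    def __init__(self):
        self.acquired = 0

    def acquire(self, nbytes):
        self.acquired += nbytes

    def release(self, nbytes):
        self.acquired -= nbytes

    def finish_task(self):
        # Mirrors the shape of the Executor warning in the report above.
        if self.acquired > 0:
            print(f"Managed memory leak detected; size = {self.acquired} bytes")
        return self.acquired

def consume(tracker, rows, limit=None):
    """Yield up to `limit` rows; stopping early (like take()/limit())
    skips the release that normally happens on exhaustion."""
    tracker.acquire(2 * 1024 * 1024)  # e.g. one 2MB page for a buffer
    for i, row in enumerate(rows):
        if limit is not None and i >= limit:
            return  # early exit: the 2MB page is never released
        yield row
    tracker.release(2 * 1024 * 1024)  # normal path: iterator exhausted
```

Note the leaked size here is 2MB by construction; in the report above the constant 2MB size similarly points at a single unreleased memory page rather than data-dependent growth.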
[jira] [Comment Edited] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055696#comment-17055696 ] Dongjoon Hyun edited comment on SPARK-31098 at 3/10/20, 8:37 AM: - I guess your expectation is the behavior of `mergeSchema`, isn't it? One file is missing column `a5`. was (Author: dongjoon): I guess your expectation is the behavior of `mergeSchema`, isn't it?
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055696#comment-17055696 ] Dongjoon Hyun commented on SPARK-31098: --- I guess your expectation is the behavior of `mergeSchema`, isn't it?
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055693#comment-17055693 ] Dongjoon Hyun commented on SPARK-31098: --- Hi, [~Gengliang.Wang]. First of all, even in 3.0.0, the result schema depends on the order of the files (technically, on the size of the files, because Spark sorts them).
{code}
scala> spark.read.orc("/tmp/orc/2019-01-02", "/tmp/orc/2019-01-01").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- a1: integer (nullable = true)
 |    |-- a2: string (nullable = true)
 |    |-- a3: timestamp (nullable = true)
 |    |-- a4: string (nullable = true)
 |    |-- a5: integer (nullable = true)
 |-- b: struct (nullable = true)
 |    |-- b1: integer (nullable = true)
 |    |-- b2: string (nullable = true)

scala> spark.read.orc("/tmp/orc/2019-01-01", "/tmp/orc/2019-01-02").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- a1: integer (nullable = true)
 |    |-- a2: string (nullable = true)
 |    |-- a3: timestamp (nullable = true)
 |    |-- a4: string (nullable = true)
 |-- b: struct (nullable = true)
 |    |-- b1: integer (nullable = true)
 |    |-- b2: string (nullable = true)

scala> spark.version
res11: String = 3.0.0-preview2
{code}
So, to be consistent, `mergeSchema` is the only solution.
{code}
scala> spark.read.option("mergeSchema", "true").orc("/tmp/orc/2019-01-01", "/tmp/orc/2019-01-02").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- a1: integer (nullable = true)
 |    |-- a2: string (nullable = true)
 |    |-- a3: timestamp (nullable = true)
 |    |-- a4: string (nullable = true)
 |    |-- a5: integer (nullable = true)
 |-- b: struct (nullable = true)
 |    |-- b1: integer (nullable = true)
 |    |-- b2: string (nullable = true)
{code}
[jira] [Updated] (SPARK-11412) Support merge schema for ORC
[ https://issues.apache.org/jira/browse/SPARK-11412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-11412: -- Affects Version/s: 2.3.4 2.4.5 > Support merge schema for ORC > > > Key: SPARK-11412 > URL: https://issues.apache.org/jira/browse/SPARK-11412 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.0, 2.1.1, 2.2.0, 2.3.4, 2.4.5 >Reporter: Dave >Assignee: EdisonWang >Priority: Major > Fix For: 3.0.0 > >
> When I tried to load partitioned ORC files with a slight difference in a nested column, say:
> -- request: struct (nullable = true)
>  |-- datetime: string (nullable = true)
>  |-- host: string (nullable = true)
>  |-- ip: string (nullable = true)
>  |-- referer: string (nullable = true)
>  |-- request_uri: string (nullable = true)
>  |-- uri: string (nullable = true)
>  |-- useragent: string (nullable = true)
> And then there's a page_url_lists attribute in the later partitions.
> I tried to use
> val s = sqlContext.read.format("orc").option("mergeSchema", "true").load("/data/warehouse/")
> to load the data. But the schema doesn't show request.page_url_lists.
> I am wondering if schema merge doesn't work for ORC?
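Independently of Spark, the effect of `mergeSchema` can be sketched as a recursive union of the per-file schemas, so the resulting field set no longer depends on which file happens to be read first. The helper below is a simplified illustration (schemas modeled as nested dicts, not Spark's StructType, and conflict handling reduced to a single error):

```python
def merge_schemas(left, right):
    """Union of two schemas, each a dict mapping field name to either a
    type string or a nested dict (a struct). Fields from `left` come
    first, then fields only in `right`; structs are merged recursively,
    mirroring the idea behind Spark's mergeSchema option (a sketch, not
    Spark's implementation)."""
    merged = {}
    for name, ltype in left.items():
        rtype = right.get(name)
        if isinstance(ltype, dict) and isinstance(rtype, dict):
            merged[name] = merge_schemas(ltype, rtype)  # recurse into structs
        elif rtype is not None and rtype != ltype:
            raise ValueError(f"incompatible types for {name}: {ltype} vs {rtype}")
        else:
            merged[name] = ltype
    for name, rtype in right.items():
        if name not in merged:  # fields present only in the right schema
            merged[name] = rtype
    return merged
```

With two per-file schemas differing only in a nested field (like the `a5` case above), merging in either order yields the same field set, whereas simply taking the first file's schema does not.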
[jira] [Commented] (SPARK-31098) Reading ORC files throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-31098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055672#comment-17055672 ] Dongjoon Hyun commented on SPARK-31098: --- Thank you for pinging me and the file. Let me take a look.
[jira] [Resolved] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()
[ https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31065. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27854 [https://github.com/apache/spark/pull/27854] > Empty string values cause schema_of_json() to return a schema not usable by > from_json() > --- > > Key: SPARK-31065 > URL: https://issues.apache.org/jira/browse/SPARK-31065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0 >Reporter: Nicholas Chammas >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > Here's a reproduction: > > {code:python} > from pyspark.sql.functions import from_json, schema_of_json > json = '{"a": ""}' > df = spark.createDataFrame([(json,)], schema=['json']) > df.show() > # chokes with org.apache.spark.sql.catalyst.parser.ParseException > json_schema = schema_of_json(json) > df.select(from_json('json', json_schema)) > # works fine > json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema > df.select(from_json('json', json_schema)) > {code} > The output: > {code:java} > >>> from pyspark.sql.functions import from_json, schema_of_json > >>> json = '{"a": ""}' > >>> > >>> df = spark.createDataFrame([(json,)], schema=['json']) > >>> df.show() > +-+ > | json| > +-+ > |{"a": ""}| > +-+ > >>> > >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException > >>> json_schema = schema_of_json(json) > >>> df.select(from_json('json', json_schema)) > Traceback (most recent call last): > File ".../site-packages/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File > ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.functions.from_json. 
> : org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', > 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', > 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', > 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', > 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', > 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', > 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', > 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', > 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', > 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', > 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', > 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', > 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', > 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', > 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', > 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', > 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', > 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', > 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', > 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', > 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', > 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', > 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', > 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
'PARTITIONED', > 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', > 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', > 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', > 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6) > == SQL == > struct > --^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64) > at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123) > at >
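The parse failure above suggests that the empty string was inferred as a field with no usable type, producing DDL that from_json() cannot parse back. The sketch below (pure Python, hypothetical infer_ddl helper, not Spark's inference code) illustrates why falling back to string for empty and null values keeps the generated DDL parseable:

```python
import json

def infer_ddl(sample: str) -> str:
    """Infer a Spark-style DDL struct string from one JSON record.
    Mapping "" or null to an empty/NullType-like type would yield DDL
    such as 'struct<a:>', which a DDL parser rejects; falling back to
    'string' keeps the schema usable."""
    def type_of(v):
        if isinstance(v, bool):  # check bool before int (bool is an int subclass)
            return "boolean"
        if isinstance(v, int):
            return "bigint"
        if isinstance(v, float):
            return "double"
        if isinstance(v, dict):
            return "struct<" + ",".join(f"{k}:{type_of(x)}" for k, x in v.items()) + ">"
        # "" and null both fall back to string instead of an empty type
        return "string"
    return type_of(json.loads(sample))
```

The workaround in the report, inferring the schema via spark.read.json over the raw strings, sidesteps the problem the same way: the full reader resolves empty values to a concrete type before any DDL round-trip.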
[jira] [Assigned] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()
[ https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31065: - Assignee: Hyukjin Kwon > Empty string values cause schema_of_json() to return a schema not usable by > from_json() > --- > > Key: SPARK-31065 > URL: https://issues.apache.org/jira/browse/SPARK-31065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0 >Reporter: Nicholas Chammas >Assignee: Hyukjin Kwon >Priority: Minor > > Here's a reproduction: > > {code:python} > from pyspark.sql.functions import from_json, schema_of_json > json = '{"a": ""}' > df = spark.createDataFrame([(json,)], schema=['json']) > df.show() > # chokes with org.apache.spark.sql.catalyst.parser.ParseException > json_schema = schema_of_json(json) > df.select(from_json('json', json_schema)) > # works fine > json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema > df.select(from_json('json', json_schema)) > {code} > The output: > {code:java} > >>> from pyspark.sql.functions import from_json, schema_of_json > >>> json = '{"a": ""}' > >>> > >>> df = spark.createDataFrame([(json,)], schema=['json']) > >>> df.show() > +-+ > | json| > +-+ > |{"a": ""}| > +-+ > >>> > >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException > >>> json_schema = schema_of_json(json) > >>> df.select(from_json('json', json_schema)) > Traceback (most recent call last): > File ".../site-packages/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File > ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.functions.from_json. 
> : org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', > 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', > 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', > 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', > 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', > 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', > 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', > 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', > 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', > 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', > 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', > 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', > 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', > 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', > 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', > 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', > 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', > 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', > 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', > 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', > 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', > 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', > 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', > 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
'PARTITIONED', > 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', > 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', > 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', > 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6) > == SQL == > struct > --^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64) > at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123) > at > org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777) > at >
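The failure above comes from `schema_of_json` inferring a schema from a single sample in which the empty string carries no type evidence, yielding a DDL string the parser rejects. A minimal pure-Python sketch of single-sample inference (illustrative only; this is not Spark's actual inference code, and `schema_of_json_sketch` is a hypothetical name) shows why a deliberate fallback type is needed:

```python
import json

def infer_type(value):
    # Naive single-sample inference. Check bool before int because
    # isinstance(True, int) is True in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    # An empty string gives no type evidence; defaulting to "string"
    # keeps the emitted DDL parseable instead of producing an empty type.
    return "string"

def schema_of_json_sketch(sample):
    fields = json.loads(sample)
    cols = ", ".join(f"{k} {infer_type(v)}" for k, v in fields.items())
    return f"struct<{cols}>"

print(schema_of_json_sketch('{"a": ""}'))  # struct<a string>
```

The workaround in the repro, inferring the schema with `spark.read.json` over the whole dataset instead of `schema_of_json` on one sample, sidesteps the problem by aggregating type evidence across rows.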
[jira] [Commented] (SPARK-30707) Lead/Lag window function throws AnalysisException without ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-30707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055642#comment-17055642 ] angerszhu commented on SPARK-30707: --- Added a PR: [https://github.com/apache/spark/pull/27861] > Lead/Lag window function throws AnalysisException without ORDER BY clause > - > > Key: SPARK-30707 > URL: https://issues.apache.org/jira/browse/SPARK-30707 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > Lead/Lag window function throws AnalysisException without ORDER BY clause: > {code:java} > SELECT lead(ten, four + 1) OVER (PARTITION BY four), ten, four > FROM (SELECT * FROM tenk1 WHERE unique2 < 10 ORDER BY four, ten)s > org.apache.spark.sql.AnalysisException > Window function lead(ten#x, (four#x + 1), null) requires window to be > ordered, please add ORDER BY clause. For example SELECT lead(ten#x, (four#x + > 1), null)(value_expr) OVER (PARTITION BY window_partition ORDER BY > window_ordering) from table; > {code} > > Maybe we need to fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
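The AnalysisException above points at the underlying semantics: LEAD and LAG are offset functions, so "the next row" is only well defined once the rows within each partition have an ordering. A minimal pure-Python model of LEAD (illustrative names only, not Spark code) makes the dependency on ordering explicit:

```python
def lead(rows, key, offset=1, default=None, order_by=None):
    # LEAD peeks `offset` rows ahead within a partition. Without a
    # well-defined ordering the "next row" is arbitrary, which is why
    # Spark rejects LEAD/LAG over an unordered window frame.
    if order_by is not None:
        rows = sorted(rows, key=order_by)
    return [
        rows[i + offset][key] if i + offset < len(rows) else default
        for i in range(len(rows))
    ]

partition = [{"four": 0, "ten": 4}, {"four": 0, "ten": 0}, {"four": 0, "ten": 2}]
# With an ordering, the result is deterministic: each row sees the
# next `ten` value in sorted order.
print(lead(partition, "ten", order_by=lambda r: r["ten"]))  # [2, 4, None]
```

For the reported query, adding an ordering inside the OVER clause, e.g. `OVER (PARTITION BY four ORDER BY ten)`, satisfies the requirement the error message describes.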
[jira] [Resolved] (SPARK-31096) Replace `Array` with `Seq` in AQE `CustomShuffleReaderExec`
[ https://issues.apache.org/jira/browse/SPARK-31096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31096. - Fix Version/s: 3.0.0 Assignee: Wei Xue Resolution: Fixed > Replace `Array` with `Seq` in AQE `CustomShuffleReaderExec` > --- > > Key: SPARK-31096 > URL: https://issues.apache.org/jira/browse/SPARK-31096 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Wei Xue >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6
[ https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29183: -- Summary: Upgrade JDK 11 Installation to 11.0.6 (was: Upgrade JDK 11 Installation to 11.0.4) > Upgrade JDK 11 Installation to 11.0.6 > - > > Key: SPARK-29183 > URL: https://issues.apache.org/jira/browse/SPARK-29183 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Every JDK 11.0.x release has many fixes, including performance regression > fixes. We had better upgrade to the latest, 11.0.4. > - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29199) Add linters and license/dependency checkers to GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-29199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29199. --- Fix Version/s: 3.0.0 Assignee: Dongjoon Hyun Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/25879 > Add linters and license/dependency checkers to GitHub Action > > > Key: SPARK-29199 > URL: https://issues.apache.org/jira/browse/SPARK-29199 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org